
Disclaimer: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38280094/


Python requests with multithreading

python, multithreading, asynchronous, python-requests, gevent

Asked by krypt

I've been trying to build a scraper with multithreading functionality for the past two days. Somehow I still couldn't manage it. At first I tried the regular multithreading approach with the threading module, but it wasn't faster than using a single thread. Later I learnt that requests is blocking and the multithreading approach isn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?

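(For reference: the "regular multithreading approach with the threading module" mentioned above might look roughly like the sketch below, using concurrent.futures.ThreadPoolExecutor; the URL list and worker count are placeholders, not the asker's actual code.)

from concurrent.futures import ThreadPoolExecutor

import requests

urls = ['http://www.example.com'] * 10  # placeholder URLs

def fetch(url):
    # requests releases the GIL while blocked on network I/O,
    # so multiple threads can wait on responses in parallel
    return requests.get(url).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    print(list(pool.map(fetch, urls)))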

Here is the relevant part of my class:


import gevent.monkey
gevent.monkey.patch_all()  # must run before requests is imported, so the socket/ssl modules get patched

from gevent.pool import Pool
import requests

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        # self.headers, self.logger and self.doSomething are defined elsewhere in the class
        try:
            response = self.session.get(url, headers=self.headers)
        except Exception:
            self.logger.error('Problem: %s', url, exc_info=True)
            return  # don't use `response` if the request failed

        self.doSomething(response)

    def run(self):  # renamed from `async`, which is a reserved word in Python 3.7+
        for url in self.urls:
            self.pool.spawn(self.fetch, url)

        self.pool.join()

test = Test()
test.run()

Answered by Will

Install the grequests module, which works with gevent (requests is not designed for async):


pip install grequests

Then change the code to something like this:


import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        # called by grequests.map for every request that fails
        print("Problem: {}: {}".format(request.url, exception))

    def run(self):  # renamed from `async`, which is a reserved word in Python 3.7+
        # issue up to `size` requests concurrently and wait for them all
        results = grequests.map((grequests.get(u) for u in self.urls),
                                exception_handler=self.exception,
                                size=5)
        print(results)

test = Test()
test.run()
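A detail worth noting about this API: grequests.map returns the responses in the same order the requests were given, with None standing in for any request that failed, so the results usually need a None guard. A minimal self-contained sketch (the URLs are placeholders):

import grequests

urls = ['http://www.example.com', 'http://www.google.com']

# map() preserves order; a failed request shows up as None in the list
results = grequests.map(grequests.get(u) for u in urls)
for url, response in zip(urls, results):
    if response is not None:
        print(url, response.status_code)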

This is officially recommended by the requests project:


Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.


Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.

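For completeness, the requests-futures project mentioned in the quoted docs takes the thread-pool route instead of gevent. A minimal sketch with a placeholder URL (requires pip install requests-futures):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)  # requests dispatched on a ThreadPoolExecutor
future = session.get('http://www.example.com')  # returns a Future immediately
response = future.result()  # blocks until this particular response arrives
print(response.status_code)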