
Disclaimer: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38280094/


Python requests with multithreading

python, multithreading, asynchronous, python-requests, gevent

Asked by krypt

I've been trying to build a scraper with multithreading functionality for the past two days. Somehow I still couldn't manage it. At first I tried the regular multithreading approach with the threading module, but it wasn't faster than using a single thread. Later I learnt that requests is blocking and the multithreading approach isn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?

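(For reference: the "regular multithreading approach with the threading module" mentioned above might look roughly like the sketch below, using concurrent.futures.ThreadPoolExecutor; the URL list and worker count are placeholders, not the asker's actual code.)

from concurrent.futures import ThreadPoolExecutor

import requests

urls = ['http://www.example.com'] * 10  # placeholder URLs

def fetch(url):
    # requests releases the GIL while blocked on network I/O,
    # so multiple threads can wait on responses in parallel
    return requests.get(url).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    print(list(pool.map(fetch, urls)))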

Here is the relevant part of my class:


import gevent.monkey
gevent.monkey.patch_all()  # must run before requests is imported, so the socket/ssl modules get patched

from gevent.pool import Pool
import requests

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        # self.headers, self.logger and self.doSomething are defined elsewhere in the class
        try:
            response = self.session.get(url, headers=self.headers)
        except Exception:
            self.logger.error('Problem: %s', url, exc_info=True)
            return  # don't use `response` if the request failed

        self.doSomething(response)

    def run(self):  # renamed from `async`, which is a reserved word in Python 3.7+
        for url in self.urls:
            self.pool.spawn(self.fetch, url)

        self.pool.join()

test = Test()
test.run()

Answered by Will

Install the grequests module, which works with gevent (requests is not designed for async):


pip install grequests

Then change the code to something like this:


import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        # called by grequests.map for every request that fails
        print("Problem: {}: {}".format(request.url, exception))

    def run(self):  # renamed from `async`, which is a reserved word in Python 3.7+
        # issue up to `size` requests concurrently and wait for them all
        results = grequests.map((grequests.get(u) for u in self.urls),
                                exception_handler=self.exception,
                                size=5)
        print(results)

test = Test()
test.run()
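A detail worth noting about this API: grequests.map returns the responses in the same order the requests were given, with None standing in for any request that failed, so the results usually need a None guard. A minimal self-contained sketch (the URLs are placeholders):

import grequests

urls = ['http://www.example.com', 'http://www.google.com']

# map() preserves order; a failed request shows up as None in the list
results = grequests.map(grequests.get(u) for u in urls)
for url, response in zip(urls, results):
    if response is not None:
        print(url, response.status_code)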

This is officially recommended by the requests project:


Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.


Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.

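For completeness, the requests-futures project mentioned in the quoted docs takes the thread-pool route instead of gevent. A minimal sketch with a placeholder URL (requires pip install requests-futures):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)  # requests dispatched on a ThreadPoolExecutor
future = session.get('http://www.example.com')  # returns a Future immediately
response = future.result()  # blocks until this particular response arrives
print(response.status_code)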