Python requests with multithreading

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/38280094/
Asked by krypt
I've been trying to build a scraper with multithreading functionality for the past two days. Somehow I still couldn't manage it. At first I tried the regular multithreading approach with the threading module, but it wasn't faster than using a single thread. Later I learned that requests is blocking and the multithreading approach wasn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?
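For context, a regular threading-module approach like the one described might look something like the sketch below (hypothetical code, not the asker's; the URL list is a placeholder):

import threading
import requests

urls = ['http://www.example.com'] * 10  # placeholder URLs

def fetch(url):
    # each thread performs a blocking GET; the network waits can overlap across threads
    response = requests.get(url)
    return response

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()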
Here is the relevant part of my class:
import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except:
            self.logger.error('Problem: ', id, exc_info=True)
        self.doSomething(response)

    def async(self):
        for url in self.urls:
            self.pool.spawn(self.fetch, url)
        self.pool.join()

test = Test()
test.async()
Answered by Will
Install the grequests module, which works with gevent (requests is not designed for async):

pip install grequests
Then change the code to something like this:
import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print "Problem: {}: {}".format(request.url, exception)

    def async(self):
        results = grequests.map((grequests.get(u) for u in self.urls),
                                exception_handler=self.exception, size=5)
        print results

test = Test()
test.async()
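A note on the design: size=5 caps the number of requests in flight at any one time, and exception_handler is called for each request that fails instead of raising. grequests.map returns the responses in the same order as the input URLs, with None in place of any request the handler swallowed.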
This is officially recommended by the requests project:
Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
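As an aside, the streaming feature mentioned in the quote looks roughly like this minimal sketch (the URL and chunk size are placeholders):

import requests

# stream=True defers downloading the body until it is explicitly read
response = requests.get('http://www.example.com', stream=True)
data = b''
for chunk in response.iter_content(chunk_size=1024):
    data += chunk  # each iteration blocks only for this chunk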
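For completeness, requests-futures, the other project named in the quote, runs requests on a thread pool behind a futures interface. A minimal sketch, assuming it is installed via pip install requests-futures:

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)  # thread pool of 5 workers
futures = [session.get(u) for u in ['http://www.example.com',
                                    'http://www.google.com']]
for future in futures:
    response = future.result()  # blocks until that request completes
    print response.status_code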
Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs. 3.852s with your original method.
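The answer does not show how those timings were taken; a minimal way to reproduce that kind of comparison (placeholder URLs, not the original benchmark) is:

import time
import grequests

urls = ['http://www.example.com'] * 10  # placeholder list of 10 URLs

start = time.time()
responses = grequests.map(grequests.get(u) for u in urls)
print "grequests took {:.3f}s".format(time.time() - start)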