Python: a very simple multithreaded parallel URL fetch (without a queue)
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/16181121/
A very simple multithreading parallel URL fetching (without queue)
Asked by Daniele B
I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most scripts I found use queues, multiprocessing or complex libraries.

Finally I wrote one myself, which I am reporting as an answer. Please feel free to suggest any improvements.

I guess other people might have been looking for something similar.
Accepted answer by abarnert
Simplifying your original version as far as possible:
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s' fetched in %ss" % (url, (time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)
The only new tricks here are:

- Keep track of the threads you create.
- Don't bother with a counter of threads if you just want to know when they're all done; join already tells you that.
- If you don't need any state or external API, you don't need a Thread subclass, just a target function (see the sketch after this list).
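For instance, a minimal sketch (my addition, not from the answer) of collecting the fetched pages with plain target functions: each thread writes its result into a shared dict under its own URL, reusing the imports and urls from the snippet above. The function name fetch_and_store is mine.

results = {}

def fetch_and_store(url):
    # each thread writes to its own key, so no extra locking is needed here
    results[url] = urllib2.urlopen(url).read()

threads = [threading.Thread(target=fetch_and_store, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
# once every thread has been joined, results maps each url to its HTML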
Answered by Daniele B
This script fetches the content from a set of URLs defined in an array. It spawns a thread for each URL to be fetched, so it is meant to be used for a limited set of URLs.

Instead of using a queue object, each thread notifies its end with a callback to a global function, which keeps count of the number of threads still running.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]
left_to_fetch = len(urls)

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        # the threads are left non-daemon, so the interpreter waits for
        # them to finish even though they are never joined
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        finished_fetch_url(self.url)


def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    left_to_fetch -= 1
    if left_to_fetch == 0:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)


# spawn a FetchUrl thread for each url to fetch
for url in urls:
    FetchUrl(url).start()
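As the later answer points out, comments on this solution noted that the bare decrement of left_to_fetch is not thread-safe. A minimal sketch of guarding it with a threading.Lock (the lock is my addition, not part of the original answer) could look like this:

counter_lock = threading.Lock()

def finished_fetch_url(url):
    "lock-protected variant of the callback above"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    with counter_lock:
        left_to_fetch -= 1
        all_done = (left_to_fetch == 0)
    if all_done:
        print "Elapsed Time: %ss" % (time.time() - start)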
Answered by abarnert
The main example in the concurrent.futures docs does everything you want, a lot more simply. Plus, it can handle huge numbers of URLs by only doing 5 at a time, and it handles errors much more nicely.

Of course this module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can just install the backport, futures, off PyPI. All you need to change from the example code is to search-and-replace concurrent.futures with futures, and, for 2.x, urllib.request with urllib2.
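For reference, the Python 3 side of that substitution only touches the fetching function; a minimal, untested sketch (my addition, not part of the answer):

import concurrent.futures
import urllib.request

def load_url(url, timeout):
    # same fetch as the backported sample below, using urllib.request instead of urllib2
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()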
Here's the sample backported to 2.x, modified to use your URL list and to add the times:
import concurrent.futures
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib2.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
        else:
            print '"%s" fetched in %ss' % (url, (time.time() - start))

print "Elapsed Time: %ss" % (time.time() - start)
But you can make this even simpler. Really, all you need is:
def load_url(url):
    conn = urllib2.urlopen(url, timeout=60)
    data = conn.read()
    print '"%s" fetched in %ss' % (url, (time.time() - start))
    return data

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = executor.map(load_url, urls)

print "Elapsed Time: %ss" % (time.time() - start)
Answered by Daniele B
I am now publishing a different solution: the worker threads are non-daemon and are joined to the main thread (which blocks the main thread until all worker threads have finished), instead of notifying the end of each worker thread with a callback to a global function (as I did in the previous answer), since some comments noted that that approach is not thread-safe.
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        print "'%s' fetched in %ss" % (self.url, (time.time() - start))

for url in urls:
    FetchUrl(url).start()

# Join all existing threads to main thread.
for thread in threading.enumerate():
    if thread is not threading.currentThread():
        thread.join()

print "Elapsed Time: %s" % (time.time() - start)
Answered by jfs
multiprocessing has a thread pool that doesn't start other processes:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))
The advantages compared to the Thread-based solutions:

- ThreadPool allows you to limit the maximum number of concurrent connections (20 in the code example)
- the output is not garbled because all output is in the main thread
- errors are logged
- the code works on both Python 2 and 3 without changes (assuming from urllib.request import urlopen on Python 3).
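One small point not covered in the answer: the sample never shuts the pool down explicitly. A minimal sketch of the same loop with an explicit close/join at the end (optional housekeeping, my addition, reusing fetch_url, urls and timer from above):

pool = ThreadPool(20)  # at most 20 concurrent connections
start = timer()
try:
    for url, html, error in pool.imap_unordered(fetch_url, urls):
        if error is None:
            print("%r fetched in %ss" % (url, timer() - start))
        else:
            print("error fetching %r: %s" % (url, error))
finally:
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for the worker threads to exit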

