Python urllib2.urlopen() is slow, need a better way to read several urls

Note: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3472515/

Python urllib2.urlopen() is slow, need a better way to read several urls

Tags: python, http, concurrency, urllib2

Asked by Hyman z

As the title suggests, I'm working on a site written in python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to php, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Accepted answer by Wai Yip Tung

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # download the page and hand the raw bytes to the shared queue
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()  # thread-safe container for the downloaded pages
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every download has finished
    return result

def fetch_sequencial():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result

Best time for fetch_sequencial() is 2s. Best time for fetch_parallel() is 0.9s.

Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.
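
For completeness, a minimal usage sketch of my own (reusing the names defined above) showing how the Queue returned by fetch_parallel() could be drained once the threads have joined:

result = fetch_parallel()
while not result.empty():
    page = result.get()  # raw HTML of one of the pages
    # parse `page` with BeautifulSoup (or anything else) here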

Answered by bwawok

1) Are you opening the same site many times, or many different sites? If many different sites, I think urllib2 is good. If hitting the same site over and over again, I have had some personal luck with urllib3 http://code.google.com/p/urllib3/
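
For the same-site case, a rough connection-pooling sketch with urllib3 (API as in current urllib3 releases, which may differ from the 2010-era version linked above) could look like this:

import urllib3

http = urllib3.PoolManager(maxsize=10)  # keeps connections open and reuses them per host

def fetch(url):
    # the pooled connection is reused for repeated requests to the same server
    return http.request('GET', url).data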

2) BeautifulSoup is easy to use, but is pretty slow. If you do have to use it, make sure to decompose your tags to get rid of memory leaks... or it will likely lead to memory issues (it did for me).
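
A small illustration of that decompose idea (my own sketch; parse_titles is a hypothetical helper): once you have pulled out what you need, tearing down the tree lets it be garbage-collected.

from bs4 import BeautifulSoup

def parse_titles(html):
    soup = BeautifulSoup(html, 'html.parser')
    titles = [tag.get_text() for tag in soup.find_all('h1')]
    soup.decompose()  # destroy the parse tree so it does not keep the whole page in memory
    return titles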

What do your memory and CPU usage look like? If you are maxing out your CPU, make sure you are using real heavyweight threads, so you can run on more than one core.

Answered by Dumb Guy

Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.

The speed of reading web pages is probably bounded by your Internet connection, not Python.

You could use threads to load them all at once.

import thread, time, urllib

websites = {}

def read_url(url):
    # urls_to_load is the list of URLs from the question
    websites[url] = urllib.urlopen(url).read()

for url in urls_to_load:
    thread.start_new_thread(read_url, (url,))

# crude wait: poll until every URL has been stored
while len(websites) < len(urls_to_load):
    time.sleep(0.1)

# Now websites will contain the contents of all the web pages in urls_to_load

Answered by OTZ

How about using pycurl?

You can apt-get it by

$ sudo apt-get install python-pycurl
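
A minimal single-URL sketch with pycurl (my addition, Python 2 style with StringIO as the write buffer; pycurl also offers CurlMulti for driving many transfers concurrently):

import pycurl
from StringIO import StringIO  # use io.BytesIO on Python 3

def fetch(url):
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # libcurl appends the response body into our buffer
    c.setopt(pycurl.FOLLOWLOCATION, True)      # follow redirects
    c.perform()
    c.close()
    return buf.getvalue()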

Answered by msw

As a general rule, a given construct in any language is not slow until it is measured.

In Python, not only do timings often run counter to intuition, but the tools for measuring execution time are exceptionally good.
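
For instance, the standard-library timeit module gives quick numbers before deciding what to optimize (a sketch of my own; the URL is just a placeholder, and on Python 3 the import would be urllib.request instead of urllib2):

import timeit

setup = "import urllib2"
stmt = "urllib2.urlopen('http://example.com/').read()"
# one fetch per trial, three trials; prints the per-trial wall-clock times
print(timeit.repeat(stmt, setup=setup, repeat=3, number=1))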

Answered by habnabit

Scrapy might be useful for you. If you don't need all of its functionality, you might just use twisted's twisted.web.client.getPage instead. Asynchronous IO in one thread is going to be way more performant and easier to debug than anything that uses multiple threads and blocking IO.
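
A rough sketch of the getPage approach (my own, assuming the urls_to_load list from the question; note that getPage has since been deprecated in favour of twisted.web.client.Agent / treq):

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

def report(results):
    # results is a list of (success, page-or-failure) tuples
    for success, page in results:
        if success:
            print('Fetched %s bytes' % len(page))
    reactor.stop()

deferreds = [getPage(url) for url in urls_to_load]
DeferredList(deferreds, consumeErrors=True).addCallback(report)
reactor.run()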

Answered by Thomas15v

It is maybe not perfect, but when I need the data from a site, I just do this:

import socket

def geturldata(url):
    # expects "host/path" without the "http://" prefix
    server = url.split("/")[0]
    path = url[len(server):] or "/"
    returndata = ""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((server, 80))  # let's connect :p

    s.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, server))  # simple HTTP request
    while 1:
        data = s.recv(1024)  # read in 1 KB chunks until the server closes the connection
        if not data:
            break
        returndata = returndata + data
    s.close()
    return returndata.split("\r\n\r\n", 1)[1]  # strip the response headers, keep the body

Answered by Thomas15v

Not sure why nobody mentions multiprocessing (if anyone knows why this might be a bad idea, let me know):

import multiprocessing
from urllib2 import urlopen

URLS = [....]

def get_content(url):
    return urlopen(url).read()


pool = multiprocessing.Pool(processes=8)  # play with ``processes`` for best results
results = pool.map(get_content, URLS)     # this call blocks; look at map_async
                                          # for a non-blocking map() call
pool.close()  # the process pool no longer accepts new tasks
pool.join()   # join the processes: this blocks until all URLs are processed
for result in results:
    pass      # do something with each page here

There are a few caveats with multiprocessing pools. First, unlike threads, these are completely new Python processes (interpreters). While this is not subject to the global interpreter lock, it means you are limited in what you can pass across to the new process.

You cannot pass lambdas and functions that are defined dynamically. The function that is used in the map() call must be defined in your module in a way that allows the other process to import it.

Pool.map(), which is the most straightforward way to process multiple tasks concurrently, doesn't provide a way to pass multiple arguments, so you may need to write wrapper functions or change function signatures, and/or pass multiple arguments as part of the iterable that is being mapped (one possible workaround is sketched below).
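
For illustration (my addition; get_content_with_timeout and worker are hypothetical names), one common workaround is to pack the arguments into tuples and unpack them in a small module-level wrapper:

def get_content_with_timeout(url, timeout):
    # hypothetical two-argument worker
    return urlopen(url, timeout=timeout).read()

def worker(args):
    # module-level wrapper, so the child processes can import and unpickle it
    return get_content_with_timeout(*args)

# e.g. results = pool.map(worker, [(url, 10) for url in URLS])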

You cannot have child processes spawn new ones. Only the parent can spawn child processes. This means you have to carefully plan and benchmark (and sometimes write multiple versions of your code) in order to determine what the most effective use of processes would be.

Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to do concurrent blocking calls. You can also combine multiprocessing and threads (as far as I know, but please correct me if I'm wrong), or combine multiprocessing with green threads (see the sketch below).
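
As a green-thread illustration (my addition, assuming a library such as gevent is what is meant; URLS as above):

import gevent
from gevent import monkey
monkey.patch_all()  # make the socket/urllib2 stack cooperative

import urllib2

def get_content(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(get_content, url) for url in URLS]
gevent.joinall(jobs, timeout=30)
pages = [job.value for job in jobs]  # None for any job that failed or timed out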

Answered by fzn0728

First, you should try multithreading/multiprocessing packages. Currently, the three popular ones are multiprocessing, concurrent.futures, and threading. Those packages can help you open multiple urls at the same time, which increases the speed.
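
For example, a short concurrent.futures sketch (my addition, Python 3; the URL list is just a placeholder):

import concurrent.futures
import urllib.request

urls = ['http://example.com/', 'http://example.org/']  # placeholder list

def load(url):
    return urllib.request.urlopen(url, timeout=10).read()

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    pages = list(executor.map(load, urls))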

More importantly, after switching to multithreaded processing, if you try to open hundreds of urls at the same time, you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, you should try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().

So, here I list two examples of fast multi-url parsing, and the speed is faster than in the other answers. The first example uses the classical threading package and generates hundreds of threads at the same time. (One trivial shortcoming is that it cannot keep the original order of the tickers.)

import time
import threading
import pandas as pd
import requests
from bs4 import BeautifulSoup


ticker = pd.ExcelFile('short_tickerlist.xlsx')
ticker_df = ticker.parse(str(ticker.sheet_names[0]))
ticker_list = list(ticker_df['Ticker'])

start = time.time()

result = []
def fetch(ticker):
    url = ('http://finance.yahoo.com/quote/' + ticker)
    print('Visit ' + url)
    text = requests.get(url).content
    soup = BeautifulSoup(text,'lxml')
    result.append([ticker,soup])
    print(url +' fetching...... ' + str(time.time()-start))



if __name__ == '__main__':
    process = [None] * len(ticker_list)
    for i in range(len(ticker_list)):
        process[i] = threading.Thread(target=fetch, args=[ticker_list[i]])

    for i in range(len(ticker_list)):    
        print('Start_' + str(i))
        process[i].start()



    # for i in range(len(ticker_list)):
    #     print('Join_' + str(i))    
    #     process[i].join()

    print("Elapsed Time: %ss" % (time.time() - start))

The second example uses the multiprocessing package, and it is a little more straightforward, since you just need to state the size of the pool and map the function. The order will not change after fetching the content, and the speed is similar to the first example but much faster than the other methods.

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

os.chdir('file_path')

start = time.time()

def fetch_url(x):
    print('Getting Data')
    myurl = ("http://finance.yahoo.com/q/cp?s=%s" % x)
    html = requests.get(myurl).content
    soup = BeautifulSoup(html,'lxml')
    out = str(soup)
    listOut = [x, out]
    return listOut

tickDF = pd.read_excel('short_tickerlist.xlsx')
li = tickDF['Ticker'].tolist()    

if __name__ == '__main__':
    p = Pool(5)
    output = p.map(fetch_url, li, chunksize=30)
    print("Time is %ss" %(time.time()-start))