Notice: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/18939133/

Time: 2020-09-13 21:10:40  Source: igfitidea

How to modify pandas's read_html user-agent?

python pandas web-scraping urllib2

Asked by kbgo

I'm trying to scrape English football stats from various HTML tables on the Transfermarkt website using the pandas.read_html() function.

Example:

import pandas as pd
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
df = pd.read_html(url)

However, this code generates a "ValueError: Invalid URL" error.

I then attempted to parse the same website using the urllib2.urlopen() function. This time I got an "HTTPError: HTTP Error 404: Not Found". After the usual trial-and-error fault finding, it turns out that urllib2 presents a Python-like user agent to the web server, which I presume the server doesn't recognize.
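
To see which agent string is sent by default, the opener's headers can be inspected before any request is made. This is a minimal sketch, assuming Python 3, where urllib2 was split into urllib.request and urllib.error; the variable names are only illustrative.

```python
import urllib.request

# A freshly built opener carries the default header list.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]
print(default_agent)  # "Python-urllib/" followed by the interpreter's major.minor version

# Replacing the list makes later opener.open(...) calls send a browser-like agent.
opener.addheaders = [("User-agent", "Mozilla/5.0")]
custom_agent = dict(opener.addheaders)["User-agent"]
print(custom_agent)
```

No request is sent here; the snippet only shows what would go into the User-agent header.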

Now if I modify urllib2's agent and read its contents using BeautifulSoup, I'm able to read the table without a problem.

Example:

from BeautifulSoup import BeautifulSoup
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
response = opener.open(url)
html = response.read()
soup = BeautifulSoup(html)
table = soup.find("table")

How do I modify pandas's urllib2 header to allow python to scrape this website?

Thanks

Answered by Viktor Kerkez

Currently you cannot. Relevant piece of code:

if _is_url(io): # io is the url
    try:
        with urlopen(io) as url:
            raw_text = url.read()
    except urllib2.URLError:
        raise ValueError('Invalid URL: "{0}"'.format(io))
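
A side note on the except clause in this excerpt: it is probably why the 404 from the question shows up as "ValueError: Invalid URL" when going through pandas, since HTTPError is a subclass of URLError. The sketch below reproduces that pattern in Python 3 (urllib.error instead of urllib2). The names read_url and always_404 are hypothetical stand-ins, not pandas functions.

```python
import urllib.error


def read_url(url, open_func):
    """Mimic the error handling in the excerpt (illustrative only)."""
    try:
        return open_func(url)
    except urllib.error.URLError:
        raise ValueError('Invalid URL: "{0}"'.format(url))


def always_404(url):
    # Hypothetical stand-in for a server that rejects the request.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


# HTTPError derives from URLError, so the handler above also catches a 404.
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
print(is_subclass)

try:
    read_url("http://example.com/table.html", always_404)
except ValueError as exc:
    message = str(exc)
    print(message)
```

The URL is valid in this case; the misleading message comes from treating every URLError the same way.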

As you can see, it just passes the url to urlopen and reads the data. You can file an issue requesting this feature, but I assume you don't have time to wait for it to be resolved, so I would suggest using BeautifulSoup to parse the html data and then load it into a DataFrame.

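import urllib2
import pandas as pd

url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
# Send a browser-like User-agent instead of the default Python one
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
# read_html returns a list of DataFrames; take the first matching table
tables = pd.read_html(response.read(), attrs={"class":"tabelle_grafik"})[0]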

Or if you can use requests:

import pandas as pd
import requests

tables = pd.read_html(requests.get(url,
                                   headers={'User-agent': 'Mozilla/5.0'}).text,
                      attrs={"class":"tabelle_grafik"})[0]
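
The urllib2 variant above targets Python 2. A rough Python 3 equivalent of the download step could look like the sketch below, using only the standard library. build_request and fetch_html are hypothetical helper names introduced for illustration, and the 10 second timeout is an arbitrary assumption.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Download a page as text while sending the custom User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be handed to pd.read_html, for example as pd.read_html(io.StringIO(fetch_html(url))), since recent pandas versions prefer a file-like object over a raw HTML string. As far as I know, pandas 2.1 and later also accept a storage_options dict in read_html whose entries are forwarded as request headers for HTTP(S) URLs, which would make the manual download unnecessary; check the documentation of your installed version before relying on it.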