Changing User Agent in Python 3 for urllib.request.urlopen

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24226781/

Tags: python, python-3.x, urllib, user-agent

Asked by user3662991

I want to open a URL using urllib.request.urlopen('someurl'):

import urllib.request

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error is due to the site not letting Python access it, to stop bots wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent are for urllib2, and I am using Python 3, so none of those solutions work.

How can I fix this problem in Python 3?

Accepted answer by Martin Konecny

From the Python docs:

import urllib.request

# `url` holds the address you want to fetch, e.g. 'http://example.com/'.
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
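
Equivalently, you can set the header after constructing the Request, using its add_header() method (a minimal sketch, again assuming a url variable that holds the target address):

import urllib.request

req = urllib.request.Request(url)
# add_header() takes the header name and value as separate arguments.
req.add_header('User-Agent', 'Mozilla/5.0')
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))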

Answered by Collin Anderson

from urllib.request import urlopen, Request

urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))

Answered by John Nagle

The host site's rejection comes from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
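
You can inspect the default header that urllib sends (a quick sketch; the exact version string depends on your interpreter):

import urllib.request

# A fresh OpenerDirector starts with a single default header, the
# User-Agent, e.g. ('User-agent', 'Python-urllib/3.8').
opener = urllib.request.build_opener()
print(opener.addheaders)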

Unfortunately, if you use Python's "robotparser" module,

https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser

it uses the default Python user agent, and there's no parameter to change that. If "robotparser"'s attempt to read "robots.txt" is refused (not just URL not found), it then treats all URLs from that site as disallowed.
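
One possible workaround (a minimal sketch, using the placeholder address http://example.com): fetch robots.txt yourself with a browser-like User-Agent and pass the lines to RobotFileParser.parse() instead of calling its read():

import urllib.request
import urllib.robotparser

# Fetch robots.txt with a custom User-Agent...
req = urllib.request.Request(
    'http://example.com/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0'}
)
with urllib.request.urlopen(req) as f:
    lines = f.read().decode('utf-8').splitlines()

# ...then parse it ourselves, bypassing robotparser's default fetch.
rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch('Mozilla/5.0', 'http://example.com/somepage.html'))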

Answered by Tonny Xu

I just answered a similar question here: https://stackoverflow.com/a/43501438/206820

In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:

from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)

# `file_url` is the resource to download; `file_name` is the local save path.
result = urlretrieve(url=file_url, filename=file_name)
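
Note that install_opener() replaces the process-wide default opener, so the Mozilla User-Agent above is also sent by any later urlopen() or urlretrieve() call in the same program.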

The reason I added the proxy is to monitor the traffic in Charles, and here is the traffic I got:

(Screenshot: the captured request in Charles, showing the User-Agent header.)
