Changing User Agent in Python 3 for urllib.request.urlopen

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24226781/

Tags: python, python-3.x, urllib, user-agent

Asked by user3662991

I want to open a URL using urllib.request.urlopen('someurl'):

import urllib.request

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error is due to the site not letting Python access it, to stop bots wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent are for urllib2, and I am using Python 3, so none of those solutions work.

How can I fix this problem in Python 3?

Accepted answer by Martin Konecny

From the Python docs:

import urllib.request

# `url` holds the address you want to fetch, e.g. 'http://example.com/'.
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
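
Equivalently, you can set the header after constructing the Request, using its add_header() method (a minimal sketch, again assuming a url variable that holds the target address):

import urllib.request

req = urllib.request.Request(url)
# add_header() takes the header name and value as separate arguments.
req.add_header('User-Agent', 'Mozilla/5.0')
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))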

Answered by Collin Anderson

from urllib.request import urlopen, Request

urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))

Answered by John Nagle

The host site's rejection comes from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
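
You can inspect the default header that urllib sends (a quick sketch; the exact version string depends on your interpreter):

import urllib.request

# A fresh OpenerDirector starts with a single default header, the
# User-Agent, e.g. ('User-agent', 'Python-urllib/3.8').
opener = urllib.request.build_opener()
print(opener.addheaders)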

Unfortunately, if you use Python's "robotparser" module,

https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser

it uses the default Python user agent, and there's no parameter to change that. If "robotparser"'s attempt to read "robots.txt" is refused (not just URL not found), it then treats all URLs from that site as disallowed.
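
One possible workaround (a minimal sketch, using the placeholder address http://example.com): fetch robots.txt yourself with a browser-like User-Agent and pass the lines to RobotFileParser.parse() instead of calling its read():

import urllib.request
import urllib.robotparser

# Fetch robots.txt with a custom User-Agent...
req = urllib.request.Request(
    'http://example.com/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0'}
)
with urllib.request.urlopen(req) as f:
    lines = f.read().decode('utf-8').splitlines()

# ...then parse it ourselves, bypassing robotparser's default fetch.
rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch('Mozilla/5.0', 'http://example.com/somepage.html'))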

Answered by Tonny Xu

I just answered a similar question here: https://stackoverflow.com/a/43501438/206820

In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:

from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)

# `file_url` is the resource to download; `file_name` is the local save path.
result = urlretrieve(url=file_url, filename=file_name)
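
Note that install_opener() replaces the process-wide default opener, so the Mozilla User-Agent above is also sent by any later urlopen() or urlretrieve() call in the same program.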

The reason I added the proxy is to monitor the traffic in Charles, and here is the traffic I got:

(Screenshot: the captured request in Charles, showing the User-Agent header.)
