Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

Disclaimer: this content is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): Stack Overflow.

Original question: http://stackoverflow.com/questions/3336549/
Asked by Ram Rachum
I have a strange bug when trying to `urlopen` a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Accepted answer by Jochen Ritzel
Wikipedia's bot policy states:

"Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database."
That is why Python is blocked: Wikipedia rejects requests carrying the default `Python-urllib` User-Agent. You're supposed to download the data dumps instead.
Anyway, you can read pages like this in Python 2:
import urllib2

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
# Send a custom User-Agent so the request is not rejected as a bot
req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib2.urlopen(req)
print con.read()
Or in Python 3:
import urllib.request

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
# Send a custom User-Agent so the request is not rejected as a bot
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
print(con.read())
Answered by Chris Foster
Some websites block access from scripts, to avoid 'unnecessary' load on their servers, by inspecting the headers that urllib sends. I don't know and can't imagine why Wikipedia does/would do this, but have you tried spoofing your headers?
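For reference, here is a minimal sketch (Python 2, matching the code above) showing the default User-Agent that urllib2 announces; the exact version string depends on your interpreter:

import urllib2

# The default opener identifies itself as "Python-urllib/x.y",
# which is the kind of script User-Agent such filters match on
opener = urllib2.build_opener()
print opener.addheaders  # e.g. [('User-agent', 'Python-urllib/2.6')]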
Answered by Eli
Oftentimes websites filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser; a sketch of one approach follows.
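This is a minimal sketch of spoofing globally (the 'Mozilla/5.0' string here is just an illustrative browser-like User-Agent, not a requirement): install an opener whose default headers are replaced, and every subsequent urlopen() call will use it.

import urllib2

# Install a global opener whose User-Agent looks like a browser
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)

html = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)').read()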
Answered by S.Lott
To debug this, you'll need to trap that exception.
import urllib2

try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    # The body of the error response usually explains the rejection
    print e.fp.read()
When I print the resulting message, it includes the following:
"English
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. "
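A small sketch extending that idea: besides the body, the `HTTPError` object also exposes the status code and (when the error response carries a body, as a 403 page does) the response headers, which can help narrow down the cause:

import urllib2

try:
    urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.code    # the HTTP status, e.g. 403
    print e.info()  # the headers sent along with the error page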
Answered by Hello World
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki API (the api.php endpoint). To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
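A minimal sketch of fetching and decoding that response in Python 2 (assuming the legacy JSON layout, where the wikitext of the latest revision sits under the '*' key):

import json
import urllib2

api_url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
           '&titles=love&prop=revisions&rvprop=content')
data = json.load(urllib2.urlopen(api_url))

# The result is keyed by page id, which is not known in advance
for page in data['query']['pages'].values():
    print page['revisions'][0]['*']  # raw wikitext of the latest revision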
Answered by Phil
I made a workaround for this using a PHP redirect script, which is not blocked by the site I needed.
It can be accessed like this:
import urllib2

# Proxy the target page through the PHP redirect script
path = ('http://phillippowers.com/redirects/get.php?'
        'file=http://website_you_need_to_load.com')
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML code to you.

