Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

Disclaimer: this content is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): Stack Overflow.

Original question: http://stackoverflow.com/questions/3336549/
Asked by Ram Rachum
I have a strange bug when trying to `urlopen` a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Accepted answer by Jochen Ritzel
Wikipedia's bot policy states:

"Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database."
That is why Python is blocked: Wikipedia rejects requests carrying the default `Python-urllib` User-Agent. You're supposed to download the data dumps instead.
Anyway, you can read pages like this in Python 2:
import urllib2

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
# Send a custom User-Agent so the request is not rejected as a bot
req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib2.urlopen(req)
print con.read()
Or in Python 3:
import urllib.request

url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
# Send a custom User-Agent so the request is not rejected as a bot
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
print(con.read())
Answered by Chris Foster
Some websites block access from scripts, to avoid 'unnecessary' load on their servers, by inspecting the headers that urllib sends. I don't know and can't imagine why Wikipedia does/would do this, but have you tried spoofing your headers?
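For reference, here is a minimal sketch (Python 2, matching the code above) showing the default User-Agent that urllib2 announces; the exact version string depends on your interpreter:

import urllib2

# The default opener identifies itself as "Python-urllib/x.y",
# which is the kind of script User-Agent such filters match on
opener = urllib2.build_opener()
print opener.addheaders  # e.g. [('User-agent', 'Python-urllib/2.6')]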
Answered by Eli
Oftentimes websites filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser; a sketch of one approach follows.
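This is a minimal sketch of spoofing globally (the 'Mozilla/5.0' string here is just an illustrative browser-like User-Agent, not a requirement): install an opener whose default headers are replaced, and every subsequent urlopen() call will use it.

import urllib2

# Install a global opener whose User-Agent looks like a browser
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)

html = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)').read()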
Answered by S.Lott
To debug this, you'll need to trap that exception.
import urllib2

try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    # The body of the error response usually explains the rejection
    print e.fp.read()
When I print the resulting message, it includes the following:
"English
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. "
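A small sketch extending that idea: besides the body, the `HTTPError` object also exposes the status code and (when the error response carries a body, as a 403 page does) the response headers, which can help narrow down the cause:

import urllib2

try:
    urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.code    # the HTTP status, e.g. 403
    print e.info()  # the headers sent along with the error page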
Answered by Hello World
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki API (the api.php endpoint). To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
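A minimal sketch of fetching and decoding that response in Python 2 (assuming the legacy JSON layout, where the wikitext of the latest revision sits under the '*' key):

import json
import urllib2

api_url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
           '&titles=love&prop=revisions&rvprop=content')
data = json.load(urllib2.urlopen(api_url))

# The result is keyed by page id, which is not known in advance
for page in data['query']['pages'].values():
    print page['revisions'][0]['*']  # raw wikitext of the latest revision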
Answered by Phil
I made a workaround for this using a PHP redirect script, which is not blocked by the site I needed.
It can be accessed like this:
import urllib2

# Proxy the target page through the PHP redirect script
path = ('http://phillippowers.com/redirects/get.php?'
        'file=http://website_you_need_to_load.com')
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML code to you.

