Python: Why does Google Search return HTTP Error 403?
Original URL: http://stackoverflow.com/questions/600536/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Why does Google Search return HTTP Error 403?
Asked by AgentLiquid
Consider the following Python code:
import urllib.request
url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
url_object = urllib.request.urlopen(url)
print(url_object.read())
When this is run, an Exception is thrown:
File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
However, when this is put into a browser, the search returns as expected. What's going on here? How can I overcome this so I can search Google programmatically?
Any thoughts?
Answered by
This should do the trick:
import urllib.request  # Python 3: urllib2 was merged into urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the data you need
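To see why the header matters, it can help to look at what urllib identifies itself as when no User-Agent is set. The sketch below only inspects objects locally and makes no network call. The assumption here (consistent with this answer, but not something Google documents in this thread) is that the default `Python-urllib/x.y` identifier is what gets refused. `BROWSER_UA` is just an illustrative name and value.

```python
import urllib.request

# Headers that urllib attaches by default when urlopen() is given a bare URL.
default_opener = urllib.request.build_opener()
default_headers = dict(default_opener.addheaders)
print(default_headers)  # includes a 'User-agent' entry such as 'Python-urllib/3.x'

# Illustrative browser-like identifier (an assumption, any realistic value works).
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) "
    "Gecko/2009021910 Firefox/3.0.7"
)
url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"

# Building the Request does not send anything; it only records the header.
request = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
print(request.get_header("User-agent"))  # urllib stores the key as 'User-agent'
```

Printing both values side by side makes the difference between the two requests visible before anything is sent.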
Answered by Kevin Lacquement
If you want to do Google searches "properly" through a programming interface, take a look at Google APIs. Not only are these the official way of searching Google, they are also not likely to change if Google changes their result page layout.
Answered by Don Kirkby
As lacqui suggested, the Google APIs are the way they want you to make requests from code. Unfortunately, I found their documentation was aimed at people writing AJAX web pages, not making raw HTTP requests. I used LiveHTTP Headers to trace the HTTP requests that the samples made, and I found ddipaolo's blog post helpful.
One more thing that tripped me up: they limit you to the first 64 results from a query. That is usually not a problem if you are just providing web users with a search box, but it is not helpful if you're trying to use Google for data mining. I guess they don't want you to go data mining using their API. The limit of 64 has changed over time and varies between search products.
Update: It appears they definitely do not want you to go data mining. Eventually, you get a 403 error with a link to this API access notice.
Please review the Terms of Use for the API(s) you are using (linked in the right sidebar) and ensure compliance. It is likely that we blocked you for one of the following Terms of Use violations: We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action.
They also list other violations, but I think that's the one that triggered for me. I may have to investigate Yahoo's BOSS service. It doesn't seem to have as many restrictions.
Answered by Joel Coehoorn
You're doing it too often. Google has limits in place to prevent getting swamped by search bots. You can also try setting the user-agent to something that more closely resembles a normal browser.
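Both suggestions in this answer (slow down, and send a browser-like user agent) can be combined in a small helper. This is only a sketch: `fetch_politely` and its parameters are hypothetical names, the retry count and delays are arbitrary, and waiting longer may not help at all if the server has decided to refuse automated requests outright.

```python
import time
import urllib.error
import urllib.request


def fetch_politely(url, user_agent, retries=3, delay=2.0,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, pausing and retrying when the server answers 403.

    Returns the response body as bytes, or None if every attempt was refused.
    The opener and sleep parameters exist so the helper can be tried offline.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    for attempt in range(retries):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 403:
                raise  # unrelated failure: let the caller see it
            sleep(delay * (2 ** attempt))  # wait longer after each refusal
    return None
```

A call would look like `body = fetch_politely(url, user_agent)`, with `None` meaning that every attempt was answered with 403.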