bash: Wget does not fetch Google search results
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/29204103/
Asked by anubhava
I noticed that when running wget https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo
and similar queries, I don't get the search results, only the Google homepage.
There seems to be some redirect within the Google page. Does anyone know how to make wget fetch the actual results?
Answered by anubhava
You can use these curl commands to pull Google query results:
curl -sA "Chrome" -L 'http://www.google.com/search?hl=en&q=time' -o search.html
For an https URL:
curl -k -sA "Chrome" -L 'https://www.google.com/search?hl=en&q=time' -o ssearch.html
The -A option sets a custom user-agent string, Chrome, in the request to Google.
Answered by Dolda2000
#q=foo is your hint: it is a fragment ID, which never gets sent to the server. I'm guessing you just took this URL from your browser's URL bar when using the live-search function. Since that is implemented with a lot of client-side magic, you cannot rely on it to work; try using Google with live search disabled instead. A URL pattern that seems to work looks like this: http://www.google.com/search?hl=en&q=foo
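The split between what the server receives and what stays on the client can be illustrated with plain bash parameter expansion. This is only a minimal sketch: the variable names are illustrative, and it merely mimics how an HTTP client drops the fragment before sending the request. It assumes the URL contains a '#'.

```shell
#!/usr/bin/env bash
# URL copied from the browser's address bar (taken from the question).
# Single quotes keep the shell from treating each '&' as a background operator.
url='https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo'

# Everything from the first '#' onward is the fragment; clients do not send it.
request_url="${url%%#*}"   # strip the longest suffix starting at '#'
fragment="${url#*#}"       # strip the shortest prefix ending at '#'

echo "sent to server: $request_url"
echo "kept by client: $fragment"
```

The search term only appears in the second value, which is consistent with the server answering with the homepage rather than results.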
However, I do notice that Google returns 403 Forbidden when called naïvely with wget, indicating that they don't want that. You can easily get past it by setting some other user-agent string, but do consider all the implications before doing so on a regular basis.
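With wget, the user-agent string can be set with -U (long form --user-agent). The following is a minimal sketch, assuming the /search URL pattern from this answer still applies. The helper name google_search_url and the 'Mozilla/5.0' string are illustrative choices, not something from the original answer. The fetch itself is left commented out because it needs network access and Google may still refuse it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a search URL from a query string.
# Only spaces are encoded (as '+'); other special characters are NOT handled.
google_search_url() {
  local query="${1// /+}"
  printf 'http://www.google.com/search?hl=en&q=%s\n' "$query"
}

url="$(google_search_url 'foo bar')"
echo "$url"

# Fetch with a non-default user-agent (-U) and save the page to a file (-O).
# Quoting "$url" keeps the shell from interpreting the '&' in the query string.
# wget -U 'Mozilla/5.0' -O search.html "$url"
```

Keeping the URL construction separate from the fetch makes it easy to check the request before sending anything to Google.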