javascript: Crawling Google Search with PHP

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/14552043/

Time: 2020-10-26 21:57:20  Source: igfitidea

Crawling Google Search with PHP

php javascript google-api web-crawler

Asked by jamietelin

I am trying to get my head around how to fetch Google search results with PHP or JavaScript. I know it used to be possible, but now I can't find a way.

I am trying to duplicate (somewhat) the functionality of
http://www.getupdated.se/sokmotoroptimering/seo-verktyg/kolla-ranking/

But really, the core issue I want to solve is just getting the search results via PHP or JavaScript; the rest I can figure out.

Fetching the results using file_get_contents() or cURL doesn't seem to work.

Example:

$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, 'http://www.google.se/#hl=sv&q=dogs');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$result = curl_exec($ch);
curl_close($ch);
echo '<pre>';
var_dump($result);
echo '</pre>';

Results:

string(219) "302 Moved The document has moved here."

So, with some Googling I found http://code.google.com/apis/customsearch/v1/overview.html but that seems to only work for generating a custom search for one or more websites. It seems to require a "Custom Search Engine" cx parameter to be passed.

So anyway, any idea?

Answered by Hardik Thaker

I have done this before. Fetch the HTML by making an HTTP request to https://www.google.co.in/search?hl=en&output=search&q=india, then parse specific tags from the result page using the PHP Simple HTML DOM library.

Demo: the code below gives you the titles of all the results:

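<?php

include("simple_html_dom.php");

// Fetch and parse the result page; file_get_html() returns false on failure
$html = file_get_html('http://www.google.co.in/search?hl=en&output=search&q=india');
if (!$html) {
    die('Could not fetch the result page');
}

// Collect the title of every result
$title = array();
foreach ($html->find('li[class=g]') as $element) {
    foreach ($element->find('h3[class=r]') as $h3) {
        $title[] = '<h1>' . $h3->plaintext . '</h1>';
    }
}
print_r($title);

?>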
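
If installing an extra library is not an option, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath classes. This is only a minimal sketch: the `extract_titles` helper and the sample markup are hypothetical, and the `li.g` / `h3.r` selectors are simply carried over from the demo above. Google's real markup differs and changes often, so the XPath expression would likely need adjusting.

```php
<?php
// Hypothetical sample markup standing in for a fetched result page.
$sampleHtml = '<ol>'
    . '<li class="g"><h3 class="r"><a href="#">First result</a></h3></li>'
    . '<li class="g"><h3 class="r"><a href="#">Second result</a></h3></li>'
    . '</ol>';

// Hypothetical helper: returns the text of every h3.r found inside an li.g.
function extract_titles(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query('//li[@class="g"]//h3[@class="r"]') as $h3) {
        $titles[] = trim($h3->textContent);
    }
    return $titles;
}

print_r(extract_titles($sampleHtml));
```

In real use, `$sampleHtml` would be replaced by the body returned from the HTTP request.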

Answered by Soufiane Ghzal

There is a PHP package on GitHub named google-url that does the job.

The API is comfortable to use. See the example:

// create a new crawler
$googleUrl = new \GoogleURL\GoogleUrl();
$googleUrl->setLang('en'); // language to search in (it could have been "fr" instead)
$googleUrl->setNumberResults(10); // how many results you want to check
// launch the search for a specific keyword
$results = $googleUrl->search("google crawler");
// finally, you can loop over the results (an example is also available on the GitHub page)

However, you will need to add a delay between queries, or else Google will treat you as a bot and ask for a captcha, which will block the script.
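
One way to space out the queries is a small wrapper that sleeps for a random interval between searches. This is a rough sketch, not part of the package: `search_with_delay` is a hypothetical helper, and the 5 to 15 second range is an arbitrary assumption rather than a threshold documented by Google.

```php
<?php
// Hypothetical helper: runs $search once per keyword, pausing between calls.
// $search is any callable that takes a keyword and returns its results.
function search_with_delay(
    array $keywords,
    callable $search,
    int $minSeconds = 5,   // assumed lower bound, tune for your case
    int $maxSeconds = 15   // assumed upper bound, tune for your case
): array {
    $results = [];
    $keywords = array_values($keywords);
    $last = count($keywords) - 1;

    foreach ($keywords as $index => $keyword) {
        $results[$keyword] = $search($keyword);

        // No need to wait after the final query.
        if ($index < $last && $maxSeconds > 0) {
            sleep(random_int($minSeconds, $maxSeconds));
        }
    }
    return $results;
}
```

With the crawler from the example above, the callable could be `function ($keyword) use ($googleUrl) { return $googleUrl->search($keyword); }`. A randomized pause is used here on the assumption that it looks less mechanical than a fixed one; it does not guarantee that a captcha is avoided.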

Answered by JakeGould

Odd, because if I run curl from the command line I get a 200 OK:

curl -I 'http://www.google.se/#hl=sv&q=dogs'
HTTP/1.1 200 OK
Date: Sun, 27 Jan 2013 20:45:02 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=b82cb66e9d996c48:FF=0:TM=1359319502:LM=1359319502:S=D-LW-_w8GlMfw-lX; expires=Tue, 27-Jan-2015 20:45:02 GMT; path=/; domain=.google.se
Set-Cookie: NID=67=XtW2l43TDBuOaOnhWkQ-AeRbpZOiA-UYEcs7BIgfGs41FkHlEegssgllBRmfhgQDwubG3JB0s5691OLHpNmLSNmJrKHKGZuwxCJYv1qnaBPtzitRECdLAIL0oQ0DSkrx; expires=Mon, 29-Jul-2013 20:45:02 GMT; path=/; domain=.google.se; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked

Also, maybe consider applying urlencode to the passed URL, so this line:

curl_setopt($ch, CURLOPT_URL, 'http://www.google.se/#hl=sv&q=dogs');

Changes to this:

curl_setopt($ch, CURLOPT_URL, 'http://www.google.se/' . urlencode('#hl=sv&q=dogs'));
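
It may also help to look at how the URL itself is structured. Everything after `#` is a fragment, which HTTP clients generally do not send to the server, so the `hl` and `q` values in the question's URL may never reach Google. The sketch below moves them into a query string instead. The `fragment_to_search_url` helper is hypothetical, and the `/search` path is an assumption borrowed from the URL in the first answer.

```php
<?php
// Hypothetical helper: moves parameters from the URL fragment into a
// query string on the /search path (path assumed from the first answer).
function fragment_to_search_url(string $url): string
{
    $parts = parse_url($url);

    // Parse "hl=sv&q=dogs" into an associative array.
    $params = [];
    if (isset($parts['fragment'])) {
        parse_str($parts['fragment'], $params);
    }

    // http_build_query() takes care of URL-encoding each value.
    return $parts['scheme'] . '://' . $parts['host']
        . '/search?' . http_build_query($params, '', '&');
}

echo fragment_to_search_url('http://www.google.se/#hl=sv&q=dogs');
// prints: http://www.google.se/search?hl=sv&q=dogs
```

The resulting string could then be passed to `CURLOPT_URL` in place of the original one.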