php 如何使用 cURL 获取页面内容?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/14953867/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-25 08:16:12  来源:igfitidea点击:

How to get page content using cURL?

phpcurl

提问by 7usam

I would like to scrape the content of this Google search result pageusing curl. I've been trying setting different user agents, and setting other options but I just can't seem to get the content of that page, as I often get redirected or I get a "page moved" error.

我想使用 curl抓取此Google 搜索结果页面的内容。我一直在尝试设置不同的用户代理,并设置其他选项,但我似乎无法获取该页面的内容,因为我经常被重定向或出现“页面移动”错误。

I believe it has something to do with the fact that the query string gets encoded somewhere but I'm really not sure how to get around that.

我相信这与查询字符串在某处编码的事实有关,但我真的不知道如何解决这个问题。

    //$url is the same as the link above
    $ch = curl_init();
    $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch,CURLOPT_CONNECTTIMEOUT,120);
    curl_setopt ($ch,CURLOPT_TIMEOUT,120);
    curl_setopt ($ch,CURLOPT_MAXREDIRS,10);
    curl_setopt ($ch,CURLOPT_COOKIEFILE,"cookie.txt");
    curl_setopt ($ch,CURLOPT_COOKIEJAR,"cookie.txt");
    echo curl_exec ($ch);

What do I need to do to get my php code to show the exact content of the page as I would see it on my browser? What am I missing? Can anyone point me to the right direction?

我需要做什么才能让我的 php 代码显示页面的确切内容,就像我在浏览器上看到的那样?我错过了什么?任何人都可以指出我正确的方向吗?

I've seen similar questions on SO, but none with an answer that could help me.

我在 SO 上看到过类似的问题,但没有一个答案可以帮助我。

EDIT:

编辑:

I tried to just open the link using the Selenium WebDriver, that gives the same results as cURL. I am still thinking that this has to do with the fact that there are special characters in the query string which are getting messed up somewhere in the process.

我尝试使用 Selenium WebDriver 打开链接,结果与 cURL 相同。我仍然认为这与查询字符串中有特殊字符在过程中的某个地方弄乱了这一事实有关。

回答by

this is how:

这是如何:

   /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

            CURLOPT_CUSTOMREQUEST  =>"GET",        //set request type post or get
            CURLOPT_POST           =>false,        //set to GET
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     =>"cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      =>"cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

例子

//Read a web page and check for errors:

$result = get_web_page( $url );

if ( $result['errno'] != 0 )
    ... error: bad url, timeout, redirect loop ...

if ( $result['http_code'] != 200 )
    ... error: no page, no permissions, no service ...

$page = $result['content'];

回答by 712011m4n

For a realistic approach that emulates the most human behavior, you may want to add a referer in your curl options. You may also want to add a follow_location to your curl options. Trust me, whoever said that cURLING Google results is impossible, is a complete dolt and should throw his/her computer against the wall in hopes of never returning to the internetz again. Everything that you can do "IRL" with your own browser can all be emulated using PHP cURL or libCURL in Python. You just need to do more cURLS to get buff. Then you will see what I mean. :)

对于模拟大多数人类行为的现实方法,您可能希望在 curl 选项中添加引用。您可能还想在 curl 选项中添加一个 follow_location。相信我,谁说 cURLING Google 结果是不可能的,是一个彻头彻尾的傻瓜,应该把他/她的电脑扔到墙上,希望永远不会再回到 internetz。您可以使用自己的浏览器执行“IRL”的所有操作都可以在 Python 中使用 PHP cURL 或 libCURL 进行模拟。你只需要做更多的卷曲来获得增益。然后你就会明白我的意思。:)

  $url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_VERBOSE, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
  curl_setopt($ch, CURLOPT_URL, urlencode($url));
  $response = curl_exec($ch);
  curl_close($ch);

回答by One Man Crew

Try This:

尝试这个:

$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_VERBOSE, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
  curl_setopt($ch, CURLOPT_URL, urlencode($url));
  $response = curl_exec($ch);
  curl_close($ch);

回答by George Vasiliou

I suppose that have you noticed that your link is actually an HTTPS link.... It seems that CURL parameters do not include any kind of SSH handling... maybe this could be your problem. Why don't you try with a non-HTTPS link to see what happens (i.e Google Custom Search Engine)...?

我想你有没有注意到你的链接实际上是一个 HTTPS 链接......似乎 CURL 参数不包括任何类型的 SSH 处理......也许这可能是你的问题。为什么不尝试使用非 HTTPS 链接来查看会发生什么(即 Google 自定义搜索引擎)...?

回答by user5746128

Get content with Curl php

使用 Curl php 获取内容

request server support Curl function, enable in httpd.conf in folder Apache

请求服务器支持 Curl 功能,在 Apache 文件夹中的 httpd.conf 中启用


function UrlOpener($url)
     global $output;
     $ch = curl_init(); 
     curl_setopt($ch, CURLOPT_URL, $url); 
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
     $output = curl_exec($ch); 
     curl_close($ch);    
     echo $output;

If get content by google cache use Curl you can use this url: http://webcache.googleusercontent.com/search?q=cache:Putyour url Sample: http://urlopener.mixaz.net/

如果通过谷歌缓存使用 Curl 获取内容,您可以使用以下网址:http: //webcache.googleusercontent.com/search?q =cache: Putyour url Sample: http://urlopener.mixaz.net/