PHP Curl following redirects

Question

Asked by David

I'm trying to be a bit sneaky and, as part of a learning process, improve my page-scraping skills.

One thing I've come across that I have yet to solve is that certain sites use an internal link which then redirects to an external link.

What I want to do is modify some curl code to follow the redirects until they stop, and then obtain the final URL.

Can anyone recommend some code for me?

I have this at the moment, but it's not following the redirects properly.

        $opts = array(CURLOPT_URL => $url,
                      CURLOPT_RETURNTRANSFER => true,
                      CURLOPT_HEADER => true,
                      CURLOPT_FOLLOWLOCATION => true);      

        $curl = curl_init(); 
        curl_setopt_array($curl, $opts);  
        $str = curl_exec($curl);  
        curl_close($curl);

Answer 1

Answered by Manish Raj

http://php.net/manual/en/ref.curl.php

    function get_final_url( $url, $timeout = 5 )
    {
        // Undo the HTML-entity and URL encoding that scraped links often carry.
        $url = str_replace( "&amp;", "&", urldecode( trim( $url ) ) );

        $cookie = tempnam( "/tmp", "CURLCOOKIE" );
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt( $ch, CURLOPT_ENCODING, "" );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
        curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
        $content = curl_exec( $ch );
        $response = curl_getinfo( $ch );
        curl_close( $ch );

        // cURL stopped on a redirect status: follow the Location header by hand.
        if ( $response['http_code'] == 301 || $response['http_code'] == 302 )
        {
            ini_set( "user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
            $headers = get_headers( $response['url'] );

            foreach ( $headers as $value )
            {
                if ( substr( strtolower( $value ), 0, 9 ) == "location:" )
                    return get_final_url( trim( substr( $value, 9 ) ), $timeout );
            }
        }

        // Also follow simple JavaScript redirects found in the page body.
        if ( preg_match( "/window\.location\.replace\('(.*)'\)/i", $content, $value ) ||
             preg_match( "/window\.location\=\"(.*)\"/i", $content, $value ) )
        {
            return get_final_url( $value[1], $timeout );
        }
        else
        {
            // No further redirect detected: this is the final URL.
            return $response['url'];
        }
    }
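
When only HTTP redirects matter (no JavaScript redirects), the core of the function above can be reduced to letting cURL follow the chain and then asking it where it ended up. The following is a minimal sketch, not the answerer's code: the function name `final_url_after_redirects` and its default values are assumptions made for illustration. It relies on `CURLINFO_EFFECTIVE_URL`, which `curl_getinfo()` reports as the last URL cURL requested.

```php
<?php
// Hypothetical helper (name and defaults are illustrative assumptions).
// Returns the URL cURL ended on after following redirects, or null on failure.
function final_url_after_redirects(string $url, int $maxRedirects = 10, int $timeout = 5): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the response instead of printing it
        CURLOPT_NOBODY         => true,          // headers only; some servers answer HEAD differently from GET
        CURLOPT_FOLLOWLOCATION => true,          // follow Location headers automatically
        CURLOPT_MAXREDIRS      => $maxRedirects, // give up on redirect loops
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ]);

    $ok = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    // The handle is released when $ch goes out of scope.

    // curl_exec() returns false when the transfer failed (bad scheme, timeout, too many redirects).
    return ($ok === false || !$finalUrl) ? null : $finalUrl;
}
```

A call such as `final_url_after_redirects('http://example.com/internal-link')` (a placeholder URL) would return the last URL in the redirect chain, or `null` if the request failed. If a site misbehaves on HEAD requests, dropping `CURLOPT_NOBODY` makes the sketch issue a normal GET instead.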

Answer 2

Answered by Tchoupi

If you can't use CURLOPT_FOLLOWLOCATION, I suggest you use a recursive method like this one:

function getUrl($url, $count) {

    // max number of redirects
    if ($count > 5) {
        return false;
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $data = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    curl_close($ch);

    if (!$data) {
        return false;
    }

    $dataArray = explode("\r\n\r\n", $data, 2);

    if (count($dataArray) != 2) {
        return false;
    }

    list($header, $body) = $dataArray;
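    if ($httpCode == 301 || $httpCode == 302) {
        $matches = array();
        // Header names are case-insensitive, and Location may be the last header line.
        preg_match('/^Location:(.*?)$/mi', $header, $matches);

        if (isset($matches[1])) {
            return getUrl(trim($matches[1]), $count + 1);
        }

        // Redirect status without a Location header: nothing to follow.
        return false;
    } else {
        return $body;
    }
}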
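
The recursive function above passes the `Location` value straight back in as the next URL, which only works when the server sends an absolute URL. Servers may also send a relative one (for example `/next`). The sketch below shows one way to extract the header and resolve it against the URL that was just requested. Both helper names, `extract_location` and `resolve_redirect`, are hypothetical and not part of the answer, and the resolver deliberately skips `../` normalization to stay short.

```php
<?php
// Hypothetical helpers for manual redirect handling (illustrative names).

// Return the value of the first Location header in a raw header block, or null.
function extract_location(string $rawHeaders): ?string
{
    // "m" anchors ^ and $ per line; "i" makes the header name case-insensitive.
    if (preg_match('/^Location:[ \t]*(.*?)[ \t]*\r?$/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Turn a possibly relative Location value into an absolute URL.
// Simplified: does not normalize "../" or "./" path segments.
function resolve_redirect(string $baseUrl, string $location): string
{
    // Already absolute (has a scheme): use as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');

    // Scheme-relative: //host/path
    if (strpos($location, '//') === 0) {
        return $base['scheme'] . ':' . $location;
    }
    // Root-relative: /path
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }
    // Path-relative: replace everything after the last slash of the base path.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}
```

Inside `getUrl()`, the recursive call could then use `resolve_redirect($url, extract_location($header))` once the header has been found. Returning `$url` instead of `$body` at the end of the chain would yield the final URL the question asks for.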
