PHP Curl following redirects
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/10288130/
Asked by David
I'm trying to be a bit sneaky and, as part of a learning process, improve my page scraping skills.
One thing I've come across that I have yet to solve is that certain sites use an internal link which then redirects to an external link.
What I want to do is modify some curl code to follow the redirects until they stop and then obtain the final resting place URL.
Can anyone recommend some code for me?
This is what I have at the moment, but it isn't following the redirects properly.
$opts = array(CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => true,
CURLOPT_FOLLOWLOCATION => true);
$curl = curl_init();
curl_setopt_array($curl, $opts);
$str = curl_exec($curl);
curl_close($curl);
Answered by Manish Raj
http://php.net/manual/en/ref.curl.php
function get_final_url( $url, $timeout = 5 )
{
    // Undo HTML-entity escaping of "&" in URLs copied out of page source.
    $url = str_replace( "&amp;", "&", urldecode(trim($url)) );
    $cookie = tempnam ("/tmp", "CURLCOOKIE");
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
    $content = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close ( $ch );

    // cURL stopped on a redirect status (for example, the redirect limit was
    // reached): read the Location header ourselves and recurse.
    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
        $headers = get_headers($response['url']);
        if ( $headers !== false )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_final_url( trim( substr( $value, 9 ) ), $timeout );
            }
        }
    }

    // Follow simple JavaScript redirects embedded in the page body.
    if ( preg_match("/window\.location\.replace\('(.*?)'\)/i", $content, $value) ||
         preg_match("/window\.location\=\"(.*?)\"/i", $content, $value)
    )
    {
        return get_final_url ( $value[1], $timeout );
    }
    else
    {
        return $response['url'];
    }
}
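When `CURLOPT_FOLLOWLOCATION` can be enabled, cURL already records where it ended up, so the final URL can be read back with `curl_getinfo()` and `CURLINFO_EFFECTIVE_URL`. The following is only a minimal sketch of that idea, not the answer author's code: the helper name `effective_url` and its parameters are assumptions made for illustration, and it does not handle the JavaScript redirects that the function above looks for.

```php
<?php
// Hypothetical helper (name and defaults are illustrative assumptions).
// Lets cURL follow HTTP redirects and returns the last URL it reached,
// or null if the transfer failed.
function effective_url(string $url, int $timeout = 5, int $maxRedirects = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // keep the body out of the output
        CURLOPT_FOLLOWLOCATION => true,          // follow Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // guard against redirect loops
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ]);

    $ok = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    // curl_exec() returns false on failure (bad scheme, timeout, too many redirects).
    if ($ok === false || !is_string($finalUrl) || $finalUrl === '') {
        return null;
    }
    return $finalUrl;
}
```

A call would look like `$final = effective_url($url);`, with `null` signalling that the chain could not be followed to the end.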
Answered by Tchoupi
If you can't use CURLOPT_FOLLOWLOCATION, I suggest you use a recursive method like this one:
function getUrl($url, $count = 0) {
    // max number of redirects
    if ($count > 5) {
        return false;
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if (!$data) {
        return false;
    }
    // split the raw response into the header block and the body
    $dataArray = explode("\r\n\r\n", $data, 2);
    if (count($dataArray) != 2) {
        return false;
    }
    list($header, $body) = $dataArray;
    if ($httpCode == 301 || $httpCode == 302) {
        $matches = array();
        // header names are case-insensitive, and Location may be the last header line
        preg_match('/^Location:(.*?)$/mi', $header, $matches);
        if (isset($matches[1])) {
            return getUrl(trim($matches[1]), $count + 1);
        }
        // redirect status without a Location header
        return false;
    } else {
        // not a redirect: return the body of the final page
        return $body;
    }
}
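One thing a manual approach like this has to deal with is that a `Location` header may hold a relative reference (for example `/login` or `next.php`) rather than a full URL. The sketch below shows one way to pull the header out of a raw header block and turn it into an absolute URL. The function names `extract_location` and `resolve_location` are hypothetical and do not come from the answer. The resolution is deliberately simplified: it assumes an `http`/`https` base URL and does not collapse `../` segments or handle query-only references.

```php
<?php
// Hypothetical helpers for illustration only.

// Return the value of the first Location header in a raw header block, or null.
function extract_location(string $rawHeaders): ?string
{
    // "m" makes ^ and $ match per line; "i" because header names are case-insensitive.
    if (preg_match('/^Location:[ \t]*(.*?)[ \t]*\r?$/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Turn a Location value into an absolute URL, relative to the URL that was requested.
// Simplified: no "../" normalisation, no query-only or fragment-only references.
function resolve_location(string $baseUrl, string $location): string
{
    // Already absolute (scheme://...): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $base      = parse_url($baseUrl);
    $scheme    = $base['scheme'] ?? 'http';
    $authority = ($base['host'] ?? '') . (isset($base['port']) ? ':' . $base['port'] : '');

    // Scheme-relative: //host/path
    if (strpos($location, '//') === 0) {
        return $scheme . ':' . $location;
    }
    // Root-relative: /path
    if (strpos($location, '/') === 0) {
        return $scheme . '://' . $authority . $location;
    }
    // Path-relative: replace everything after the last "/" of the base path.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, (int) strrpos($path, '/') + 1);
    return $scheme . '://' . $authority . $dir . $location;
}
```

In a recursive function such as `getUrl()`, the next request would then go to `resolve_location($url, $location)` instead of the raw header value.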

