Easy way to test a URL for 404 in PHP?

Question

Asked by strager

I'm teaching myself some basic scraping, and I've found that sometimes the URLs I feed into my code return 404, which gums up the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see whether $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

Answer 1

Answered by strager

If you are using PHP's curl bindings, you can check the HTTP status code using curl_getinfo as such:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */

Answer 2

Answered by Asciant

If you're running PHP 5, you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP 4, a user has contributed the following:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate get_headers() function at PHP4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- don't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*$/",'',$var);
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would have a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

Therefore you could just check that the header response was OK, e.g.:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
//valid 
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
//moved or redirect page
}
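
Comparing the whole status line against 'HTTP/1.1 200 OK' can be brittle, since a server may answer with another protocol version or reason phrase. Below is a minimal sketch that extracts only the numeric code. The helper name final_status_code is hypothetical, it assumes PHP 7.1+ for the nullable return type, and it assumes the array layout produced by get_headers() or $http_response_header, where each followed redirect adds another status line.

```php
<?php
/**
 * Return the numeric status code of the last "HTTP/x.y NNN ..." line
 * found in a header array, or null if there is none.
 * (final_status_code is an illustrative name, not a PHP built-in.)
 */
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $value) {
        // With get_headers($url, 1) repeated headers are nested arrays; skip them.
        if (is_string($value) && preg_match('#^HTTP/\S+\s+(\d{3})#', $value, $m)) {
            $code = (int) $m[1]; // keep overwriting so the last status line wins
        }
    }
    return $code;
}

// Example with a hand-written header list (no network needed):
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://www.example.com/new',
    'HTTP/1.0 404 Not Found',
    'Content-Type: text/html',
];
var_dump(final_status_code($sample) === 404); // bool(true)
```

With a real URL this could be used as final_status_code(get_headers($url, 1)) === 404, bearing in mind that get_headers() returns false when the request fails outright, so that case would need a separate check.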

W3C Codes and Definitions

Answer 3

Answered by Aram Kocharyan

With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Such a file would not cause a redirect if it existed, but if it didn't, the request would be redirected to a 404 page, which, as I said, may not carry a 404 code.

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Read the HTTP status code of the response. */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* If the document has loaded successfully without any redirection or error */
    if ($httpCode >= 200 && $httpCode < 300) {
        return false;
    } else {
        return true;
    }
}

Answer 4

Answered by Beau Simensen

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).
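
A minimal sketch of that idea, assuming the curl extension is loaded and PHP 7.1+. The function name head_status and its timeout parameter are hypothetical. Following redirects is a choice made here so that the final status is reported; drop CURLOPT_FOLLOWLOCATION if you want to see the 301/302 itself.

```php
<?php
/**
 * Send a HEAD request and return the HTTP status code, or null when no
 * HTTP response was received (DNS failure, refused connection, timeout).
 * (head_status is an illustrative name, not part of the curl extension.)
 */
function head_status(string $url, int $timeoutSeconds = 10): ?int
{
    $handle = curl_init($url);
    if ($handle === false) {
        return null;
    }
    curl_setopt($handle, CURLOPT_NOBODY, true);         // HEAD request: headers only
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // do not echo anything
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // report the status after redirects
    curl_setopt($handle, CURLOPT_TIMEOUT, $timeoutSeconds);

    $ok   = curl_exec($handle);                          // false on transport failure
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);   // 0 if no response arrived

    // The handle is released when it goes out of scope at the end of the function.
    return ($ok === false || $code === 0) ? null : $code;
}
```

A caller could then write if (head_status($url) === 404) { /* skip this URL */ }. Some servers answer HEAD differently from GET (for example with 405), so a fallback to a normal GET may be needed in those cases.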

Answer 5

Answered by Nasaralla

If you are looking for the easiest solution, one you can try in one go on PHP 5, do:

file_get_contents('http://www.yoursite.com');
//and check by echoing
echo $http_response_header[0];

Answer 6

Answered by Ross

I found this answer here:

if(($twitter_XML_raw=file_get_contents($timeline))==false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }
}

Essentially, you use file_get_contents to retrieve the URL, which automatically populates the $http_response_header variable; its first element is the status line containing the status code.
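
By default file_get_contents returns false and raises a warning on a 4xx/5xx response. A minimal sketch of a variant that uses a stream context to send a HEAD request and read the status without treating error codes as failures. The function name stream_status and its timeout parameter are hypothetical, and the sketch assumes allow_url_fopen is enabled and PHP 7.1+.

```php
<?php
/**
 * Fetch only the response status of a URL through the http:// stream wrapper.
 * Returns the status code, or null if no HTTP response arrived.
 * (stream_status is an illustrative name, not a PHP built-in.)
 */
function stream_status(string $url, float $timeoutSeconds = 10.0): ?int
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'HEAD', // skip the body
            'ignore_errors' => true,   // do not treat 4xx/5xx as a failure
            'timeout'       => $timeoutSeconds,
        ],
    ]);

    // @ hides the warning raised when the host cannot be reached at all.
    @file_get_contents($url, false, $context);

    // $http_response_header is filled in the local scope by the HTTP wrapper.
    $code = null;
    foreach ($http_response_header ?? [] as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // last status line wins after redirects
        }
    }
    return $code;
}
```

For example, stream_status($url) === 404 would flag a missing page, while a null result means the request never produced an HTTP response.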

Answer 7

Answered by Email

Addendum: I tested those three methods for performance.

The result, at least in my testing environment:

Curl wins

This test assumes that only the headers are needed (CURLOPT_NOBODY). Test it yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

Answer 8

Answered by Juergen

This returns true if the URL does not return 200 OK:

function check_404($url) {
   $headers=get_headers($url, 1);
   if ($headers[0]!='HTTP/1.1 200 OK') return true; else return false;
}

Answer 9

Answered by Melbin Mathew Antony

<?php

$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);   
curl_setopt($ch, CURLOPT_NOBODY, true);    
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>

Answer 10

Answered by Andreas

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle,CURLOPT_HTTPHEADER,array ("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 200||$httpCode == 303) 
{
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever you use.
