在 PHP 中测试 404 的 URL 的简单方法?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/408405/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-24 22:38:46  来源:igfitidea点击:

Easy way to test a URL for 404 in PHP?

phphttpvalidationhttp-headershttp-status-code-404

提问by strager

I'm teaching myself some basic scraping and I've found that sometimes the URL's that I feed into my code return 404, which gums up all the rest of my code.

我正在自学一些基本的抓取,我发现有时输入到我的代码中的 URL 返回 404,这会混淆我的所有其余代码。

So I need a test at the top of the code to check if the URL returns 404 or not.

所以我需要在代码顶部进行测试,以检查 URL 是否返回 404。

This would seem like a pretty straightfoward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

这似乎是一项非常简单的任务,但 Google 没有给我任何答案。我担心我正在寻找错误的东西。

One blog recommended I use this:

一个博客推荐我使用这个:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid if empty or not.

然后测试以查看 $valid 是否为空。

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

但我认为给我带来问题的 URL 有一个重定向,所以 $valid 对于所有值都是空的。或者也许我做错了什么。

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

我还研究了“头部请求”,但我还没有找到任何可以玩或尝试的实际代码示例。

Suggestions? And what's this about curl?

建议?卷曲是怎么回事?

回答by strager

If you are using PHP's curlbindings, you can check the error code using curl_getinfoas such:

如果您使用的是 PHP 的curlbindings,则可以使用以下方法检查错误代码curl_getinfo

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */

回答by Asciant

If your running php5 you can use:

如果您正在运行 php5,您可以使用:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively with php4 a user has contributed the following:

或者使用 php4,用户贡献了以下内容:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate get_headers() function at PHP4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- don't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*$/",'',$var);
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would have a result similar to:

两者都会产生类似于:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

Therefore you could just check to see that the header response was OK eg:

因此,您只需检查标头响应是否正常,例如:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
//valid 
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
//moved or redirect page
}

W3C Codes and Definitions

W3C 代码和定义

回答by Aram Kocharyan

With strager's code, you can also check the CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404, rather they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check if an actual file (eg. robots.txt) existed on the server or not. Clearly this kind of file would not cause a redirect if it existed, but if it didn't it would redirect to a 404 page, which as I said before may not have a 404 code.

使用 strager 的代码,您还可以检查 CURLINFO_HTTP_CODE 以获取其他代码。有些网站不报告 404,而只是重定向到自定义 404 页面并返回 302(重定向)或类似内容。我用它来检查服务器上是否存在实际文件(例如 robots.txt)。很明显,这种文件如果存在就不会导致重定向,但如果不存在,它会重定向到 404 页面,正如我之前所说,它可能没有 404 代码。

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* If the document has loaded successfully without any redirection or error */
    if ($httpCode >= 200 && $httpCode < 300) {
        return false;
    } else {
        return true;
    }
}

回答by Beau Simensen

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setoptto skip downloading the whole page (you just want the headers).

正如 strager 建议的那样,考虑使用 cURL。您可能也有兴趣使用curl_setopt设置 CURLOPT_NOBODY以跳过下载整个页面(您只需要标题)。

回答by Nasaralla

If you are looking for an easiest solution and the one you can try in one go on php5 do

如果您正在寻找一种最简单的解决方案,并且可以在 php5 上一次性尝试

file_get_contents('www.yoursite.com');
//and check by echoing
echo $http_response_header[0];

回答by Ross

I found this answer here:

我在这里找到了这个答案:

if(($twitter_XML_raw=file_get_contents($timeline))==false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }

Essentially, you use the "file get contents" method to retrieve the URL, which automatically populates the http response header variable with the status code.

本质上,您使用“文件获取内容”方法来检索 URL,该方法会使用状态代码自动填充 http 响应标头变量。

回答by Email

addendum;tested those 3 methods considering performance.

附录;考虑到性能,测试了这 3 种方法。

The result, at least in my testing environment:

结果,至少在我的测试环境中:

Curl wins

卷曲获胜

This test is done under the consideration that only the headers (noBody) is needed. Test yourself:

这个测试是在只需要头文件(noBody)的情况下完成的。测试自己:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

回答by Juergen

This will give you true if url does not return 200 OK

如果 url 不返回 200 OK,这将为您提供 true

function check_404($url) {
   $headers=get_headers($url, 1);
   if ($headers[0]!='HTTP/1.1 200 OK') return true; else return false;
}

回答by Melbin Mathew Antony

<?php

$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);   
curl_setopt($ch, CURLOPT_NOBODY, true);    
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>

回答by Andreas

Here is a short solution.

这是一个简短的解决方案。

$handle = curl_init($uri);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle,CURLOPT_HTTPHEADER,array ("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 200||$httpCode == 303) 
{
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xmlto whatever you use.

在您的情况下,您可以更改application/rdf+xml为您使用的任何内容。