PHP validation/regex for URL

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/206059/

Date: 2020-08-24 21:56:04  Source: igfitidea

Tags: php, regex, url, validation

Asked by AndreLiem

I've been looking for a simple regex for URLs. Does anybody have one handy that works well? I didn't find one in the Zend Framework validation classes, and I have seen several implementations.

Accepted answer by Owen

I've used this on a few projects. I don't believe I've run into issues, but I'm sure it's not exhaustive:

$text = preg_replace(
  '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
  // $1 = full URL, $3 = URL without the scheme, $4 = the delimiter that ended the match
  '<a href="$1" target="_blank">$3</a>$4',
  $text
);

Most of the random junk at the end is to deal with situations like http://domain.com. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
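To illustrate how the trailing delimiter group keeps the sentence-ending period out of the link, here is a minimal sketch built on the same pattern. The function name `linkify` and the use of `preg_replace_callback()` with `htmlspecialchars()` are assumptions for illustration, not part of the answer; the surrounding text is assumed to be safe to output as-is.

```php
<?php
// Hypothetical helper: turn bare URLs in $text into anchors.
function linkify(string $text): string
{
    // Same pattern as in the answer: group 1 is the whole URL, group 3 the
    // part after the scheme, group 4 the delimiter that ended the URL.
    $pattern = '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Escape the URL before placing it inside HTML.
        $href  = htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8');
        $label = htmlspecialchars($m[3], ENT_QUOTES, 'UTF-8');
        // Put the delimiter back after the link so the sentence stays intact.
        return '<a href="' . $href . '" target="_blank">' . $label . '</a>' . $m[4];
    }, $text);
}

echo linkify('See http://domain.com. It works.'), "\n";
```

The period after domain.com ends up outside the anchor because it is consumed by the `\.\s` alternative of the last group rather than by the URL itself.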

Answer by Stanislav

Use the filter_var() function to validate whether a string is a URL or not:

var_dump(filter_var('example.com', FILTER_VALIDATE_URL));

It is bad practice to use regular expressions when not necessary.

EDIT: Be careful, this solution is neither Unicode-safe nor XSS-safe. If you need more complex validation, it may be better to look elsewhere.
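As a rough sketch of how the XSS caveat can be narrowed, one option is to combine filter_var() with an explicit scheme whitelist. The helper name `is_http_url` is hypothetical, and restricting to http/https is an assumption about what the caller wants; output should still be escaped with htmlspecialchars() when printed into HTML.

```php
<?php
// Hypothetical helper (not from the answer): accept only http/https URLs.
function is_http_url(string $candidate): bool
{
    // filter_var() returns the URL string on success and false on failure.
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // FILTER_VALIDATE_URL does not restrict the scheme, so check it explicitly.
    $scheme = parse_url($candidate, PHP_URL_SCHEME);

    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(is_http_url('http://example.com'));
var_dump(is_http_url('example.com'));            // no scheme
var_dump(is_http_url('javascript://alert(1)'));  // scheme not whitelisted
```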

Answer by catchdave

As per the PHP manual, parse_url() should not be used to validate a URL.

Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.

Both parse_url() and filter_var() will pass malformed URLs such as http://...

Therefore, in this case, regex is the better method.
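One way to apply a regex only where it is needed is to let parse_url() split the string and then check the host part with a pattern. The sketch below is an illustration, not the answerer's code: `has_plausible_host` is a hypothetical name, and the pattern deliberately accepts only dotted ASCII host names, so it rejects localhost, IP addresses, and internationalized domains.

```php
<?php
// Hypothetical helper: reject URLs whose host is not a dotted domain name.
function has_plausible_host(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, or the URL could not be parsed
    }

    // One or more labels (letters, digits, inner hyphens) followed by a TLD.
    $pattern = '/^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$/i';

    return preg_match($pattern, $host) === 1;
}

var_dump(has_plausible_host('http://...'));
var_dump(has_plausible_host('http://example.com/path'));
```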

Answer by Roger

Just in case you want to know whether the URL really exists:

function url_exist($url){//returns true if the URL exists
    $c=curl_init();
    curl_setopt($c,CURLOPT_URL,$url);
    curl_setopt($c,CURLOPT_HEADER,1);//get the header
    curl_setopt($c,CURLOPT_NOBODY,1);//and *only* get the header
    curl_setopt($c,CURLOPT_RETURNTRANSFER,1);//get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($c,CURLOPT_FRESH_CONNECT,1);//don't use a cached version of the url
    if(!curl_exec($c)){
        //echo $url.' inexists';
        return false;
    }else{
        //echo $url.' exists';
        return true;
    }
    //$httpcode=curl_getinfo($c,CURLINFO_HTTP_CODE);
    //return ($httpcode<400);
}
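The commented-out lines at the end of the function hint at a stricter variant that looks at the HTTP status code instead of only checking whether the request succeeded. A minimal sketch of that idea follows; the function names, the redirect-following option, and the 10-second timeout are assumptions, and the cURL extension is required.

```php
<?php
// Hypothetical helper: any status below 400 counts as "exists",
// mirroring the commented-out ($httpcode < 400) check.
function status_means_exists(int $httpCode): bool
{
    return $httpCode > 0 && $httpCode < 400;
}

function url_exists_by_status(string $url): bool
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_NOBODY, true);          // HEAD-style request
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);  // do not echo the response
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);  // assumption: follow redirects
    curl_setopt($c, CURLOPT_TIMEOUT, 10);           // assumption: 10 second limit

    // With NOBODY and no header output, a successful call returns an empty
    // string, so compare against false explicitly.
    if (curl_exec($c) === false) {
        return false;
    }

    return status_means_exists((int) curl_getinfo($c, CURLINFO_HTTP_CODE));
}
```

Some servers answer HEAD requests differently from GET requests, so a false result here is a hint rather than proof that the page is missing.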

Answer by abhiomkar

As per John Gruber (Daring Fireball):

Regex:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))

Using it in preg_match():

preg_match("#(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))#u", $url)

Here is the extended regex pattern (with comments):

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

For more details please look at: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
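This pattern is meant for finding URLs inside running text rather than validating a single string, so preg_match_all() is a natural fit. Below is a minimal sketch under a few assumptions: the function name `extract_urls` is hypothetical, `~` is used as the delimiter because the pattern itself contains slashes, and the `u` modifier is added so the non-ASCII punctuation in the final character class is treated as whole characters (the input must then be valid UTF-8).

```php
<?php
// Hypothetical helper: return every URL-like substring found in $text.
function extract_urls(string $text): array
{
    // Nowdoc keeps the pattern literal: no PHP string escaping applies.
    $pattern = <<<'REGEX'
~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))~u
REGEX;

    if (!preg_match_all($pattern, $text, $matches)) {
        return []; // no match, or a regex error
    }

    return $matches[1]; // capture group 1 is the entire matched URL
}

print_r(extract_urls('See http://example.com/path, or www.example.org.'));
```

The trailing comma and period are left out of the results because the last character of a match may not be one of the listed punctuation characters.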

Answer by promaty

I don't think that using regular expressions is a smart thing to do in this case. It is impossible to match all of the possibilities, and even if you did, there is still a chance that the URL simply doesn't exist.

Here is a very simple way to test whether the URL actually exists and is readable:

if (preg_match("#^https?://.+#", $link) and @fopen($link,"r")) echo "OK";

(Without the preg_match check, this would also validate all filenames on your server.)

Answer by Vikash Kumar

    function validateURL($URL) {
      // Only this fixed list of top-level domains is accepted.
      $tlds = "com|org|net|dk|at|us|tv|info|uk|co\.uk|biz|se";
      // scheme://host.tld with an optional port and an optional trailing slash
      $pattern_1 = "/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)*\.($tlds)(:(\d+))?\/?$/i";
      // www.host.tld without a scheme
      $pattern_2 = "/^(www)(\.[A-Z0-9][A-Z0-9_-]*)+\.($tlds)(:(\d+))?\/?$/i";
      return preg_match($pattern_1, $URL) === 1 || preg_match($pattern_2, $URL) === 1;
    }

Answer by Peter Bailey

I've used this one with good success; I don't remember where I got it from.

$pattern = "/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i";
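Since the pattern starts with `\b` and is not anchored, it reads as a way to pick URLs out of text. A short usage sketch, assuming a hypothetical wrapper named `find_urls` (the pattern itself is unchanged from the answer):

```php
<?php
// Hypothetical wrapper: list every URL the pattern finds in $text.
function find_urls(string $text): array
{
    $pattern = '/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';

    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }

    return $matches[0]; // full matches; the pattern has no capturing groups
}

print_r(find_urls('Visit www.example.com, or https://example.org/a?b=1.'));
```

The second character class omits punctuation such as `,` and `.`, so a match cannot end on the comma or period that closes a clause.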

Answer by George Milonas

And there is your answer =) Try to break it, you can't!!!

function link_validate_url($text) {
$LINK_DOMAINS = 'aero|arpa|asia|biz|com|cat|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|mobi|local';
  $LINK_ICHARS_DOMAIN = (string) html_entity_decode(implode("", array( // @TODO completing letters ...
    "&#x00E6;", // ?
    "&#x00C6;", // ?
    "&#x00C0;", // à
    "&#x00E0;", // à
    "&#x00C1;", // á
    "&#x00E1;", // á
    "&#x00C2;", // ?
    "&#x00E2;", // a
    "&#x00E5;", // ?
    "&#x00C5;", // ?
    "&#x00E4;", // ?
    "&#x00C4;", // ?
    "&#x00C7;", // ?
    "&#x00E7;", // ?
    "&#x00D0;", // D
    "&#x00F0;", // e
    "&#x00C8;", // è
    "&#x00E8;", // è
    "&#x00C9;", // é
    "&#x00E9;", // é
    "&#x00CA;", // ê
    "&#x00EA;", // ê
    "&#x00CB;", // ?
    "&#x00EB;", // ?
    "&#x00CE;", // ?
    "&#x00EE;", // ?
    "&#x00CF;", // ?
    "&#x00EF;", // ?
    "&#x00F8;", // ?
    "&#x00D8;", // ?
    "&#x00F6;", // ?
    "&#x00D6;", // ?
    "&#x00D4;", // ?
    "&#x00F4;", // ?
    "&#x00D5;", // ?
    "&#x00F5;", // ?
    "&#x0152;", // ?
    "&#x0153;", // ?
    "&#x00FC;", // ü
    "&#x00DC;", // ü
    "&#x00D9;", // ù
    "&#x00F9;", // ù
    "&#x00DB;", // ?
    "&#x00FB;", // ?
    "&#x0178;", // ?
    "&#x00FF;", // ? 
    "&#x00D1;", // ?
    "&#x00F1;", // ?
    "&#x00FE;", // t
    "&#x00DE;", // T
    "&#x00FD;", // y
    "&#x00DD;", // Y
    "&#x00BF;", // ?
  )), ENT_QUOTES, 'UTF-8');

  $LINK_ICHARS = $LINK_ICHARS_DOMAIN . (string) html_entity_decode(implode("", array(
    "&#x00DF;", // ?
  )), ENT_QUOTES, 'UTF-8');
  $allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal');

  // Starting a parenthesis group with (?: means that it is grouped, but is not captured
  $protocol = '((?:'. implode("|", $allowed_protocols) .'):\/\/)';
  $authentication = "(?:(?:(?:[\w\.\-\+!$&'\(\)*\+,;=" . $LINK_ICHARS . "]|%[0-9a-f]{2})+(?::(?:[\w". $LINK_ICHARS ."\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})*)?)?@)";
  $domain = '(?:(?:[a-z0-9' . $LINK_ICHARS_DOMAIN . ']([a-z0-9'. $LINK_ICHARS_DOMAIN . '\-_\[\]])*)(\.(([a-z0-9' . $LINK_ICHARS_DOMAIN . '\-_\[\]])+\.)*('. $LINK_DOMAINS .'|[a-z]{2}))?)';
  $ipv4 = '(?:[0-9]{1,3}(\.[0-9]{1,3}){3})';
  $ipv6 = '(?:[0-9a-fA-F]{1,4}(\:[0-9a-fA-F]{1,4}){7})';
  $port = '(?::([0-9]{1,5}))';

  // Pattern specific to external links.
  $external_pattern = '/^'. $protocol .'?'. $authentication .'?('. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $port .'?';

  // Pattern specific to internal links.
  $internal_pattern = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]]+)";
  $internal_pattern_file = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]\.]+)$/i";

  $directories = "(?:\/[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'#!():;*@\[\]]*)*";
  // Yes, four backslashes == a single backslash.
  $query = "(?:\/?\?([?a-z0-9". $LINK_ICHARS ."+_|\-\.~\/\\\\%=&,$'():;*@\[\]{} ]*))";
  $anchor = "(?:#[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'():;*@\[\]\/\?]*)";

  // The rest of the path for a standard URL.
  $end = $directories .'?'. $query .'?'. $anchor .'?'.'$/i';

  $message_id = '[^@].*@'. $domain;
  $newsgroup_name = '(?:[0-9a-z+-]*\.)*[0-9a-z+-]*';
  $news_pattern = '/^news:('. $newsgroup_name .'|'. $message_id .')$/i';

  $user = '[a-zA-Z0-9'. $LINK_ICHARS .'_\-\.\+\^!#$%&*+\/\=\?\`\|\{\}~\'\[\]]+';
  $email_pattern = '/^mailto:'. $user .'@'.'(?:'. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $query .'?$/';

  // Any pattern that matches means the input is a recognised link type.
  if (strpos($text, '<front>') === 0) {
    return true;
  }
  if (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
    return true;
  }
  if (in_array('news', $allowed_protocols) && preg_match($news_pattern, $text)) {
    return true;
  }
  if (preg_match($internal_pattern . $end, $text)) {
    return true;
  }
  if (preg_match($external_pattern . $end, $text)) {
    return true;
  }
  if (preg_match($internal_pattern_file, $text)) {
    return true;
  }

  // Nothing matched, so the input is not a valid link.
  return false;
}

Answer by jini

function is_valid_url ($url="") {

        if ($url=="") {
            $url=$this->url;
        }

        $url = @parse_url($url);

        if ( ! $url) {


            return false;
        }

        $url = array_map('trim', $url);
        $url['port'] = (!isset($url['port'])) ? 80 : (int)$url['port'];
        $path = (isset($url['path'])) ? $url['path'] : '';

        if ($path == '') {
            $path = '/';
        }

        $path .= ( isset ( $url['query'] ) ) ? "?$url[query]" : '';



        if ( isset ( $url['host'] ) AND $url['host'] != gethostbyname ( $url['host'] ) ) {
            if ( PHP_VERSION >= 5 ) {
                $headers = get_headers("$url[scheme]://$url[host]:$url[port]$path");
            }
            else {
                $fp = fsockopen($url['host'], $url['port'], $errno, $errstr, 30);

                if ( ! $fp ) {
                    return false;
                }
                fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $url[host]\r\n\r\n");
                $headers = fread ( $fp, 128 );
                fclose ( $fp );
            }
            $headers = ( is_array ( $headers ) ) ? implode ( "\n", $headers ) : $headers;
            return ( bool ) preg_match ( '#^HTTP/\S+\s+(200|301|302)\s#i', $headers );
        }

        return false;
    }
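The last step of the function decides validity from the first response line. That check can be isolated and tried without any network access; the sketch below assumes a hypothetical helper name `status_line_is_ok` and keeps the same accepted codes (200, 301, 302) as a default parameter.

```php
<?php
// Hypothetical helper: does an HTTP status line carry an accepted code?
function status_line_is_ok(string $statusLine, array $accepted = [200, 301, 302]): bool
{
    // Capture the three-digit code that follows the protocol version.
    if (preg_match('#^HTTP/\S+\s+(\d{3})(?:\s|$)#i', $statusLine, $m) !== 1) {
        return false;
    }

    return in_array((int) $m[1], $accepted, true);
}

var_dump(status_line_is_ok('HTTP/1.1 200 OK'));
var_dump(status_line_is_ok('HTTP/1.1 404 Not Found'));
```

Extracting the code as a number, instead of matching alternatives inside the pattern, makes it straightforward to change the accepted list.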