Extract URLs from text in PHP
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute the content to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/910912/
Asked by ahmed
I have this text:
$string = "this is my friend's website http://example.com I think it is coll";
How can I extract the link into another variable?
I know it should be done with a regular expression, probably preg_match(), but I don't know how.
Answered by Nobu
Probably the safest way is to reuse code from WordPress. Download the latest version (currently 3.1.1) and see wp-includes/formatting.php. There is a function named make_clickable that takes plain text as its parameter and returns a formatted string. You can borrow its code for extracting URLs, though it is fairly complex.
This one-line regex might be helpful:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
But this regex still cannot reject some malformed URLs (e.g. http://google:ha.ckers.org).
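One way to tighten this up is to pass each regex match through PHP's built-in filter_var() with FILTER_VALIDATE_URL and keep only the candidates it accepts. The following is a minimal sketch under that assumption; the function name extract_valid_urls is hypothetical, and which malformed inputs get rejected depends on PHP's validator, so test it against your own data.

```php
<?php
// Hypothetical helper (not from the original answer): extract candidates with
// the regex above, then keep only those PHP's own validator accepts.
function extract_valid_urls(string $text): array
{
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
    preg_match_all($regex, $text, $match);

    // filter_var() returns false for strings it does not consider valid URLs.
    $valid = array_filter($match[0], function ($url) {
        return filter_var($url, FILTER_VALIDATE_URL) !== false;
    });

    return array_values($valid); // reindex from 0
}
```

Usage: extract_valid_urls($string) returns a plain list of the URLs that survived validation.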
Answered by Mikael Roos
I tried to do as Nobu said and use WordPress's code, but it has too many dependencies on other WordPress functions. Instead, I took Nobu's regular expression for preg_match_all() and turned it into a function using preg_replace_callback(), which replaces all links in a text with clickable links. It uses an anonymous function, so you will need PHP 5.3, or you can rewrite the code to use an ordinary function instead.
<?php
/**
 * Make clickable links from URLs in text.
 */
function make_clickable($text) {
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
    return preg_replace_callback($regex, function ($matches) {
        // Single quotes need no escaping inside a double-quoted string;
        // writing \' here would emit a literal backslash into the HTML.
        return "<a href='{$matches[0]}'>{$matches[0]}</a>";
    }, $text);
}
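For PHP versions without anonymous functions, the same idea can be written with an ordinary named function passed to preg_replace_callback() by name. This is a sketch only: the names make_clickable_link and make_clickable_named are hypothetical, and the htmlspecialchars() call is an addition that is not part of the original answer.

```php
<?php
// Named callback used instead of a closure (works on PHP versions before 5.3).
function make_clickable_link($matches)
{
    // Escaping is an addition in this sketch, not part of the original answer.
    $url = htmlspecialchars($matches[0], ENT_QUOTES, 'UTF-8');
    return '<a href="' . $url . '">' . $url . '</a>';
}

function make_clickable_named($text)
{
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
    // A function name given as a string is a valid callback.
    return preg_replace_callback($regex, 'make_clickable_link', $text);
}
```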
Answered by soulmerge
URLs have quite a complex definition, so you must first decide what you want to capture. A simple example that captures anything starting with http:// or https:// could be:
preg_match_all('!https?://\S+!', $string, $matches);
$all_urls = $matches[0];
Note that this is very basic and could capture invalid URLs. I would recommend reading up on POSIX and PHP regular expressions for more complex things.
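One concrete limitation of \S+ is that it also swallows punctuation that directly follows a URL in running text, such as a trailing comma or full stop. A minimal sketch of one possible workaround, assuming you are happy to strip a fixed set of trailing characters (the function name extract_urls_trimmed and the chosen character set are illustrative, not from the answer):

```php
<?php
// Hypothetical helper: same basic regex, then trim common trailing punctuation.
function extract_urls_trimmed(string $text): array
{
    preg_match_all('!https?://\S+!', $text, $matches);

    return array_map(function ($url) {
        // Assumption: these characters are never a meaningful URL ending here.
        return rtrim($url, '.,;:!?)');
    }, $matches[0]);
}
```

This trades correctness for simplicity: a URL that legitimately ends in one of those characters would be shortened.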
Answered by Michael Borgwardt
If the text you extract the URLs from is user-submitted and you are going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities. The most prominent risk is "javascript:" protocol URLs, but malformed URLs might also trick your regexp and/or the displaying browser into executing them as JavaScript URLs. At the very least, you should accept only URLs that start with "http", "https" or "ftp".
There's also a blog entry by Jeff where he describes some other problems with extracting URLs.
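The scheme allow-list described in this answer could be sketched roughly as follows. The function name safe_link is hypothetical, and HTML-escaping the URL with htmlspecialchars() is an extra precaution assumed here rather than something the answer spells out. It is a starting point, not a complete XSS defence.

```php
<?php
// Hypothetical helper: build a link only for allow-listed schemes.
function safe_link(string $url): ?string
{
    // parse_url() yields the scheme, or null/false if there is none or the URL is malformed.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https', 'ftp'], true)) {
        return null; // rejects javascript:, data:, and anything else not listed
    }

    // Escape before placing the URL into an attribute and into element text.
    $escaped = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $escaped . '">' . $escaped . '</a>';
}
```

Callers should treat a null result as "do not render this as a link".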
Answered by Kai Noack
The code that worked for me (especially if you have several links in your $string) is:
$string = "this is my friend's website https://www.example.com I think it is cool, but this one is cooler https://www.stackoverflow.com :)";
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $string, $matches);
$urls = $matches[0];
// go over all links
foreach($urls as $url)
{
echo $url.'<br />';
}
Hope that helps others as well.
Answered by runfalk
preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);
This is an easy way that works in a lot of cases, but not all. All the matches are put in $matches. Note that this does not cover links in anchor elements (<a href=""...), but that wasn't in your example either.
Answered by Shankar Damodaran
You could do it like this:
<?php
$string = "this is my friend's website http://example.com I think it is coll";
echo explode(' ',strstr($string,'http://'))[0]; //"prints" http://example.com
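The one-liner above assumes the text contains "http://"; strstr() returns false when it does not, so there is nothing useful to print in that case. A small sketch with that case handled explicitly (the function name first_http_url is hypothetical, and like the original it only looks for "http://" and splits on a single space):

```php
<?php
// Hypothetical helper: first "http://" URL in the text, or null if none.
function first_http_url(string $text): ?string
{
    $tail = strstr($text, 'http://'); // text from the first "http://" onward
    if ($tail === false) {
        return null; // no URL present
    }
    return explode(' ', $tail)[0]; // cut at the first space
}
```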
Answered by Shankar Damodaran
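// Extract the href value of every <a> element in $var (an HTML string).
preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
    "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
    $var, $matches); // pass $matches normally; call-time pass-by-reference (&$matches) is a fatal error since PHP 5.4
$hrefs = $matches[1];
foreach($hrefs as $href)
{
    print($href."<br>");
}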
Answered by HTML5 developer
You could try this to find each link and wrap it in an anchor tag (adding the href).
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want to filter goes here. http://example.com";
if(preg_match($reg_exUrl, $text, $url)) {
echo preg_replace($reg_exUrl, '<a href="$0">$0</a> ', $text); // $0 is the full match, so each URL links to itself
} else {
echo "No url in the text";
}
See: http://php.net/manual/en/function.preg-match.php
Answered by vstelmakh
There are a lot of edge cases with URLs. For example, a URL may contain brackets, or may omit the protocol. That's why a regex alone is not enough.
I created a PHP library that could deal with lots of edge cases: Url highlight.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$urlHighlight->getUrls("this is my friend's website http://example.com I think it is coll");
// return: ['http://example.com']
For more details, see the readme. For the URL cases covered, see the tests.

