Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1188129/

Time: 2020-08-25 01:25:49  Source: igfitidea

Replace URLs in text with HTML links

php regex url preg-replace linkify

Asked by Angel.King.47

Here is a design thought: for example, if I put a link such as

http://example.com

in a textarea, how do I get PHP to detect that it's an http:// link and then print it as

print "<a href='http://www.example.com'>http://www.example.com</a>";

I remember doing something like this before; however, it was not foolproof and kept breaking on complex links.

Another good idea: if you have a link such as

http://example.com/test.php?val1=bla&val2blablabla%20bla%20bla.bl

fix it so that it prints

print "<a href='http://example.com/test.php?val1=bla&val2=bla%20bla%20bla.bla'>";
print "http://example.com/test.php";
print "</a>";

This one is just an afterthought... StackOverflow could probably use this as well :D

Any ideas?

Answered by Søren Løvborg

Let's look at the requirements. You have some user-supplied plain text, which you want to display with hyperlinked URLs.

  1. The "http://" protocol prefix should be optional.
  2. Both domains and IP addresses should be accepted.
  3. Any valid top-level domain should be accepted, e.g. .aero and .xn--jxalpdlp.
  4. Port numbers should be allowed.
  5. URLs must be allowed in normal sentence contexts. For instance, in "Visit stackoverflow.com.", the final period is not part of the URL.
  6. You probably want to allow "https://" URLs too, and perhaps other schemes as well.
  7. As always when displaying user-supplied text in HTML, you want to prevent cross-site scripting (XSS). Also, you'll want ampersands in URLs to be correctly escaped as &amp;.
  8. You probably don't need support for IPv6 addresses.
  9. Edit: As noted in the comments, support for email addresses is definitely a plus.
  10. Edit: Only plain text input is to be supported – HTML tags in the input should not be honoured. (The Bitbucket version supports HTML input.)
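
Requirement 7 is the one most often missed, so here is a minimal sketch of just the escaping step, assuming the URL has already been detected. The helper name render_link is hypothetical and not part of the answer's code.

```php
<?php
// Hypothetical helper (an assumption for illustration): builds an anchor
// tag for an already-detected URL, escaping it for use in HTML.
function render_link(string $url): string
{
    // ENT_QUOTES escapes both double and single quotes, so the value is
    // safe inside either style of attribute quoting; & becomes &amp;.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">' . $safe . '</a>';
}

echo render_link('http://example.com/test.php?val1=bla&val2=bla'), "\n";
```

Escaping only protects the markup. It does not decide which schemes are acceptable, so the detection pattern should still be limited to the protocols you want to allow.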

Edit: Check out GitHub for the latest version, with support for email addresses, authenticated URLs, URLs in quotes and parentheses, HTML input, as well as an updated TLD list.

Here's my take:

<?php
$text = <<<EOD
Here are some URLs:
stackoverflow.com/questions/1188129/pregreplace-to-detect-html-php
Here's the answer: http://www.google.com/search?rls=en&q=42&ie=utf-8&oe=utf-8&hl=en. What was the question?
A quick look at http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax is helpful.
There is no place like 127.0.0.1! Except maybe http://news.bbc.co.uk/1/hi/england/surrey/8168892.stm?
Ports: 192.168.0.1:8080, https://example.net:1234/.
Beware of Greeks bringing internationalized top-level domains: xn--hxajbheg2az3al.xn--jxalpdlp.
And remember.Nobody is perfect.

<script>alert('Remember kids: Say no to XSS-attacks! Always HTML escape untrusted input!');</script>
EOD;

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

// Solution 1:

function callback($match)
{
    // Prepend http:// if no protocol specified
    $completeUrl = $match[1] ? $match[0] : "http://{$match[0]}";

    return '<a href="' . $completeUrl . '">'
        . $match[2] . $match[3] . $match[4] . '</a>';
}

print "<pre>";
print preg_replace_callback("&\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))&",
    'callback', htmlspecialchars($text));
print "</pre>";

  • To properly escape < and & characters, I throw the whole text through htmlspecialchars before processing. This is not ideal, as the HTML escaping can cause misdetection of URL boundaries.
  • As demonstrated by the "And remember.Nobody is perfect." line (in which remember.Nobody is treated as a URL, because of the missing space), further checking on valid top-level domains might be in order.

Edit: The following code fixes the above two problems, but is quite a bit more verbose since I'm more or less re-implementing preg_replace_callback using preg_match.

// Solution 2:

$validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

$position = 0;
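// $match is passed by reference by preg_match itself; call-time &$match is a fatal error since PHP 5.4.
while (preg_match("{\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))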
{
    list($url, $urlPosition) = $match[0];

    // Print the text leading up to the URL.
    print(htmlspecialchars(substr($text, $position, $urlPosition - $position)));

    $domain = $match[2][0];
    $port   = $match[3][0];
    $path   = $match[4][0];

    // Check if the TLD is valid - or that $domain is an IP address.
    $tld = strtolower(strrchr($domain, '.'));
    if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
    {
        // Prepend http:// if no protocol specified
        $completeUrl = $match[1][0] ? $url : "http://$url";

        // Print the hyperlink.
        printf('<a href="%s">%s</a>', htmlspecialchars($completeUrl), htmlspecialchars("$domain$port$path"));
    }
    else
    {
        // Not a valid URL.
        print(htmlspecialchars($url));
    }

    // Continue text parsing from after the URL.
    $position = $urlPosition + strlen($url);
}

// Print the remainder of the text.
print(htmlspecialchars(substr($text, $position)));
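
The same idea as Solution 2 (detect URLs in the raw text first, escape each piece afterwards) can also be sketched with preg_split instead of a manual preg_match loop. This is only an illustration under simplifying assumptions: the function name linkify is made up, the pattern handles http:// and https:// URLs only, and there is no TLD check.

```php
<?php
// Sketch (assumed names and pattern): escape plain text and turn
// http(s) URLs into links, returning the HTML instead of printing it.
function linkify(string $text): string
{
    // The last character of a URL may not be sentence punctuation, so
    // "http://example.com." leaves the final period outside the link.
    $pattern = '~(\bhttps?://[^\s<>"]*[^\s<>".,;:!?)])~i';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates between plain
    // text (even indexes) and matched URLs (odd indexes).
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $i => $part) {
        // Every piece is escaped exactly once, after detection.
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $html .= ($i % 2 === 1)
            ? '<a href="' . $escaped . '">' . $escaped . '</a>'
            : $escaped;
    }
    return $html;
}

echo linkify('See http://example.com/?a=1&b=2. <b>'), "\n";
```

Because escaping happens per piece, an entity such as &amp; can never be mistaken for part of a URL boundary, which is the first problem noted above.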

Answered by Angel.King.47

Here is something I found that is tried and tested:

function make_links_blank($text)
{
  return  preg_replace(
     array(
       '/(?(?=<a[^>]*>.+<\/a>)
             (?:<a[^>]*>.+<\/a>)
             |
             ([^="\']?)((?:https?|ftp|bf2|):\/\/[^<> \n\r]+)
         )/iex',
       '/<a([^>]*)target="?[^"\']+"?/i',
       '/<a([^>]+)>/i',
       '/(^|\s)(www.[^<> \n\r]+)/iex',
       '/(([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)
       (\.[A-Za-z0-9-]+)*)/iex'
       ),
     array(
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"\\2\">\\2</a>\\3':'\\0'))",
       '<a\1',
       '<a\1 target="_blank">',
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\">\\2</a>\\3':'\\0'))",
       "stripslashes((strlen('\\2')>0?'<a href=\"mailto:\\0\">\\0</a>':'\\0'))"
       ),
       $text
   );
}

It works for me, and it handles both emails and URLs. Sorry to answer my own question. :(

But this one is the only one that works.

Here is the link where I found it: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_21878567.html

Sorry in advance for it being from Experts Exchange.

Answered by Raheel Hasan

You guys are talking about far too advanced and complex stuff, which is good for some situations, but mostly we need a simple, careless solution. How about simply this?

preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '<a href="$1" target="_blank">$1</a> ', $text_msg);

Just try it and let me know what crazy URL it doesn't handle.
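
Taking up that invitation, here is a small sketch of two inputs the one-liner handles poorly. It assumes the replacement string was meant to reference the first capture group as $1; the sample inputs are made up for illustration.

```php
<?php
// The pattern from the answer above; $1 in the replacement is assumed.
$pattern = '/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims';
$replace = '<a href="$1" target="_blank">$1</a> ';

// 1. Trailing punctuation: \S also matches the sentence-ending period,
//    so the period becomes part of the link.
$out1 = preg_replace($pattern, $replace, 'Visit http://example.com.');
echo $out1, "\n";

// 2. Quotes are non-whitespace too, so unescaped input can break out of
//    the href attribute and inject its own attributes.
$out2 = preg_replace($pattern, $replace, 'x http://example.com/"onmouseover="alert(1)');
echo $out2, "\n";
```

So the pattern is fine for trusted text, but untrusted input should be escaped with htmlspecialchars first, and trailing punctuation needs separate handling.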

Answered by Dharmendra Jadon

Here is the code, using regular expressions in a function:

<?php
//Function definitions
function MakeUrls($str)
{
$find=array('`((?:https?|ftp)://\S+[[:alnum:]]/?)`si','`((?<!//)(www\.\S+[[:alnum:]]/?))`si');

$replace=array('<a href="$1" target="_blank">$1</a>', '<a href="http://$1" target="_blank">$1</a>');

return preg_replace($find,$replace,$str);
}
//Function testing
$str="www.cloudlibz.com";
$str=MakeUrls($str);
echo $str;
?>

Answered by Armand

I've been using this function; it works for me:

function AutoLinkUrls($str,$popup = FALSE){
    if (preg_match_all("#(^|\s|\()((http(s?)://)|(www\.))(\w+[^\s\)\<]+)#i", $str, $matches)){
        $pop = ($popup == TRUE) ? " target=\"_blank\" " : "";
        for ($i = 0; $i < count($matches['0']); $i++){
            $period = '';
            if (preg_match("|\.$|", $matches['6'][$i])){
                $period = '.';
                $matches['6'][$i] = substr($matches['6'][$i], 0, -1);
            }
            $str = str_replace($matches['0'][$i],
                    $matches['1'][$i].'<a href="http'.
                    $matches['4'][$i].'://'.
                    $matches['5'][$i].
                    $matches['6'][$i].'"'.$pop.'>http'.
                    $matches['4'][$i].'://'.
                    $matches['5'][$i].
                    $matches['6'][$i].'</a>'.
                    $period, $str);
        }//end for
    }//end if
    return $str;
}//end AutoLinkUrls

All credit goes to http://snipplr.com/view/68586/

Enjoy!

Answered by Svetoslav Marinov

As I mentioned in one of the comments above, my VPS, which is running PHP 7, started emitting this warning: "Warning: preg_replace(): The /e modifier is no longer supported, use preg_replace_callback instead". The buffer after the replacement was empty/false.

I have rewritten the code and made some improvements. If you think that you should be in the author section, feel free to edit the comment above the make_links_blank function. I am intentionally not using the closing PHP tag ?> to avoid inserting whitespace in the output.

<?php

class App_Updater_String_Util {
    public static function get_default_link_attribs( $regex_matches = [] ) {
        $t = ' target="_blank" ';
        return $t;
    }

    /**
     * App_Updater_String_Util::set_protocol();
     * @param string $link
     * @return string
     */
    public static function set_protocol( $link ) {
        if ( ! preg_match( '#^https?#si', $link ) ) {
            $link = 'http://' . $link;
        }
        return $link;
    }

/**
     * Goes through text and makes whatever text that look like a link an html link
     * which opens in a new tab/window (by adding target attribute).
     * 
     * Usage: App_Updater_String_Util::make_links_blank( $text );
     * 
     * @param str $text
     * @return str
     * @see http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links
     * @author Angel.King.47 | http://dashee.co.uk
     * @author Svetoslav Marinov (Slavi) | http://orbisius.com
     */
    public static function make_links_blank( $text ) {
        $patterns = [
            '#(?(?=<a[^>]*>.+?<\/a>)
                 (?:<a[^>]*>.+<\/a>)
                 |
                 ([^="\']?)((?:https?|ftp):\/\/[^<> \n\r]+)
             )#six' => function ( $matches ) {
                $r1 = empty( $matches[1] ) ? '' : $matches[1];
                $r2 = empty( $matches[2] ) ? '' : $matches[2];
                $r3 = empty( $matches[3] ) ? '' : $matches[3];

                $r2 = empty( $r2 ) ? '' : App_Updater_String_Util::set_protocol( $r2 );
                $res = ! empty( $r2 ) ? "$r1<a href=\"$r2\">$r2</a>$r3" : $matches[0];
                $res = stripslashes( $res );

                return $res;
             },

            '#(^|\s)((?:https?://|www\.|https?://www\.)[^<>\ \n\r]+)#six' => function ( $matches ) {
                $r1 = empty( $matches[1] ) ? '' : $matches[1];
                $r2 = empty( $matches[2] ) ? '' : $matches[2];
                $r3 = empty( $matches[3] ) ? '' : $matches[3];

                $r2 = ! empty( $r2 ) ? App_Updater_String_Util::set_protocol( $r2 ) : '';
                $res = ! empty( $r2 ) ? "$r1<a href=\"$r2\">$r2</a>$r3" : $matches[0];
                $res = stripslashes( $res );

                return $res;
            },

            // Remove any target attribs (if any)
            '#<a([^>]*)target="?[^"\']+"?#si' => '<a\1',

            // Put the target attrib
            '#<a([^>]+)>#si' => '<a\1 target="_blank">',

            // Make emails clickable Mailto links
            '/(([\w\-]+)(\.[\w\-]+)*@([\w\-]+)
                (\.[\w\-]+)*)/six' => function ( $matches ) {

                $r = $matches[0];
                $res = ! empty( $r ) ? "<a href=\"mailto:$r\">$r</a>" : $r;
                $res = stripslashes( $res );

                return $res;
            },
        ];

        foreach ( $patterns as $regex => $callback_or_replace ) {
            if ( is_callable( $callback_or_replace ) ) {
                $text = preg_replace_callback( $regex, $callback_or_replace, $text );
            } else {
                $text = preg_replace( $regex, $callback_or_replace, $text );
            }
        }

        return $text;
    }
}
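
For readers hitting the same warning elsewhere, the general shape of the migration is sketched below. The pattern and sample text are made up for illustration; only the move from an evaluated replacement string to a closure mirrors the rewrite above.

```php
<?php
// Before (worked only on PHP < 7): the replacement string was evaluated
// as PHP code for every match.
//   preg_replace('/(\w+)@/e', "strtoupper('\\1') . '@'", $text);

// After: the same logic as a closure passed to preg_replace_callback.
$text = 'contact bob@example.com today';
$result = preg_replace_callback(
    '/(\w+)@/',
    function (array $m): string {
        // $m[0] is the whole match, $m[1] the first capture group.
        return strtoupper($m[1]) . '@';
    },
    $text
);
echo $result, "\n";
```

The closure receives the match array directly, so the stripslashes calls that the /e versions needed for quoting are no longer required for this step.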

Answered by Stephen Fuhry

This should get you email addresses:

$string = "bah bah [email protected] foo";
$match = preg_match('/[^\x00-\x20()<>@,;:\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\".[\]\x7f-\xff]+)*\@[^\x00-\x20()<>@,;:\".[\]\x7f-\xff]+(?:\.[^\x00-\x20()<>@,;:\".[\]\x7f-\xff]+)+/', $string, $array);
print_r($array);

// outputs:
Array
(
    [0] => [email protected]
)
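
To go from matching an address to linking it, one possible sketch is shown below. It does not use the pattern above: it assumes a deliberately loose candidate pattern and leaves the real validation to PHP's filter_var. The function name link_emails and the sample address are hypothetical.

```php
<?php
// Sketch (assumed names): escape plain text and wrap valid e-mail
// addresses in mailto: links.
function link_emails(string $text): string
{
    // Loose candidate pattern; filter_var() below does the strict check.
    $pattern = '/([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)/';

    // Odd indexes hold the candidates, even indexes the text in between.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $isEmail = $i % 2 === 1
            && filter_var($part, FILTER_VALIDATE_EMAIL) !== false;
        $html .= $isEmail
            ? '<a href="mailto:' . $escaped . '">' . $escaped . '</a>'
            : $escaped;
    }
    return $html;
}

echo link_emails('bah bah someone@example.com foo.'), "\n";
```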

Answered by fresskoma

This regex should match any link except for those with the new top-level domains of 3+ characters...

{
  \b
  # Match the leading part (proto://hostname, or just hostname)
  (
    # http://, or https:// leading part
    (https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | [a-z][a-z]\.[a-z][a-z]\b # two-letter country code
    )
  )

  # Allow an optional port number
  ( : \d+ )?

  # The rest of the URL is optional, and begins with /
  (
    /
    # The rest are heuristics for what seems to work well
    [^.!,?;"\'()\[\]\{\}\s\x7F-\xFF]*
    (
      [.!,?]+ [^.!,?;"\'()\[\]\{\}\s\x7F-\xFF]+
    )*
  )?
}ix

It's not written by me, and I'm not quite sure where I got it from. Sorry that I can't give credit...

Answered by lepe

I know this answer has been accepted and that this question is quite old, but it can be useful for other people looking for other implementations.

This is a modified version of the code posted by Angel.King.47 on July 27, 2009:

$text = preg_replace(
 array(
   '/(^|\s|>)(www.[^<> \n\r]+)/iex',
   '/(^|\s|>)([_A-Za-z0-9-]+(\.[A-Za-z]{2,3})?\.[A-Za-z]{2,4}\/[^<> \n\r]+)/iex',
   '/(?(?=<a[^>]*>.+<\/a>)(?:<a[^>]*>.+<\/a>)|([^="\']?)((?:https?):\/\/([^<> \n\r]+)))/iex'
 ),  
 array(
   "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\3':'\\0'))",
   "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\4':'\\0'))",
   "stripslashes((strlen('\\2')>0?'\\1<a href=\"\\2\" target=\"_blank\">\\3</a>&nbsp;':'\\0'))",
 ),  
 $text
);

Changes:

  • I removed rules #2 and #3 (I'm not sure in which situations they are useful).
  • Removed email parsing as I really don't need it.
  • I added one more rule which allows the recognition of URLs in the form: [domain]/* (without www). For example: "example.com/faq/" (Multiple tld: domain.{2-3}.{2-4}/)
  • When parsing strings starting with "http://", it removes it from the link label.
  • Added "target='_blank'" to all links.
  • URLs can be specified just after any(?) tag. For example: <b>www.example.com</b>
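
The "remove http:// from the link label" change can be sketched on its own as follows. This is not the /e-based code above, just a small illustration of the idea with escaping added; the helper name link_with_short_label is hypothetical.

```php
<?php
// Hypothetical helper: the full URL goes into href, while the visible
// label drops a leading http:// or https://.
function link_with_short_label(string $url): string
{
    $label = preg_replace('~^https?://~i', '', $url);
    return sprintf(
        '<a href="%s" target="_blank">%s</a>',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($label, ENT_QUOTES, 'UTF-8')
    );
}

echo link_with_short_label('http://example.com/faq/'), "\n";
```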

As "Søren Løvborg" has stated, this function does not escape the URLs. I tried his/her class but it just didn't work as I expected (if you don't trust your users, then try his/her code first).

Answered by amarjit singh

This class changes the URLs into text while keeping the home URL as it is. I hope this will help and save you time. Enjoy.

class RegClass 
{ 

     function preg_callback_url($matches) 
     { 
        //var_dump($matches); 
        //Get the matched URL  text <a>text</a>
        $text = $matches[2];
        //Get the matched URL link <a href ="http://www.test.com">text</a>
        $url = $matches[1];

        if($url=='href ="http://www.test.com"'){
         //replace all a tag as it is
         return '<a href='.$url.' rel="nofollow"> '.$text.' </a>'; 

         }else{
         //replace all a tag to text
         return " $text " ;
         }
} 
function ParseText($text){ 

    // Prefix bare www. hosts with a scheme, then undo any doubled scheme.
    $text = preg_replace( "/www\./", "http://www.", $text );
        $regex ="/http:\/\/http:\/\/www\./";
    $text = preg_replace( $regex, "http://www.", $text );
        $regex2 = "/https:\/\/http:\/\/www\./";
    $text = preg_replace( $regex2, "https://www.", $text );

        return preg_replace_callback('/<a\s(.+?)>(.+?)<\/a>/is',
                array( $this, 'preg_callback_url'), $text); 
      } 

} 
$regexp = new RegClass();
echo $regexp->ParseText($text);