PHP: regexp for finding everything between <a> and </a> tags

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/343115/

Time: 2020-08-24 22:26:16  Source: igfitidea

regexp for finding everything between <a> and </a> tags

php regex

Asked by Vikram Haer

I'm trying to find a way to make a list of everything between <a> and </a> tags. I have a list of links and I want to get the names of the links (not where the links go, but what they're called on the page). This would be really helpful to me.

Currently I have this:

$lines = preg_split("/\r?\n|\r/", $content);  // content is the given page
foreach ($lines as $val) {
  if (preg_match("/(<A(.*)>)(<\/A>)/", $val, $alink)) {     
    $newurl = $alink[1];

    // put in array of found links
    $links[$index] = $newurl;
    $index++;
    $is_href = true;
  }
}

Answer by Tomalak

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
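
A minimal PHP sketch of how that pattern could be applied with preg_match_all (not part of the original answer). The sample $content string, the ~ delimiters, and the i and s flags are assumptions added for illustration:

```php
<?php
// Hypothetical sample input; in the question this would be the fetched page.
$content = '<p><a href="/one">First</a> and <A HREF="/two">Second <b>bold</b></A></p>';

// "~" as delimiter avoids having to escape the "/" in "</a>".
// i: match <A> as well as <a>; s: let "." match newlines too.
preg_match_all('~<a\b[^>]*>(.*?)</a>~is', $content, $matches);

// Group one holds whatever sits between the opening and closing tag,
// including any nested markup.
$linkTexts = $matches[1];
print_r($linkTexts); // ['First', 'Second <b>bold</b>']
```

With the i flag the pattern also covers the uppercase <A> tags used in the question, and because the whole page is searched at once there is no need to split it into lines first.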

Answer by slim

I'm a big fan of regexes, but this is not the right place to use them.

Use a real HTML parser.

  • Your code will be clearer
  • It will be more likely to work

I Googled for a PHP HTML parser, and found this one.

If you know you're working with XHTML, then you could use PHP's standard XML parser.
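
As a rough sketch of the XHTML route: the answer does not say which XML API it means, so the use of SimpleXML here is an assumption, as is the sample $xhtml fragment. This only works when the input is well-formed XML, and a document that declares the XHTML namespace would also need that namespace registered before the XPath query matches anything.

```php
<?php
// Hypothetical, well-formed XHTML fragment with a single root element
// and no namespace declaration.
$xhtml = '<div><a href="/one">First</a> <p>text <a href="/two">Second</a></p></div>';

$xml = simplexml_load_string($xhtml);
if ($xml === false) {
    // Not well-formed XML, so this approach does not apply.
    exit("Input is not well-formed XML\n");
}

$linkTexts = [];
// "//a" selects every <a> element at any depth.
foreach ($xml->xpath('//a') as $a) {
    // Casting the element to string yields its own text content.
    $linkTexts[] = (string) $a;
}
print_r($linkTexts); // ['First', 'Second']
```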

Answer by Xetius

<a\s*(.*)\>(.*)</a>

<a href="http://www.stackoverflow.com">Go to stackoverflow.com</a>

$1 = href="http://www.stackoverflow.com"

$2 = Go to stackoverflow.com

I answered a similar question, about stripping everything except <a> tags, here.

Answer by Avram Cosmin

The best and quickest way to create a list of what's between <a> and </a> is to use preg_match_all.

Example:

$pattern = '#<a[^>]*>([^<]*)<\/a>#';
$subject = '<a href="#">Link 1</a> <a href="#">Link 3</a> <a href="#">Link 3</a>';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);

OR

$pattern = '#<a[^>]*>(.*?)<\/a>#';
$subject = '<a href="#">2 > 1</a> <a href="#">1 < 2</a>';
preg_match_all($pattern, $subject, $matches);

The result will be:

Array (
 [0] => Link 1
 [1] => Link 3
 [2] => Link 3
)
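
The output above is for the first pattern. To see where the two patterns differ, here is a small sketch (an illustration, not part of the original answer) that runs both against the second $subject, whose link text contains a literal < character:

```php
<?php
// Sample input taken from the second snippet above.
$subject = '<a href="#">2 > 1</a> <a href="#">1 < 2</a>';

// [^<]* cannot cross a "<", so the link "1 < 2" is skipped entirely.
preg_match_all('#<a[^>]*>([^<]*)</a>#', $subject, $strict);

// .*? lazily accepts any character up to the next "</a>".
preg_match_all('#<a[^>]*>(.*?)</a>#', $subject, $lazy);

print_r($strict[1]); // ['2 > 1']
print_r($lazy[1]);   // ['2 > 1', '1 < 2']
```

So the first pattern is the stricter choice when link text never contains markup, while the second also returns text that includes < or nested tags.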

Answer by Juan José Brown

With the pattern

'<a.*?>(.*?)</a>'

You'll get

['sign up', 'log in', 'careers 2.0']

Searching in this markup:

<span id="hlinks-nav"><a href="/users/login?returnurl=%2fquestions%2f343115%2fregexp-for-finding-everything-between-a-and-a-tags">sign up</a><span class="lsep">|</span><a href="/users/login?returnurl=%2fquestions%2f343115%2fregexp-for-finding-everything-between-a-and-a-tags">log in</a><span class="lsep">|</span><a href="http://careers.stackoverflow.com">careers 2.0</a><span class="lsep">|</span></span>
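
One caveat with this pattern, sketched in PHP: <a.*?> has no word boundary after the "a", so it can start matching at other tags whose names begin with "a", such as <abbr>. The $markup sample and the ~ delimiters below are assumptions for illustration; the bounded pattern is the one from Tomalak's answer.

```php
<?php
// Hypothetical markup with an <abbr> element in front of a real link.
$markup = '<abbr title="for your information">FYI</abbr> <a href="#">real link</a>';

// Without a boundary, "<a" also matches the start of "<abbr", and the
// lazy group then runs on until the first "</a>" it can find.
preg_match_all('~<a.*?>(.*?)</a>~', $markup, $loose);

// \b requires the tag name to end after "a"; [^>]* stays inside the tag.
preg_match_all('~<a\b[^>]*>(.*?)</a>~', $markup, $bounded);

print_r($loose[1]);   // ['FYI</abbr> <a href="#">real link']
print_r($bounded[1]); // ['real link']
```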

Answer by Emma

For imaginary or invalid edge cases, an expression with a ["'] boundary and the i and s flags would also be an option, such as:

<a\s.*?['"]\s*>((?:(?!<\/a>).)*)<\/a>

Test

$re = '/<a\s.*?[\'"]\s*>((?:(?!<\/a>).)*)<\/a>/si';
$str = '<a href="https://google.com"
title="some title"
data-key="{\'key\':\'adf0a8dfq<>*1%\' >

some context in here <>

some context in there <>

</a>

<A href="https://google.com"
title="some title"
data-key="{\'key\':\'adf0a8dfq<>*1%\'>

some context in here

some context in there

</A>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);


If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it matches against some sample inputs.

RegEx Circuit

jex.im visualizes regular expressions:

[image: visualization of the expression]

Answer by mickmackusa

If I am going to complain about all of the regex solutions, I suppose I need to actually demonstrate how to use a proper HTML parser (the OP makes no indication that the HTML to be parsed is in any way invalid -- so a legitimate parser is absolutely appropriate for script stability and quality).

Now, my advice does require that you become familiar with the basics of DOMDocument (and optionally DOMXPath), but you will see that the syntax is far less cryptic than a regex expression once you understand the components involved. For this reason, I will also argue that this technique will improve the overall readability of your script (for you and future readers of your code).

Code: (Demos)

$html = <<<HTML
<a href="#">hello</a> <abbr href="#">FYI</abbr> <a title="goodbye">later</a>
<a href=https://example.com>no quoted attributes</a>
<A href="https://example.com"
title="some title"
data-key="{\'key\':\'adf0a8dfq<>*1%\'">a link with data attribute</A>
and
this is <a title="hello">not a hyperlink</a> but simply an anchor tag
HTML;

$dom = new DOMDocument; 
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$linkText = [];
foreach ($xpath->evaluate("//a[@href]") as $node) {
    $linkText[] = $node->nodeValue;
}
var_export($linkText);

Output:

array (
  0 => 'hello',
  1 => 'no quoted attributes',
  2 => 'a link with data attribute',
)    

If you don't care whether the href attribute exists:

Code:

$doc = new DOMDocument();
$doc->loadHTML($html);
$aTags = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $aTags[] = $a->nodeValue;
}
var_export($aTags);

Output:

array (
  0 => 'hello',
  1 => 'later',
  2 => 'no quoted attributes',
  3 => 'a link with data attribute',
  4 => 'not a hyperlink',
)

Answer by guerda

Regex, the black magic, again :)

I found one nice question about common regexes. It has some interesting links where you will find very common regular expressions like yours.

Grabbing HTML Tags

<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured into the second backreference. This solution will also not match tags nested in themselves.
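
A short PHP sketch of the any-tag pattern with its \1 backreference. The sample $html, the ~ delimiters, and the s flag are assumptions for illustration; the i flag is what "turn off case sensitivity" amounts to in PHP:

```php
<?php
// Hypothetical sample markup mixing lower- and uppercase tag names.
$html = '<p>Intro <a href="/x">link</a></p><B>bold</B>';

// Group 1 captures the tag name; \1 forces the closing tag to repeat it.
// Group 2 captures everything between the pair.
$pattern = '~<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>~is';
preg_match_all($pattern, $html, $m, PREG_SET_ORDER);

foreach ($m as $match) {
    printf("%s: %s\n", $match[1], $match[2]);
}
// p: Intro <a href="/x">link</a>
// B: bold
```

The inner <a> element is not reported on its own, because matches cannot overlap and it already lies inside the <p> match.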

Otherwise: Browse this link: keyword "link". There are some interesting approaches to filter links.

I hope this helps :)

Good luck!

Answer by Jørn Jensen

Well... using regular expressions is not perfect, but in Perl regexp syntax,

m!<a .*?>(.*?)</a>!i

should give you the name of the first link on that line in match group one, ignoring case.

Limitations:

  • Does not handle multiple links on one line.
  • Does not handle links going over several lines.
  • Will also match anchor tags that are not links.

You could work around this by joining all lines into one line and then splitting it into an array (or multiple lines), using the link start as the separator.
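
That workaround could look roughly like this in PHP. This is a sketch under assumptions, not the answerer's code: the sample $content, replacing line breaks with spaces, and splitting with a lookahead are all illustrative choices.

```php
<?php
// Hypothetical input: two links on one line, the second spanning lines.
$content = "<p><a href=\"/a\">first</a> and <a\n href=\"/b\">second\nlink</a></p>";

// 1. Join all lines into one by turning line breaks into spaces.
$oneLine = preg_replace('/\r?\n|\r/', ' ', $content);

// 2. Split just before each link start; the lookahead keeps the "<a"
//    at the beginning of its chunk instead of consuming it.
$chunks = preg_split('/(?=<a\s)/i', $oneLine, -1, PREG_SPLIT_NO_EMPTY);

// 3. Each chunk now holds at most one link, so the single-link
//    pattern from the answer is enough.
$names = [];
foreach ($chunks as $chunk) {
    if (preg_match('!<a .*?>(.*?)</a>!i', $chunk, $m)) {
        $names[] = $m[1];
    }
}
print_r($names); // ['first', 'second link']
```

In PHP the same list can usually be had more directly by running preg_match_all with the s flag over the whole string, which removes the need to join and split at all.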
