Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/317053/

Time: 2020-08-28 22:47:35  Source: igfitidea

Regular expression for extracting tag attributes

html regex

Asked by splattne

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+

which works for strings like

<a href="test.html" class="xyz">

and (single quotes)

<a href='test.html' class="xyz">

but not for a string without quotes:

<a href=test.html class=xyz>

How can I modify my regex to make it work with attributes without quotes? Or is there a better way to do that?

Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the ground up.

Answered by VonC

If you have an element like

<name attribute=value attribute="value" attribute='value'>

this regex could be used to find each attribute name and value in turn:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Applied on:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'

Note: This does not work with numeric attribute values, e.g. <div id="1"> won't work.
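
The results listed above can be checked with a minimal Python sketch. It assumes Python's re module accepts the expression unchanged; the helper name extract_attributes is made up for this illustration and is not part of the original answer.

```python
import re

# VonC's expression, copied unchanged; group 1 is the name, group 2 the value.
ATTRIBUTE = re.compile(r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?''')

def extract_attributes(tag):
    """Return (name, value) pairs found in a single tag string (hypothetical helper)."""
    return ATTRIBUTE.findall(tag)

for tag in ('<a href=test.html class=xyz>',
            '<a href="test.html" class="xyz">',
            "<a href='test.html' class=\"xyz\">"):
    print(extract_attributes(tag))  # [('href', 'test.html'), ('class', 'xyz')]

# The <div id="1"> case from the note: the value comes back with its opening quote.
print(extract_attributes('<div id="1">'))  # [('id', '"1')]
```

In this sketch the id="1" case does produce a match, but the captured value still carries the opening quote, so the result is not usable as-is.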

Answered by Axeman

Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture a single quote character
            (.*?)         # anything else
             \2           # whichever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments*?!?!" Okay, then you can replace the . in the non-backtracking section with the following (it also handles CDATA sections):

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
  • Also, if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

Answered by Kent Fredric

Token Mantra response: you should not tweak, modify, harvest, or otherwise produce HTML/XML using regular expressions.

There are too many corner cases, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the many dozens of other tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as it's recognized, tested, and you use one.

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page. 
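
For comparison, here is a minimal sketch of the same idea in Python, using the standard library's html.parser as one example of a tried and tested tool. The class and function names are invented for this illustration, and the sketch collects every attribute of each anchor tag instead of only src.

```python
from html.parser import HTMLParser

class AnchorAttributeCollector(HTMLParser):
    """Collects the attributes of every <a> start tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and strips the quotes from values.
        if tag == "a":
            self.anchors.append(dict(attrs))

def anchor_attributes(html):
    """Return one dict of attributes per anchor tag found in html."""
    collector = AnchorAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors

print(anchor_attributes('<a href=test.html class=xyz>'))
print(anchor_attributes("<a href='test.html' class=\"xyz\">"))
```

The parser copes with unquoted, single-quoted, and double-quoted values without any hand-written pattern.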

Answered by Gumbo

You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.

So either don't use named captures:

(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

Or don't use the quantifier on this expression:

(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This also allows attribute values like bar=' baz='quux:

foo="bar=' baz='quux"

The drawback is that you have to strip the leading and trailing quotes afterwards.
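
A minimal sketch of the second variant in Python, including the quote stripping, might look like this. It assumes Python's re module, which spells named groups as (?P<name>...) instead of (?<name>...); the helper name extract_attributes is made up for the example.

```python
import re

# The named-capture expression without the quantifier, in Python's (?P<name>...) syntax.
ATTRIBUTE = re.compile(r'''(?P<name>\b\w+\b)\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^"'<>\s]+)''')

def extract_attributes(tag):
    """Return (name, value) pairs with the surrounding quotes stripped."""
    pairs = []
    for match in ATTRIBUTE.finditer(tag):
        value = match.group("value")
        # The quoted alternatives always start and end with the same quote character.
        if value[0] in "\"'":
            value = value[1:-1]
        pairs.append((match.group("name"), value))
    return pairs

print(extract_attributes("<a href=test.html class='xyz'>"))
print(extract_attributes('''<x foo="bar=' baz='quux">'''))
```

Iterating with finditer takes over the job of the quantifier: each call finds one attribute, so the name and value groups are never reused within a single match.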

Answered by bobince

Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.

Answered by Ivan Chaer

PHP (PCRE) and Python

Simple attribute extraction:

((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))

Or with tag opening/closure verification, tag name retrieval, and comment escaping. This expression handles unquoted and quoted values, single and double quotes, escaped quotes inside attributes, spaces around equals signs, and any number of attributes; it only checks for attributes inside tags, and it copes with different quotes within an attribute value:

(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

(Works better with the "gisx" flags.)

JavaScript

As JavaScript regular expressions don't support look-behinds, they can't support most features of the previous expressions I proposed. But in case it fits someone's needs, you could try this version:

(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
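
Because this version avoids look-behinds, it should also run unchanged under other engines. Here is a minimal sketch that tries it with Python's re module; the helper name extract_attributes is an assumption made for the example.

```python
import re

# The look-behind-free expression, copied unchanged.
ATTRIBUTE = re.compile(r'''(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)''')

def extract_attributes(tag):
    """Return (name, value) pairs; values stop at a quote, whitespace, '>' or '/>'."""
    return ATTRIBUTE.findall(tag)

print(extract_attributes('<a href=test.html class="xyz">'))  # [('href', 'test.html'), ('class', 'xyz')]
print(extract_attributes('<img src="a.png"/>'))               # [('src', 'a.png')]
```

Note that a value stops at the first whitespace or quote character, so a quoted value containing spaces is cut short.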

Answered by Israel Alberto RV

This is my best RegEx to extract properties from an HTML tag:

# Trim the match inside of the quotes (single or double)

(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2

# Without trim

(\S+)\s*=\s*([']|["])([\W\w]*?)\2

Pros:

  • You are able to trim the content inside of the quotes.
  • It matches all the special ASCII characters inside of the quotes.
  • If you have title="You're mine", the RegEx does not break.

Cons:

  • It returns 3 groups: first the property, then the quote (" or '), and finally the value inside the quotes. For example, for <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.

This is the online RegEx example: https://regex101.com/r/aVz4uG/13
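
The three groups can be illustrated with a minimal Python sketch. It assumes the trimming expression ends with a backreference \2 to the opening quote, which is what lets a value such as You're survive; the helper name first_attribute is made up for this example.

```python
import re

# Trimming variant, assumed to close with \2 (the same quote that opened the value).
TRIMMED = re.compile(r'''(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2''')

def first_attribute(tag):
    """Return (property, quote, value) for the first quoted attribute, or None."""
    match = TRIMMED.search(tag)
    return match.groups() if match else None

print(first_attribute('<div title="You\'re">'))     # ('title', '"', "You're")
print(first_attribute('<div title="  padded  ">'))  # ('title', '"', 'padded')
```

The lazy group in the middle grows one character at a time until the optional whitespace and the matching quote can follow, which is what trims the value.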





I normally use this RegEx to extract the HTML Tags:

I recommend this one if you are not matching a specific tag type such as <div or <span.

<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

For example:

<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns 
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">

This is the online RegEx example: https://regex101.com/r/aVz4uG/15

The bug in this RegEx is:

<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

In this tag:

<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>

Returns <div '> but it should not return any match:

Match:  <div '>

To "solve" this, remove the [^/]+? pattern:

<div(?:\".*?\"|'.*?'|.*?)*?>




Answer #317081 is good, but it does not match properly in these cases:

<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match instead of returning an empty value
<div title = "c"> # It doesn't recognize the spaces around the equals sign (=)

This is the improvement:

(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?

vs

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Allow spaces around the equals sign: (\S+)\s*=\s*((?:...

Change the last + and . to: |[>"']))?[^"']*)["']?

This is the online RegEx example: https://regex101.com/r/aVz4uG/8
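
The three cases listed above can be compared side by side with a minimal Python sketch. It assumes Python's re module accepts both expressions unchanged; the names ORIGINAL and IMPROVED are labels chosen for this example.

```python
import re

# Expression from answer #317081 and the modified version, both copied unchanged.
ORIGINAL = re.compile(r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?''')
IMPROVED = re.compile(r'''(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?''')

for tag in ('<div id="a">', '<div style="">', '<div title = "c">'):
    print(tag, ORIGINAL.findall(tag), IMPROVED.findall(tag))
# <div id="a">       [('id', '"a')]  [('id', 'a')]
# <div style="">     []              [('style', '')]
# <div title = "c">  []              [('title', 'c')]
```

Only these three quoted cases are exercised here; the sketch makes no claim about how the modified expression behaves on unquoted values.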

Answered by fedmich

splattne,

@VonC's solution partly works, but there are some issues if the tag has a mix of unquoted and quoted attributes.

This one works with mixed attributes:

$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";

To test it out:

<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";

$code = '    <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
    ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);
var_dump( $ms );

$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/>      ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);

var_dump( $ms );

$ms would then contain the keys in its 2nd element and the values in its 4th element (the 3rd element holds the opening quote, if any).

$keys = $ms[1];
$values = $ms[3];

Answered by user273314

Something like this might be helpful:

'(\S+)\s*?=\s*([\'"])(.*?|)\2'

Answered by Dietrich Baumgarten

Tags and attributes in HTML have the form

<tag 
   attrnovalue 
   attrnoquote=bli 
   attrdoublequote="blah 'blah'"
   attrsinglequote='bloob "bloob"' >

To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that matches are only reported inside HTML tags. Assuming you have the correct regex, the total regex would be:

attr(?=(attr)*\s*/?\s*>)

The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?

Unimportant groups are made non-capturing. The first matching group $1 gives you the name of the attribute; the value is one of $2, $3, or $4. I use $2$3$4 to extract the value. The final regex is

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)

Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.
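
A minimal Python sketch of how the final regex could be applied is shown below. It assumes Python's re module, where the $2$3$4 concatenation becomes a chain of "or" over the three value groups; the function name attributes is invented for this example.

```python
import re

# The final regex, split over two raw strings only to keep the lines short.
ATTRIBUTE_IN_TAG = re.compile(
    r'''\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?'''
    r'''(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)'''
)

def attributes(html):
    """Return (name, value) pairs; an attribute without a value gets ''."""
    result = []
    for match in ATTRIBUTE_IN_TAG.finditer(html):
        name, double_quoted, single_quoted, unquoted = match.groups()
        # At most one of the three value groups takes part in a match.
        result.append((name, double_quoted or single_quoted or unquoted or ""))
    return result

tag = '''<tag attrnovalue attrnoquote=bli attrdoublequote="blah 'blah'" attrsinglequote='bloob "bloob"' >'''
print(attributes(tag))
print(attributes('x a=1 <b c=2>'))  # [('c', '2')]: a=1 is outside any tag
```

The second call shows the effect of the lookahead: a=1 looks like an attribute, but nothing after it leads to a closing >, so it is not reported.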
