Original URL: http://stackoverflow.com/questions/701166/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Asked by Chas. Owens
One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:
People want to treat a file as a sequence of lines, but this is valid:
<tag
attr="5"
/>
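To make the line-splitting problem concrete, here is a minimal Python sketch. The pattern and variable names are illustrative assumptions, not anything from the original post; it only shows that a line-at-a-time regex never sees the tag name and the attribute together, while a parser does.

```python
import re
import xml.etree.ElementTree as ET

doc = '<tag\nattr="5"\n/>'

# Line-oriented matching: no single line holds both the tag name and the attribute.
pattern = re.compile(r'<tag\s+attr="(\d+)"')
line_hits = [line for line in doc.splitlines() if pattern.search(line)]

# A parser sees one element regardless of where the line breaks fall.
parsed_attr = ET.fromstring(doc).get("attr")

print(line_hits)    # []
print(parsed_attr)  # 5
```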
People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:
<img src="imgtag.gif" alt="<img>" />
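A rough Python sketch of what goes wrong here, assuming the common "everything up to the first >" pattern (the pattern and the AttrCollector class are made up for illustration). The regex cuts the tag off inside the alt value, while the standard library's html.parser keeps quoted attribute values intact.

```python
import re
from html.parser import HTMLParser

html = '<img src="imgtag.gif" alt="<img>" />'

# Naive pattern: a tag is "<", then anything that is not ">", then ">".
naive_tags = re.findall(r"<[^>]+>", html)

class AttrCollector(HTMLParser):
    """Hypothetical helper that records each start tag with its attributes."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))

collector = AttrCollector()
collector.feed(html)
collector.close()

print(naive_tags)      # the match stops at the ">" inside alt="<img>"
print(collector.seen)  # one img tag whose alt attribute is "<img>"
```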
People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):
<span id="outer"><span id="inner">foo</span></span>
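As a small illustration in Python (the pattern is an assumption about what a typical attempt looks like), a non-greedy regex pairs the outer opening tag with the first closing tag it finds, which belongs to the inner span. A parser builds the actual tree.

```python
import re
import xml.etree.ElementTree as ET

doc = '<span id="outer"><span id="inner">foo</span></span>'

# Non-greedy match: the outer <span> is paired with the inner </span>.
lazy = re.search(r"<span[^>]*>(.*?)</span>", doc).group(1)

# A parser keeps track of which closing tag belongs to which opening tag.
root = ET.fromstring(doc)
inner = root[0]

print(lazy)                         # <span id="inner">foo
print(root.get("id"))               # outer
print(inner.get("id"), inner.text)  # inner foo
```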
People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):
<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>
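Here is a sketch of the phone-number case in Python. The phone pattern is a simplistic assumption for this one format, not a general solution. It finds nothing in the raw markup, but it does match once a parser has reduced the markup to its text content.

```python
import re
import xml.etree.ElementTree as ET

markup = (
    '<span class="phonenum">(<span class="area code">703</span>)\n'
    '<span class="prefix">348</span>-<span class="linenum">3020</span></span>'
)

# Illustrative pattern for "(ddd) ddd-dddd"; real phone formats vary widely.
phone = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

# Against the raw markup the digits are separated by tags, so nothing matches.
raw_hit = phone.search(markup)

# Extract the text a reader would see, then match against that.
text = "".join(ET.fromstring(markup).itertext())
text_hit = phone.search(text)

print(raw_hit)               # None
print(repr(text_hit.group()))
```

Note that the regex is still doing useful work here; it is just applied to the extracted text rather than to the markup.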
Comments may contain poorly formatted or incomplete tags:
<a href="foo">foo</a>
<!-- FIXME:
<a href="
-->
<a href="bar">bar</a>
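A minimal Python sketch of the comment problem, assuming a typical href-extracting pattern (the pattern and the HrefCollector class are illustrative). The regex picks up the half-written tag inside the comment and, as a side effect, swallows the real second link. html.parser skips the comment.

```python
import re
from html.parser import HTMLParser

doc = '''<a href="foo">foo</a>
<!-- FIXME:
<a href="
-->
<a href="bar">bar</a>'''

# The regex does not know about comments, so the broken tag counts too.
regex_hrefs = re.findall(r'<a href="([^"]*)"', doc)

class HrefCollector(HTMLParser):
    """Hypothetical helper that gathers href values from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

collector = HrefCollector()
collector.feed(doc)
collector.close()

print(regex_hrefs)      # ['foo', '\n-->\n<a href=']
print(collector.hrefs)  # ['foo', 'bar']
```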
What other gotchas are you aware of?
Accepted answer by bobince
Here's some fun valid XML for you:
<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
<?x <a> <!-- <b> ?> c --> d
</x>
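For a sense of how badly a naive tag pattern fares on the XML above, here is a hedged Python sketch. It assumes the standard library's expat-based ElementTree, which is non-validating, so it only checks well-formedness. The regex reports tags that live inside the CDATA section and the processing instruction, while the parser finds just two elements.

```python
import re
import xml.etree.ElementTree as ET

doc = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
<?x <a> <!-- <b> ?> c --> d
</x>'''

# Naive pattern: anything between "<" and the next ">".
naive_tags = re.findall(r"<[^>]+>", doc)

# The parser expands the entity and treats CDATA and the PI as non-markup.
root = ET.fromstring(doc)
elements = [el.tag for el in root.iter()]
attr_b = root[0].get("b")

print(naive_tags)  # includes a bogus '<a>' taken from inside the CDATA section
print(elements)    # ['x', 'a']
print(attr_b)      # a]>b>
```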
And this little bundle of joy is valid HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
<!ENTITY % e "href='hello'">
<!ENTITY e "<a %e;>">
]>
<title>x</TITLE>
</head>
<p id = a:b center>
<span / hello </span>
&<br left>
<!---- >t<!---> < -->
&e link </a>
</body>
Not to mention all the browser-specific parsing for invalid constructs.
Good luck pitting regex against that!
EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML/
<HEAD/
<TITLE/>/
<P/>
Answer by LordOfThePigs
Actually
<img src="imgtag.gif" alt="<img>" />
is not valid HTML, and is not valid XML either.
It is not valid XML because '<' is not a valid character inside attribute values; it must be escaped using the corresponding XML entity &lt; (a literal '>' is permitted there, but is conventionally escaped as &gt; as well).
It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.
The correct version in HTML is
<img src="imgtag.gif" alt="&lt;img&gt;">
and the correct version in XHTML and XML is
<img src="imgtag.gif" alt="&lt;img&gt;"/>
The following example you gave is also invalid
<
tag
attr="5"
/>
This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually
<tag
attr="5"
/>
And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character
<img src="image.gif" alt='This is single quoted AND valid!'>
All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tag soup as HTML doesn't mean that you have actually written valid HTML.
Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.
Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.
My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.
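As a rough illustration of how little code the parser route takes, here is a Python sketch using the standard library's ElementTree and its limited XPath support. The catalog document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: books at different nesting levels.
doc = """<catalog>
  <book id="b1"><title>First</title></book>
  <section><book id="b2"><title>Second</title></book></section>
</catalog>"""

root = ET.fromstring(doc)

# ".//" searches at any depth, so nesting does not matter.
titles = [title.text for title in root.findall(".//book/title")]

# Attribute predicates select a specific element.
second = root.find(".//book[@id='b2']/title").text

print(titles)  # ['First', 'Second']
print(second)  # Second
```

ElementTree implements only a subset of XPath; a fuller XPath or XSLT engine would need a third-party library.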
HTML is just too hard to even try parsing on your own. First, the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job of handling tag-soup HTML; just use these.
Answer by JaredPar
I wrote an entire blog entry on this subject: Regular Expression Limitations
The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting. You must have a context free grammar in order to count.
The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However, once you start adding recursion to your regular expressions, you are really stretching the boundaries and should consider a parser.
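The "counting" that a true regex cannot do is simple to express once a parser hands over start and end events. This Python sketch uses the standard library's html.parser; the DepthCounter class is a made-up name, and the code only illustrates the counter state, not a full well-formedness check.

```python
from html.parser import HTMLParser

class DepthCounter(HTMLParser):
    """Hypothetical helper: tracks how deeply one tag name is nested."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # currently open elements with this name
        self.deepest = 0    # maximum depth seen so far

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.deepest = max(self.deepest, self.depth)

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth -= 1

counter = DepthCounter("span")
counter.feed('<span id="outer"><span id="inner">foo</span></span>')
counter.close()

print(counter.deepest)  # 2
print(counter.depth)    # 0, every opened span was closed
```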
Answer by AmbroseChapel
One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.
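A quick Python sketch of the ordering problem. The sample links, the pattern, and the LinkMatcher class are all invented for illustration. A regex written for one attribute order misses the others, whereas a parser hands over the attributes as name/value pairs, so order and extra attributes stop mattering.

```python
import re
from html.parser import HTMLParser

links = [
    '<a href="foo" class="bar">one</a>',
    '<a class="bar" href="foo">two</a>',
    '<a class="bar" id="x" href="foo">three</a>',
]

# Pattern that bakes in one attribute order.
ordered = re.compile(r'<a href="foo" class="bar"')
regex_hits = [bool(ordered.search(link)) for link in links]

class LinkMatcher(HTMLParser):
    """Hypothetical helper: flags <a> tags with href "foo" and class "bar"."""
    def __init__(self):
        super().__init__()
        self.matched = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href") == "foo" and attributes.get("class") == "bar":
            self.matched = True

def parser_match(link):
    matcher = LinkMatcher()
    matcher.feed(link)
    matcher.close()
    return matcher.matched

parser_hits = [parser_match(link) for link in links]

print(regex_hits)   # [True, False, False]
print(parser_hits)  # [True, True, True]
```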
Answer by Anton Gogolev
It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.
Answer by Robin Day
Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?
I totally agree that parsing html and xml using a regex is not possible as other people have answered.
However, if your requirement is not to parse html/xml but to just get at one small bit of data in a "known good" bit of html / xml then maybe a regular expression or an even simpler "substring" is good enough.
Answer by chaos
People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.
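For illustration, a short Python sketch of the greedy slurp on a made-up two-element string. The non-greedy variant fixes this particular input, but it still has no notion of nesting.

```python
import re

doc = "<foo>one</foo> filler <foo>two</foo>"

# Greedy: .* runs to the LAST </foo>, swallowing everything in between.
greedy = re.findall(r"<foo>.*</foo>", doc)

# Non-greedy: stops at the first </foo> after each <foo>.
lazy = re.findall(r"<foo>(.*?)</foo>", doc)

print(greedy)  # ['<foo>one</foo> filler <foo>two</foo>']
print(lazy)    # ['one', 'two']
```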
Answer by Isaac Rabinovitch
I'm tempted to say "don't re-invent the wheel". Except that XML is a really, really complex format. So maybe I should say "don't reinvent the synchrotron."
Perhaps the correct cliché starts "when all you have is a hammer..." You know how to use regular expressions, regular expressions are good at parsing, so why bother to learn an XML parsing library?
Because parsing XML is hard. Any effort you save by not having to learn to use an XML parsing library will be more than made up by the amount of creative work and bug-swatting you will have to do. For your own sake, google "XML library" and leverage somebody else's work.
Answer by Michael Kay
I think the problems boil down to:
The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.
If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.
If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.
It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.
Answer by Adam Arold
I believe this classic has the information you are looking for. You can find the point in one of the comments there:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.
Some more info from Wikipedia: Chomsky Hierarchy