HTML RegEx match open tags except XHTML self-contained tags

Notice: this page comes from a Chinese-English bilingual rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/1732348/

Time: 2020-08-29 01:19:08  Source: igfitidea

RegEx match open tags except XHTML self-contained tags

html regex xhtml

Asked by Jeff

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?
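
One way to check the pattern is to run it against the four sample tags. The sketch below assumes Python's `re` module (the question does not name a regex engine), and the last sample, a link with a slash inside its attribute value, is a hypothetical addition that shows a side effect of excluding `/`.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

samples = [
    '<p>',
    '<a href="foo">',
    '<br />',
    '<hr class="foo" />',
    '<a href="http://example.com/">',  # hypothetical extra sample
]

def captured_name(tag):
    """Return the captured tag name, or None when the pattern does not match."""
    m = OPEN_TAG.match(tag)
    return m.group(1) if m else None

results = {tag: captured_name(tag) for tag in samples}
for tag, name in results.items():
    print(f"{tag!r:35} -> {name}")
```

In this engine the two self-closing samples are rejected, but so is the link whose URL contains a slash. Note also that `*?` is the lazy (non-greedy) form of `*`, not the greedy one.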

Answered by bobince

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Regex-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a child ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he comes, his unholy radiance destroying all enlightenment, HTML tags leaking from your eyes like liquid pain, the song of regular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see it it is beautiful the final snuffing of the lies of Man ALL IS LOST ALL IS LOST the pony he comes he comes he comes the ichor permeates all MY FACE MY FACE oh god no NO NOOOO NΘ stop the an*gles are not real ZALGO IS TONY THE PONY HE COMES

Have you tried using an XML parser instead?
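
As a rough illustration of the parser route, here is a minimal sketch using `html.parser.HTMLParser` from Python's standard library. It is an HTML parser rather than an XML parser proper, and the class and function names (`OpenTagCollector`, `open_tags`) are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser

class OpenTagCollector(HTMLParser):
    """Collect names of opening tags, skipping XHTML-style self-closing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for tags such as <p> or <a href="foo">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <br /> or <hr class="foo" />; ignore them.
        pass

def open_tags(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.open_tags

print(open_tags('<p><a href="foo">text</a><br /><hr class="foo" /></p>'))
```

The parser, not a pattern, decides where each tag starts and ends, so quoting and slashes inside attribute values need no special handling here.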

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

Answered by Kaitlin Duck Sherwood

While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
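
A sketch of what such a one-off scrape might look like in Python. The markup, class names, and people below are invented for illustration; they are not the Parliament site's actual format. The approach only holds while every row follows exactly this known layout.

```python
import re

# Hypothetical markup: one table row per representative, always in this layout.
page = """
<tr><td class="name">Jane Citizen</td><td class="party">Example Party</td><td class="district">Northfield</td></tr>
<tr><td class="name">John Doe</td><td class="party">Sample Alliance</td><td class="district">Southbank</td></tr>
"""

# One capture group per cell; [^<]* stops at the next tag.
ROW = re.compile(
    r'<td class="name">([^<]*)</td>'
    r'<td class="party">([^<]*)</td>'
    r'<td class="district">([^<]*)</td>'
)

records = [
    {"name": name, "party": party, "district": district}
    for name, party, district in ROW.findall(page)
]

for record in records:
    print(record)
```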

Answered by NealB

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with RegEx.

But many will try, and some will even claim success, but only until others find the fault and totally mess you up.
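
The practical symptom shows up as soon as elements nest. The sketch below assumes Python's `re` module, which has no recursion or balancing constructs; it illustrates the failure and is not a proof of the claim about grammar classes. The sample markup is hypothetical.

```python
import re

html = "<div>outer <div>inner</div> tail</div><div>sibling</div>"

# What we would like: the first <div> together with its matching </div>.
wanted = "<div>outer <div>inner</div> tail</div>"

# Lazy matching stops at the first </div>, which closes the inner element.
lazy = re.search(r"<div>.*?</div>", html).group(0)

# Greedy matching runs on to the last </div>, swallowing the sibling.
greedy = re.search(r"<div>.*</div>", html).group(0)

print("lazy:  ", lazy)
print("greedy:", greedy)
print("either correct?", wanted in (lazy, greedy))
```

Neither variant can count how many elements are still open, which is the capability the nested case requires.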

Answered by Justin Morgan

Don't listen to these guys. You totally can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

  1. Solve the Halting Problem.
  2. Square a circle.
  3. Work out the Traveling Salesman Problem in O(log n) or less. If it's any more than that, you'll run out of RAM and the engine will hang.
  4. The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
  5. Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't quite finished the last part myself, but I know I'm getting close. It keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions for some reason, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

Answered by itsadok

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
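
A small check of that pattern, assuming Python's `re` module (the answer does not say which engine the product uses). The sample string is hypothetical, apart from the `badgenerator` tag quoted above.

```python
import re

# The pattern from the answer: quoted attribute values are consumed whole,
# so a '>' inside quotes does not end the tag early.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

html = '<a name="badgenerator""> <img alt="a > b"> <br />'
tags = TAG.findall(html)
print(tags)
```

The pattern matches every kind of tag, including self-closing ones, so it answers "where are the tags" rather than the narrower question asked.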

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.
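
The "if and if not" route might look like the following sketch in Python, whose `re` module does not accept the variable-width look-behind used above. Skipping closing tags such as `</a>` is an extra assumption made here because the question asks for opening tags only; the helper names are hypothetical.

```python
import re

# "If": the text looks like a tag at all (pattern from the answer).
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
# "If not": the tag ends in '/' plus optional whitespace before '>'.
SELF_CLOSING = re.compile(r"/\s*>$")

def open_tags(html):
    """Return tags that are neither self-closing nor closing tags."""
    return [
        tag
        for tag in TAG.findall(html)
        if not SELF_CLOSING.search(tag) and not tag.startswith("</")
    ]

print(open_tags('<p><a href="foo">x</a><br /><hr class="foo" /></p>'))
```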

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
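
A sketch of that clean-up pass, assuming Python's `re` module. The pattern and the `strip_noise` helper are illustrative, not the answer's own code, and the pass is itself a heuristic: a `-->` or `</script>` inside a script string would still confuse it.

```python
import re

NOISE = re.compile(
    r"<!--.*?-->"                              # comments
    r"|<!\[CDATA\[.*?\]\]>"                    # CDATA sections
    r"|<(script|style)\b[^>]*>.*?</\1\s*>",    # script and style elements
    re.DOTALL | re.IGNORECASE,
)

def strip_noise(html):
    """Remove the constructs that the tag-matching regex cannot cope with."""
    return NOISE.sub("", html)

sample = (
    '<p>a</p><!-- <b> -->'
    '<script>if (a < b) { x = "<i>"; }</script>'
    '<style>p > a {}</style><em>z</em>'
)
cleaned = strip_noise(sample)
print(cleaned)
```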

Answered by xanatos

There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse, the Stack Based Regex-Verse, and returned with powers and knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:

7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=

The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human-readable regex, this should help:

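using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static string FromBase64(string str)
{
    // Decode the base64 text into the compressed bytes.
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        // Inflate the raw DEFLATE data into msOut.
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }

        // The decompressed bytes are the UTF-8 text of the regex.
        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}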
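
For readers without a .NET toolchain, the same two steps can be sketched in Python. The `to_base64` helper and the short stand-in pattern are hypothetical and exist only to demonstrate a round trip; the block does not decode the long string above, although `from_base64` should apply to it as well since `b64decode` ignores line breaks by default.

```python
import base64
import zlib

def from_base64(text):
    """Base64-decode `text`, inflate the raw DEFLATE bytes, decode as UTF-8."""
    compressed = base64.b64decode(text)
    # wbits=-15 selects a raw DEFLATE stream (no zlib or gzip header),
    # which is the format .NET's DeflateStream reads and writes.
    return zlib.decompress(compressed, wbits=-15).decode("utf-8")

def to_base64(text):
    """Inverse helper (hypothetical, not in the answer) for the round trip."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")

# Hypothetical stand-in pattern, not the decoded regex from the answer.
pattern = r"<(?<ELEMENTNAME>[a-z]+)>"
encoded = to_base64(pattern)
print(encoded)
print(from_base64(encoded))
```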

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml or the full plain regex

Answered by dubiousjim

In shell, you can parse HTML using sed:

  1. Turing.sed
  2. Write HTML parser (homework)
  3. ???
  4. Profit!

Related (why you shouldn't use regex match):

Answered by Sam

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

  • (?<group>) - pushes the captured result on the capture stack with the name group.
  • (?<-group>) - pops the top most capture with the name group off the capture stack.
  • (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.
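
The increment, decrement, and compare-to-zero view can be sketched without any regex engine. The Python function below is a hypothetical illustration of that counter idea using parentheses; it is not how the .NET engine is implemented.

```python
def is_balanced(text, open_ch="(", close_ch=")"):
    """Recognize properly balanced open/close characters with one counter."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1            # "push": one more construct is open
        elif ch == close_ch:
            if depth == 0:
                return False      # "pop" on an empty stack: unbalanced
            depth -= 1            # "pop": close the innermost construct
    return depth == 0             # "empty": nothing may remain open

print(is_balanced("(a(b)c)"))
print(is_balanced("(a(b)c"))
print(is_balanced(")("))
```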

Consider the following regular expression:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |
   [^<>]*
)*
(?(opentag)(?!))

Use the flags:

  • Singleline
  • IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
  • IgnoreCase (not necessary)

Regular Expression Explained (inline)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                        # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |          # match xml / html comment
   <[^>]*/>                     |          # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |          # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |          # pop closing xml tag
   [^<>]*                                  # something between tags
)*                                         # match as many xml tags as possible
(?(opentag)(?!))                           # ensure no 'opentag' groups are on stack
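
Engines without balancing groups can approximate the pattern above by tokenizing with a plain regex and keeping the depth in ordinary code. The Python sketch below is a rough analogue under that assumption, not a translation of the .NET semantics; `TOKEN`, `balanced_element`, and the shortened sample are hypothetical.

```python
import re

# Same alternatives as the pattern above: comment, self-closing tag,
# closing tag, opening tag, or text between tags.
TOKEN = re.compile(r"<!--.*?-->|<[^>]*/>|</[^>]*>|<[^>]*>|[^<>]+", re.DOTALL)

def balanced_element(html, start_pattern):
    """Return the element that starts at start_pattern, up to its matching close."""
    start = re.search(start_pattern, html)
    if start is None:
        return None
    depth = 0
    pos = start.start()
    while pos < len(html):
        tok = TOKEN.match(html, pos)
        if tok is None:
            return None                      # stray '<' or '>' in the input
        text = tok.group(0)
        pos = tok.end()
        if text.startswith("<!--") or text.endswith("/>"):
            continue                         # comments and <br /> change nothing
        if text.startswith("</"):
            depth -= 1                       # pop
            if depth == 0:
                return html[start.start():pos]
        elif text.startswith("<"):
            depth += 1                       # push
    return None                              # ran out of input while still open

sample = (
    '<div><br />'
    '<ul id="matchMe" type="square">'
    '<li>stuff...</li>'
    '<li><div><ul><li>nested</li></ul></div></li>'
    '</ul>'
    '</div>'
)
match = balanced_element(sample, r'<ul\s+id="matchMe"\s+type="square"\s*>')
print(match)
```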

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:

<html>
<body>
<div>
   <br />
   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>
</div>
</body>
</html>

This found the match:

   <ul id="matchMe" type="square">
      <li>stuff...</li>
      <li>more stuff</li>
      <li>
          <div>
               <span>still more</span>
               <ul>
                    <li>Another &gt;ul&lt;, oh my!</li>
                    <li>...</li>
               </ul>
          </div>
      </li>
   </ul>

although it actually came out like this:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.

Answered by John Fiala

I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

Answered by moritz

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.
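
The backtracking effect can be reproduced with a small experiment, here assuming Python's `re` module. Appending an end-of-string anchor is just one hypothetical way of "adding something" to the pattern.

```python
import re

asker = r"<([a-z]+) *[^/]*?>"

# On its own, the lazy quantifier stops at the first '>'.
plain = re.search(asker, "<a >>").group(0)

# With '$' appended, the engine backtracks and lets [^/]*? swallow the
# first '>' so that the final '>' can sit at the end of the string.
anchored = re.search(asker + r"$", "<a >>").group(0)

print(repr(plain))
print(repr(anchored))
```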

My suggestion would be

<([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
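
A quick check of the suggested pattern against the question's samples, assuming Python's `re` module, which also supports the fixed-width look-behind used here. The last two samples are hypothetical additions: a link with slashes in its attribute value, and the `<a/ >` case from the note above.

```python
import re

OPEN_TAG = re.compile(r"<([a-z]+)[^>]*(?<!/)>")

samples = [
    '<p>',
    '<a href="foo">',
    '<br />',
    '<hr class="foo" />',
    '<a href="http://example.com/">',  # hypothetical: slashes inside a value
    '<a/ >',                           # the permissive case from the note
]

def captured_name(tag):
    """Return the captured tag name, or None when the pattern does not match."""
    m = OPEN_TAG.match(tag)
    return m.group(1) if m else None

results = {tag: captured_name(tag) for tag in samples}
for tag, name in results.items():
    print(f"{tag!r:35} -> {name}")
```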
