Original question: http://stackoverflow.com/questions/512342/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Regex in Java, finding start and end tag
Asked by Berlin Brown
I am trying to find data within an HTML document. I don't need a full-blown parser, as it is just the data inside one tag.
But, I want to detect the 'select' tag and the data in between.
return Pattern.compile(pattern,
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE |
        Pattern.DOTALL);

// End right angle bracket left off intentionally:
track_pattern_buf.append("<select");
track_pattern_buf.append("(.*?)");
track_pattern_buf.append("</select");
Is this the 'regex' that you would use?
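For reference, the fragments in the question can be assembled into something runnable roughly as follows. This is only a sketch: the class name `QuestionPatternSketch`, the method `firstCapture`, and the sample input are assumptions made for illustration and are not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionPatternSketch {
    /** Compiles a pattern with the three flags used in the question. */
    static Pattern compile(String pattern) {
        return Pattern.compile(pattern,
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    }

    /** Returns group 1 of the first match, or null if nothing matches. */
    static String firstCapture(String html) {
        // Hypothetical buffer, built the same way as in the question.
        StringBuilder trackPatternBuf = new StringBuilder();
        trackPatternBuf.append("<select");   // closing '>' left off, as in the question
        trackPatternBuf.append("(.*?)");
        trackPatternBuf.append("</select");
        Matcher m = compile(trackPatternBuf.toString()).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<SELECT name=\"a\">\n<option>1</option>\n</SELECT>";
        // Because the '>' is not part of the pattern, the capture also
        // contains the attributes and the '>' of the opening tag.
        System.out.println(firstCapture(html));
    }
}
```

Note that `Pattern.MULTILINE` only changes how `^` and `$` behave, so it has no effect on this particular expression; `Pattern.DOTALL` is what lets `.` cross line breaks.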
Answered by Gumbo
If you really want to stick with regular expressions (which are not the best choice), I'd use:
"<select[^>]*>(.+?)</select\s*>"
Answered by Aaron Maenpaa
I would use something that looked like:
"<select>([^<>]+)</select>"
I'm not sure why you left off the '>'s and I wouldn't want to match other tags (here I'm assuming we're looking for textual data and not a document fragment).
That being said, I'd really look into getting a DOM and using XPath (or similar) to do your queries, as regexes are not well known for their ability to deal with trees.
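A minimal sketch of the DOM plus XPath route, using only the JDK's built-in XML APIs. It assumes the input is well-formed XML (for example XHTML) without namespaces; real-world HTML often is not, and then an HTML-aware parser would be needed. The class name `SelectXPathSketch`, the method `optionTexts`, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectXPathSketch {
    /** Returns the text of every option element nested in a select element. */
    static List<String> optionTexts(String xhtml) {
        try {
            // Build a DOM tree from the (well-formed) markup.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Query the tree instead of pattern-matching the raw text.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList options = (NodeList) xpath.evaluate(
                    "//select/option", doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < options.getLength(); i++) {
                texts.add(options.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><select name=\"track\">"
                + "<option>One</option><option>Two</option>"
                + "</select></body></html>";
        System.out.println(optionTexts(xhtml));
    }
}
```

The XPath expression is case-sensitive, so this sketch would not find `<SELECT>` written in upper case.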
Answered by hyperboreean
I think it would be safer to have something like:
"<\s*select\s*>(.*?)<\s*/select\s*>"
To be safer still, you should probably add \w* after the first select in case the opening tag carries any other options (attributes).
Also, the third \s* could probably be skipped if your HTML is standards-compliant.
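A small sketch of how the whitespace-tolerant expression behaves in Java. Two things here are my assumptions rather than part of the answer: the backslashes are doubled because the expression is written as a Java string literal, and the second pattern uses `\b[^>]*` to allow attributes, since `\w*` matches only word characters and so would not cover something like ` name="x"` with its space, equals sign, and quotes. The names `SelectWhitespaceSketch`, `BARE`, `WITH_ATTRS`, and `firstBody` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectWhitespaceSketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // The expression from this answer: optional whitespace, no attributes.
    static final Pattern BARE = Pattern.compile(
            "<\\s*select\\s*>(.*?)<\\s*/select\\s*>", FLAGS);

    // Hypothetical variant: \b[^>]* lets the opening tag carry attributes.
    static final Pattern WITH_ATTRS = Pattern.compile(
            "<\\s*select\\b[^>]*>(.*?)<\\s*/select\\s*>", FLAGS);

    /** Returns the captured body of the first match, or null if none. */
    static String firstBody(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String spaced = "< select >a</select >";
        String withAttribute = "<select name=\"x\">a</select>";
        System.out.println(firstBody(BARE, spaced));
        System.out.println(firstBody(BARE, withAttribute));       // no match
        System.out.println(firstBody(WITH_ATTRS, withAttribute));
    }
}
```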
Answered by dimo414
I understand that you don't think you need a full-blown parser - we've all written an HTML regex parser at some point, thinking "My use case is so simple, surely I can use regex this time!"
But I think everyone who's gone and done it ultimately comes to the conclusion that just outsourcing the heavy lifting to one of the many excellent existing parsers would have been faster, easier, simpler, and safer. I know I have.
Check out jSoup - it's simple, it's fast, and it works. There's really no good reason not to use it.
If you're still not convinced, consider that you had to come and ask what the right pattern was, and you got three different answers in response, none of which does the whole job. That should tell you that the problem is much more complex than it seems at first glance.
Answered by Søren Ullidtz
Depending on your needs, I would also recommend doing a negative look-ahead to make sure you stop at the first occurrence of select.
"(?<selectGroupName><select>((?:(?!select).)*)</select>)"
The important part here is "((?:(?!select).)*)" which takes anything that doesn't conflict with the negative look-ahead.
The same could also be accomplished by using a lazy quantifier:
"(?<selectGroupName><select>(.*?)</select>)"
Both of these ensure that the match stops at the first occurrence of </select>, which prevents you from capturing several sections at once. However, they do not protect you against nested select tags; on the contrary, those would cause problems with these expressions. With these expressions, the following would be an issue:
<select>
<select>
</select>
</select>
Without the look-ahead or lazy quantifier, the following would be an issue instead:
<select>
</select>
<a>
<select>
</select>
</a>
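The difference between the variants can be checked with a short Java sketch. The `LAZY` and `LOOKAHEAD` patterns are the two expressions from this answer; the `GREEDY` pattern, the class name `SelectQuantifierSketch`, the helper `matches`, and the compacted sample inputs are assumptions added for comparison. `Pattern.DOTALL` is used so that `.` can also cross line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectQuantifierSketch {
    // Lazy quantifier: stop at the first closing tag.
    static final Pattern LAZY = Pattern.compile(
            "(?<selectGroupName><select>(.*?)</select>)", Pattern.DOTALL);

    // Negative look-ahead: the body may not contain the word "select".
    static final Pattern LOOKAHEAD = Pattern.compile(
            "(?<selectGroupName><select>((?:(?!select).)*)</select>)", Pattern.DOTALL);

    // Hypothetical greedy version, for comparison only.
    static final Pattern GREEDY = Pattern.compile(
            "(?<selectGroupName><select>(.*)</select>)", Pattern.DOTALL);

    /** Collects the named group of every match in the input. */
    static List<String> matches(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            found.add(m.group("selectGroupName"));
        }
        return found;
    }

    public static void main(String[] args) {
        String siblings = "<select>A</select><a><select>B</select></a>";
        String nested = "<select><select>X</select></select>";

        System.out.println(matches(LAZY, siblings));    // two separate matches
        System.out.println(matches(GREEDY, siblings));  // one match spanning both
        System.out.println(matches(LAZY, nested));      // unbalanced fragment
        System.out.println(matches(LOOKAHEAD, nested)); // inner element only
    }
}
```

On the nested input, the lazy version returns a fragment with two opening tags and one closing tag, while the look-ahead version matches only the inner element. Neither result is the complete outer element, which is the limitation described above.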

