How to use regular expressions to parse HTML in Java?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/677038/

Date: 2020-08-11 17:47:08  Source: igfitidea

Tags: java, regex

Asked by Ricardo Felgueiras

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?

Thanks for any suggestion.

Accepted answer by Dave Webb

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML parser instead. See also "What are the pros and cons of the leading Java HTML parsers?"
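
As a rough illustration of the parser route, here is a minimal sketch that uses the lenient HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) rather than any of the third-party parsers discussed on this page. The class name `LinkCollector` and the method `collectUrls` are hypothetical names chosen for this example, and the JDK parser targets an old HTML dialect, so treat it as a demonstration of the idea and not as a recommendation over a dedicated library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href and src attribute values in document order.
public class LinkCollector {

    public static List<String> collectUrls(String html) {
        final List<String> urls = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // tags with content, e.g. <a href=...>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // empty tags, e.g. <img src=...>
            }

            private void collect(MutableAttributeSet attrs) {
                HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
                for (HTML.Attribute name : wanted) {
                    Object value = attrs.getAttribute(name);
                    if (value != null) {
                        urls.add(value.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='link1'>bar</a> <img src=\"pic.png\"></body></html>";
        for (String url : collectUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Because the parser hands over attributes already split from the markup, quoting style and attribute order stop being the caller's problem.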

Answer by Scott Cowan

If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String `data` for anchor tags and print their href.

Since you're just using anchor tags you should be OK with just regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.

// Locate the native parser library and the mozilla.dist.bin directory for this platform.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());

MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data); // data is the HTML string to parse
NodeList list = domDocument.getElementsByTagName("a");

// Print the href of every anchor element that has one.
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}

Answer by mP.

Don't use regular expressions. Use NekoHTML or TagSoup, which act as a bridge that lets you visit an HTML document through SAX or DOM, just as you would with XML.

Answer by Henryk Konsek

The other answers are right. The Java Regex API is not the proper tool for your goal. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the Regex API itself rather than a real-life problem (for learning purposes, for example), you can do it with the following code:

String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while(m.find()) {
   System.out.println(m.group(0));
   System.out.println(m.group(1));
}

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy/reluctant quantifier *? must be used in order to limit the capture to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
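
To see why the reluctant form matters, the following sketch runs the same input through both variants of the pattern. The class name `GreedyVsReluctant` and the helper `firstHref` are made up for this illustration; only the input string and the reluctant pattern come from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Returns group 1 of the first match, or null if the pattern does not match.
    public static String firstHref(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";

        // Reluctant: stops at the first closing quote followed by '>'.
        System.out.println(firstHref("<a href='(.*?)'>", html));
        // prints: link1

        // Greedy: runs on to the LAST closing quote followed by '>'.
        System.out.println(firstHref("<a href='(.*)'>", html));
        // prints: link1'>bar</a> baz <a href='link2
    }
}
```

The greedy version swallows everything between the first opening quote and the last closing quote, which is why it merges both anchors into one capture.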

Answer by Jörg W Mittag

Regular expressions can only parse regular languages; that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.

HTML parsers, on the other hand, can parse HTML; that's why they are called HTML parsers.

You should use your favorite HTML parser instead.

Answer by Guss

Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).

If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.

Try something like this:

/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
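
The pattern above is written in slash-delimited notation with an `i` flag, which Java does not have. One possible translation, assuming `Pattern.CASE_INSENSITIVE` stands in for the `i` flag, is sketched below; the class name `HrefRegex` and the method `extractHrefs` are illustrative names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegex {

    // Same expression as above; quotes are escaped for the Java string literal.
    private static final Pattern HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured URL of every anchor tag the pattern matches.
    public static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">x</a> "
                + "<A class='c' HREF='b.html'>y</A> "
                + "<a href=c.html>z</a>";
        System.out.println(extractHrefs(html));
        // prints: [http://example.com/a, b.html, c.html]
    }
}
```

The sketch handles double-quoted, single-quoted and unquoted values, but it still assumes the URL contains no spaces or quote characters, so it inherits the limits the other answers warn about.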