领先的 Java HTML 解析器的优缺点是什么?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3152138/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverflow
What are the pros and cons of the leading Java HTML parsers?
提问 by Avi Flax
Searching SO and Google, I've found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately it's hard to find any information on the strengths and weaknesses of the various libraries. I'm hoping that some people have spent some time comparing these libraries, and can share what they've learned.
搜索 SO 和 Google,我发现有一些 Java HTML 解析器被各方一致推荐。不幸的是,很难找到有关各种库的优缺点的任何信息。我希望有些人花了一些时间比较这些库,并且可以分享他们所学到的东西。
Here's what I've seen:
这是我所看到的:
And if there's a major parser that I've missed, I'd love to hear about its pros and cons as well.
如果我遗漏了一个主要的解析器,我也很想听听它的优缺点。
Thanks!
谢谢!
采纳答案 by BalusC
General
一般的
Almost all known HTML parsers implement the W3C DOM API (part of the JAXP API, the Java API for XML Processing) and give you an org.w3c.dom.Document back which is ready for direct use by the JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-well-formed HTML ("tag soup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> by the XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP APIs.
几乎所有已知的 HTML 解析器都实现了 W3C DOM API(JAXP API,即 Java API for XML Processing 的一部分),并返回一个可供 JAXP API 直接使用的 org.w3c.dom.Document。主要区别通常在于各个解析器的特性。大多数解析器在一定程度上对格式不良的 HTML(“tag soup”)比较宽容,例如 JTidy、NekoHTML、TagSoup 和 HtmlCleaner。您通常使用这类 HTML 解析器来“整理”HTML 源代码(例如把 HTML 中合法的 <br> 替换为 XML 中合法的 <br />),以便使用 W3C DOM 和 JAXP API“以通常的方式”遍历它。
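As a small sketch of that "tidying" step (assuming JTidy on the classpath; the input string here is made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.tidy.Tidy;

public class TidyDemo {
    public static void main(String[] args) {
        // Tag soup: unclosed <p> elements and an HTML-style <br>.
        String soup = "<p>First<br><p>Second";

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit XML-valid markup, e.g. <br /> for <br>
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        ByteArrayOutputStream cleaned = new ByteArrayOutputStream();
        tidy.parseDOM(new ByteArrayInputStream(soup.getBytes(StandardCharsets.UTF_8)), cleaned);

        // The tidied output is well-formed and can be handed to any JAXP tool.
        System.out.println(cleaned.toString());
    }
}
```

The same parseDOM call also returns the org.w3c.dom.Document directly, which is what the XPath example further down relies on.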
The only ones which jump out are HtmlUnit and Jsoup.
HtmlUnit
HtmlUnit
HtmlUnit provides its own complete API which gives you the possibility to act like a web browser programmatically, i.e. enter form values, click elements, invoke JavaScript, et cetera. It's much more than an HTML parser alone. It's a real "GUI-less web browser" and HTML unit-testing tool.
HtmlUnit 提供了一套完全独立的 API,使您可以以编程方式像网络浏览器一样操作,即输入表单值、单击元素、调用 JavaScript 等。它远不止是一个 HTML 解析器,而是一个真正的“无 GUI 网络浏览器”和 HTML 单元测试工具。
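A rough sketch of that browser-like usage (the URL, form name and field names below are hypothetical, and the HtmlUnit dependency is assumed to be on the classpath):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Load the page as a browser would, executing any JavaScript on it.
            HtmlPage page = webClient.getPage("http://example.com/search");

            // Hypothetical form: fill in a field and submit by clicking a button.
            HtmlForm form = page.getFormByName("search");
            form.getInputByName("q").setValueAttribute("java html parser");
            HtmlPage results = form.getInputByName("go").click();

            System.out.println(results.getTitleText());
        }
    }
}
```

Because HtmlUnit drives a full page lifecycle (requests, scripts, events), it is much heavier than a plain parser; reach for it when you need behaviour, not just markup.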
Jsoup
Jsoup
Jsoup also provides its own complete API. It gives you the possibility to select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.
Jsoup 也提供了一套完全独立的 API。它使您可以使用类似 jQuery 的 CSS 选择器来选择元素,并提供一个灵活的 API 来遍历 HTML DOM 树以获取感兴趣的元素。
Particularly the traversing of the HTML DOM tree is the major strength of Jsoup. Those who have worked with org.w3c.dom.Document know what a hell of a pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes life easier, but still, it's another learning curve and it can still end up being verbose.
特别是遍历 HTML DOM 树是 Jsoup 的主要优势。用过 org.w3c.dom.Document 的人都知道,使用冗长的 NodeList 和 Node API 遍历 DOM 有多么痛苦。诚然,XPath 能让事情轻松一些,但那又是另一条学习曲线,而且最终可能仍然很冗长。
Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would otherwise grow ten times as big without writing utility/helper methods).
这是一个示例,它使用像 JTidy 这样的“普通”W3C DOM 解析器结合 XPath 来提取问题的第一段和所有回答者的姓名(我使用 XPath 是因为如果没有它,又不编写实用/辅助方法,收集所需信息的代码会膨胀到十倍大)。
String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();
Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());
NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}
And here's an example of how to do exactly the same with Jsoup:
下面是一个使用 Jsoup 实现完全相同功能的示例:
String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();
Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());
Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
System.out.println("Answerer: " + answerer.text());
}
Do you see the difference? It's not only less code, but Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (by e.g. developing websites and/or using jQuery).
你看得到差别吗?它不仅代码更少,而且如果您已经对 CSS 选择器有一定的经验(例如开发网站和/或使用 jQuery),那么 Jsoup 也相对容易掌握。
Summary
概括
The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse it, then go for the first-mentioned group of parsers. There are quite a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there some listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you like to unit test the HTML, then HtmlUnit is the way to go. If you like to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.
各自的优缺点现在应该已经很清楚了。如果您只想使用标准的 JAXP API 来遍历,那么请选择前面提到的那一组解析器。这类解析器数量相当多。选择哪一个取决于它提供的功能(HTML 清理对你来说是否方便?是否有一些侦听器/拦截器和特定于标签的清理器?)以及库的健壮性(更新/维护/修复的频率如何?)。如果您想对 HTML 做单元测试,那么 HtmlUnit 是正确的选择。如果您想从 HTML 中提取特定数据(这通常才是现实世界的需求),那么 Jsoup 是最佳选择。
回答 by Alohci
Add the validator.nu HTML Parser, an implementation of the HTML5 parsing algorithm in Java, to your list.
将 validator.nu HTML Parser(HTML5 解析算法在 Java 中的实现)添加到您的列表中。
On the plus side, it's specifically designed to match HTML5, and it's at the heart of the HTML5 validator, so it's highly likely to match future browsers' parsing behaviour to a very high degree of accuracy.
从好的方面来说,它是专门为匹配 HTML5 而设计的,并且是 HTML5 验证器的核心,因此很有可能以非常高的准确度匹配未来浏览器的解析行为。
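Despite the different parsing algorithm, usage stays JAXP-friendly; a minimal sketch (assuming the nu.validator htmlparser artifact is on the classpath):

```java
import java.io.StringReader;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Html5ParseDemo {
    public static void main(String[] args) throws Exception {
        // Parses with the HTML5 algorithm, yet hands back a standard W3C DOM.
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader("<p>Hello")));

        // The unclosed <p> is recovered the way an HTML5 browser would recover it.
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```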
On the minus side, no browser's legacy parsing works exactly like this, and as HTML5 is still in draft, it's subject to change.
不利的一面是,没有浏览器的旧式解析完全像这样工作,而且 HTML5 仍处于草案中,可能会发生变化。
In practice, such problems only affect obscure corner cases, and it is, for all practical purposes, an excellent parser.
在实践中,此类问题仅影响模糊的极端情况,并且对于所有实际目的而言,都是一个出色的解析器。
回答 by Matt Solnit
This article compares certain aspects of the following parsers:
本文比较了以下解析器的某些方面:
- NekoHTML
- JTidy
- TagSoup
- HtmlCleaner
- NekoHTML
- JTidy
- TagSoup
- HtmlCleaner
It is by no means a complete summary, and it is from 2008. But you may find it helpful.
这绝不是一份完整的总结,而且它写于 2008 年。但你可能会发现它很有帮助。
回答 by MJB
回答 by Adam Gent
I'll just add to @MJB's answer: after working with most of the HTML parsing libraries in Java, I find there is a huge pro/con that is omitted: parsers that preserve the formatting and incorrectness of the HTML on input and output.
在使用过 Java 中的大多数 HTML 解析库之后,我想对 @MJB 的回答做一点补充:有一个巨大的优点/缺点被忽略了:在输入和输出时保留 HTML 原有格式和错误的解析器。
That is, when you change the document, most parsers will blow away the whitespace, comments, and incorrectness of the DOM, particularly if they are XML-like libraries.
也就是说,当您更改文档时,大多数解析器都会清除 DOM 中的空白、注释和错误之处,尤其是那些类似 XML 的库。
Jericho is the only parser I know of that allows you to manipulate nasty HTML while preserving whitespace formatting and the incorrectness of the HTML (if there is any).
Jericho是我所知道的唯一一个允许您操作讨厌的 HTML 的解析器,同时保留空白格式和 HTML 的不正确性(如果有的话)。
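A sketch of that kind of surgical edit (assuming the Jericho library on the classpath): one segment is replaced, and everything else, including the sloppy markup and odd whitespace, is emitted verbatim:

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

public class JerichoDemo {
    public static void main(String[] args) {
        // Sloppy input: unclosed <p>, irregular spacing -- Jericho keeps all of it.
        String html = "<html><body>  <h1>Old title</h1>\n<p>unclosed paragraph</body></html>";

        Source source = new Source(html);
        OutputDocument out = new OutputDocument(source);

        // Swap only the <h1> content; the rest of the source is untouched.
        Element h1 = source.getFirstElement("h1");
        out.replace(h1.getContent(), "New title");

        System.out.println(out.toString());
    }
}
```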
回答 by Mark Butler
Two other options are HTMLCleaner and HTMLParser.
另外两个选项是HTMLCleaner和HTMLParser。
I have tried most of the parsers here for a crawler / data extraction framework I have been developing. I use HTMLCleaner for the bulk of the data extraction work. This is because it supports reasonably modern dialects of HTML, XHTML and HTML 5, with namespaces, and it supports DOM, so it is possible to use it with Java's built-in XPath implementation.
我已经为自己一直在开发的爬虫/数据提取框架试用过这里的大部分解析器。我使用 HTMLCleaner 完成大部分数据提取工作。这是因为它支持带命名空间的、相当现代的 HTML、XHTML 和 HTML 5 方言,并且支持 DOM,因此可以与 Java 内置的 XPath 实现配合使用。
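A sketch of that combination (assuming the HTMLCleaner dependency, whose DomSerializer bridges its cleaned tree to a standard W3C DOM for the JDK's built-in XPath engine):

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public class HtmlCleanerXPathDemo {
    public static void main(String[] args) throws Exception {
        // Tag soup: the <li> and <ul> elements are never closed.
        String html = "<ul><li>First<li>Second";

        // Clean the markup into HTMLCleaner's own tree...
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(html);

        // ...serialize that tree into a standard org.w3c.dom.Document...
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(root);

        // ...and query it with the JDK's own XPath implementation.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String first = xpath.evaluate("//ul/li[1]", doc);
        System.out.println(first);
    }
}
```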
It's a lot easier to do this with HTMLCleaner than with some of the other parsers: JSoup, for example, supports a DOM-like interface rather than DOM, so some assembly is required. Jericho has a SAX-like interface, so again it requires some work, although Sujit Pal has a good description of how to do this, but in the end HTMLCleaner just worked better.
与其他一些解析器相比,使用 HTMLCleaner 做这件事要容易得多:例如,JSoup 支持的是类似 DOM 的接口而不是真正的 DOM,因此需要一些额外工作;Jericho 提供的是类似 SAX 的接口,同样需要一些工作,尽管 Sujit Pal 对如何做到这一点有很好的描述,但最终还是 HTMLCleaner 用起来更好。
I also use HTMLParser and Jericho for a table extraction task, which replaced some code written using Perl's libhtml-tableextract-perl. I use HTMLParser to filter the HTML for the table, then use Jericho to parse it. I agree with MJB's and Adam's comments that Jericho is good in some cases because it preserves the underlying HTML. It has a kind of non-standard SAX interface, so for XPath processing HTMLCleaner is better.
我还使用 HTMLParser 和 Jericho 进行表提取任务,它们替换了一些使用 Perl 的libhtml-tableextract-perl编写的代码。我使用 HTMLParser 过滤表格的 HTML,然后使用 Jericho 对其进行解析。我同意 MJB 和 Adam 的评论,即 Jericho 在某些情况下很好,因为它保留了底层 HTML。它有一种非标准的 SAX 接口,所以对于 XPath 处理 HTMLCleaner 更好。
Parsing HTML in Java is a surprisingly hard problem as all the parsers seem to struggle on certain types of malformed HTML content.
在 Java 中解析 HTML 是一个令人惊讶的难题,因为所有解析器似乎都在努力处理某些类型的格式错误的 HTML 内容。