Java: Which HTML parser is the best?

Question

Asked by Yatendra Goel

I write a lot of parsers. Until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.

Now I want to separate the two tasks.

As 80% of my work involves only parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser is the best. Ideally, it would work much like the HtmlUnit parser.

EDIT:

By "best", I mean it should have at least the following features:

Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".

It is fine with me if it doesn't clean dirty HTML code. I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

Answer 1

Accepted answer by Jonathan Hedley

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

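import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");          // every <a> element (none in this sample)
Element head = doc.select("head").first(); // first match, or null if nothing matches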

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Answer 2

Answered by tangens

The best I've seen so far is HtmlCleaner:

HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.
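
To illustrate the kind of XPath lookups this enables, here is a minimal sketch. It does not use HtmlCleaner's own API; it assumes the markup has already been cleaned into well-formed XML and queries it with the JDK's built-in `javax.xml.xpath` package. The class name `XPathLookup` and its helper methods are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of HtmlCleaner: runs XPath queries over
// markup that is assumed to be well-formed XML already.
final class XPathLookup {

    // Parses a well-formed XML string into a W3C DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns the string value of the first matching node ("" if none match).
    static String firstText(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    // Counts the nodes that match the expression.
    static int count(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body>"
            + "<p id=\"intro\">Hello</p>"
            + "<input name=\"q\" value=\"parser\"/>"
            + "<p>World</p>"
            + "</body></html>";
        Document doc = parse(xml);
        System.out.println(firstText(doc, "//*[@id='intro']"));      // lookup by id
        System.out.println(firstText(doc, "//*[@name='q']/@value")); // lookup by name
        System.out.println(count(doc, "//p"));                       // lookup by tag
    }
}
```

The three expressions cover the lookups asked for in the question: by "id", by "name", and by tag. Real-world HTML is rarely well-formed, which is why a cleaning step would have to come first.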

For other HTML parsers, see this SO question.

Answer 3

Answered by Ms2ger

I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It is the parser used in Mozilla since 2010-05-03.
