Which HTML parser for Java is the best?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2168610/

Time: 2020-08-13 04:21:19  Source: igfitidea

Which HTML Parser is the best?

java, html, parsing, html-parsing, web-scraping

Asked by Yatendra Goel

I code a lot of parsers. Up until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.

Now I want to separate the two tasks.

Since 80% of my work involves only parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser is the best. A parser that is close to HtmlUnit's parser would be preferable.



EDIT:

By "best", I mean that I want at least the following features:

  1. Speed
  2. Ease of locating any HtmlElement by its "id", "name", or "tag type".

It is fine with me if it doesn't clean up dirty HTML. I don't need to clean any HTML source; I just need the easiest way to traverse HtmlElements and harvest data from them.

Accepted answer by Jonathan Hedley

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

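import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);           // parse the string into a document tree
Elements links = doc.select("a");           // all <a> elements (none in this sample)
Element head = doc.select("head").first();  // first match, or null if nothing matches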

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Answer by tangens

The best I've seen so far is HtmlCleaner:

HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.
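
To make the XPath idea concrete, here is a minimal sketch of looking elements up by id, name, and tag type once the markup is well-formed XML. It deliberately does not use HtmlCleaner's own API: it assumes the cleaning step has already happened and runs the JDK's built-in javax.xml.xpath over the result. The sample markup, the class name XPathLookupSketch, and the helper methods are all hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLookupSketch {
    // Hypothetical sample: well-formed XML of the kind a cleaning step might emit.
    static final String CLEANED =
        "<html><body>"
        + "<div id=\"main\"><p>First</p><p>Second</p></div>"
        + "<input name=\"q\" value=\"jsoup\"/>"
        + "</body></html>";

    // A fresh InputSource per query, because a reader can only be consumed once.
    private static InputSource source() {
        return new InputSource(new StringReader(CLEANED));
    }

    // Returns the string value of the first node matched by the expression.
    static String firstText(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, source());
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    // Returns how many nodes the expression matches.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, source(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText("//*[@id='main']/p[1]"));  // by id: First
        System.out.println(firstText("//*[@name='q']/@value")); // by name: jsoup
        System.out.println(count("//p"));                       // by tag type: 2
    }
}
```

Re-parsing the string for every query keeps the sketch short; for real workloads it would probably be better to parse once into a DOM and evaluate each expression against that.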

For other HTML parsers, see this SO question.

Answer by Ms2ger