Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/457684/

Time: 2020-08-11 14:52:19  Source: igfitidea

Reading HTML file to DOM tree using Java

java html dom parsing

Asked by Stefan Teitge

Is there a parser/library which is able to read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/XPath API that Java provides.

Most libraries seem to have custom APIs to solve this task. Furthermore, converting HTML to an XML DOM seems to be unsupported by most of the available parsers.

Any ideas or experience with a good HTML DOM parser?

Accepted answer by bobince

JTidy, either by processing the stream to XHTML and then re-parsing it with your favourite DOM implementation, or by using parseDOM if the limited DOM implementation it gives you is enough.

Alternatively, Neko.

Answer by Pesto

Apache's Xerces2 parser should do what you want.

Answer by Peter Štibrany

TagSoup can do what you want.

Answer by Ichiro Furusato

Since HTML files are generally problematic, you'll need to first clean them up using a parser/scanner. I've used JTidy but never happily. NekoHTML works okay, but any of these tools are always just making a best guess of what is intended. You're effectively asking to let a program alter a document's markup until it conforms to a schema. That will likely cause structural (markup), style or content loss. It's unavoidable, and you won't really know what's missing unless you manually scan via a browser (and then you have to trust the browser too).

It really depends on your purpose — if you have thousands of ugly documents with tons of extraneous (non-HTML) markup, then a manual process is probably unreasonable. If your goal is accuracy on a few important documents, then manually fixing them is a reasonable proposition.

One approach is the manual process of repeatedly passing the source through a well-formed and/or validating parser, in an edit cycle using the error messages to eventually fix the broken markup. This does require some understanding of XML, but that's not a bad education to undertake.
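
One pass of that edit cycle might look like the following minimal sketch. It assumes the JDK's built-in parser and only checks well-formedness; the class name `WellFormednessCheck` and the method `firstError` are hypothetical names chosen for illustration, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports the first problem a well-formedness parse finds.
class WellFormednessCheck {

    // Returns null if the markup parses cleanly, otherwise "line:column: message".
    static String firstError(String source) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Replace the default handler so errors are rethrown to the caller
        // instead of being printed to stderr.
        db.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignore warnings */ }
            @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        try {
            db.parse(new InputSource(new StringReader(source)));
            return null;
        } catch (SAXParseException e) {
            // The parser stops at the first fatal error; report where it happened.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // The <b> element is never closed, so this markup is not well-formed.
        System.out.println(firstError("<p><b>broken</p>"));
        // Well-formed markup yields null.
        System.out.println(firstError("<p><b>fixed</b></p>"));
    }
}
```

You would fix the reported location in the source, run the check again, and repeat until it returns null. The parser stops at the first fatal error, so each pass surfaces only one problem.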

With Java 5 the necessary XML features — called the JAXP API — are now built into Java itself; you don't need any external libraries.

You first obtain an instance of a DocumentBuilderFactory, set its features, create a DocumentBuilder (parser), then call its parse() method with an InputSource. InputSource has a number of possible constructors, with a StringReader used in the following example:

import java.io.StringReader;
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
// ...

// 'source' is a String holding the (already cleaned-up, well-formed) markup.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
dbf.setNamespaceAware(true);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(false);
dbf.setExpandEntityReferences(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// parse() returns an org.w3c.dom.Document
return db.parse(new InputSource(new StringReader(source)));

This returns a DOM Document. If you don't mind using external libraries there are also the JDOM and XOM APIs, and while these have some advantages over the SAX and DOM APIs in JAXP, they do require libraries outside the JDK to be added. The DOM can be somewhat cumbersome, but after so many years of using it I don't really mind any longer.
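
Since the question asks about the standard DOM/XPath API, here is a minimal sketch of querying such a Document with the JDK's `javax.xml.xpath` package. It assumes the markup is already well-formed and declares no namespace, so the unprefixed name `a` matches. With an XHTML `xmlns` declaration and a namespace-aware parser, a `NamespaceContext` would be needed as well. The class name `XhtmlXPathDemo` and its methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: parse well-formed markup, then query it with XPath.
class XhtmlXPathDemo {

    // Parse a well-formed document into a W3C DOM Document using JAXP only.
    static Document parse(String source) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(source)));
    }

    // Collect the href attribute value of every <a> element, in document order.
    static List<String> linkTargets(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            targets.add(hrefs.item(i).getNodeValue());
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        String source = "<html><body><p>"
                + "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>"
                + "</p></body></html>";
        System.out.println(linkTargets(parse(source))); // prints [a.html, b.html]
    }
}
```

This only works on input that is already well-formed. Real-world HTML would first have to go through one of the clean-up tools mentioned in the other answers.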

Answer by Dewsworld

Here is a link that might be useful. It's a list of open source HTML parsers in Java: "Open Source HTML Parsers in Java".

Answer by Ali Bagheri

Use jsoup (https://jsoup.org). It is very simple and powerful, and it can read and modify HTML.

Sample:

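import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
Document doc = Jsoup.parse(page);  // page is an HTML String; for a File, use Jsoup.parse(file, "UTF-8")
Element main = doc.getElementById("MainView");  // the element with id="MainView", or null if absent
Elements links = doc.select(".link");  // all elements with class "link"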

To create elements, you can use j2html (https://j2html.com).
