HTML/XML Parser for Java

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2129375/

Time: 2020-08-13 03:53:33  Source: igfitidea

java html xml dom parsing

Asked by Shayan

What HTML parsers have the following features:

  • Fast
  • Thread-safe
  • Reliable and bug-free
  • Parses HTML and XML
  • Handles erroneous HTML
  • Has a DOM implementation
  • Supports HTML4, JavaScript, and CSS tags
  • Relatively simple, object-oriented API

Which parser do you think is best?

Thank you.

Accepted answer by Shayan

Apache Tika is the best choice. Apache has recently extracted many sub-projects from its existing projects and released them on their own. Tika is one of them; it was previously a component of Apache Lucene. Given Apache's support and reputation, and the widely used parent project Lucene, it should be a very good choice. Furthermore, it is open source.

A brief introduction from the Apache Tika web site:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And the supported formats are:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

Answer by Valentin Rocher

The best known are NekoHTML and JTidy.

NekoHTML is based on Xerces and provides a simple, adaptable SAXParser that implements the Java SE XMLReader interface.
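Because the parser sits behind the standard XMLReader interface, handler code written against SAX should not need to know which parser feeds it. Here is a minimal sketch of that idea. It uses the JDK's built-in (strict XML) reader as a stand-in; the class and method names are made up for illustration, and passing NekoHTML's own SAXParser to collectLinks instead is an assumption I have not tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects link targets through the generic XMLReader interface.
class SaxLinkCollector {

    /** Returns the href value of every <a> element reported by the given reader. */
    static List<String> collectLinks(XMLReader reader, String markup) throws Exception {
        List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Compare case-insensitively: different parsers may report
                // element names in different case.
                if ("a".equalsIgnoreCase(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return links;
    }

    /** The JDK's default SAX reader; it only accepts well-formed XML/XHTML. */
    static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"http://example.com/\">example</a></body></html>";
        System.out.println(collectLinks(jdkReader(), xhtml));
    }
}
```

The only parser-specific line is the one that creates the reader, so trying a more lenient HTML parser should mean changing jdkReader() and nothing else.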

JTidy is more intended for reformatting your HTML code into something XML-valid, but it is still very useful as an XML parser, producing a DOM tree if needed.
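To illustrate why the "XML-valid" step matters, here is a small sketch using only the JDK's own DOM parser (no JTidy involved; the class and method names are invented for the example). The strict parser rejects typical tag soup, but builds a DOM tree once the same markup is well-formed, which is roughly the state a cleanup tool is meant to get you to.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: strict XML parsing of messy vs. cleaned-up markup.
class StrictDomDemo {

    /** Parses markup with the JDK's XML parser; returns null if it is not well-formed. */
    static Document parseOrNull(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            return null; // markup is not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br> and <b>: fine for browsers, fatal for an XML parser.
        String messy = "<html><body><p>Hello<br><b>world</p></body></html>";
        // The same content after cleanup into well-formed markup.
        String clean = "<html><body><p>Hello<br/><b>world</b></p></body></html>";

        System.out.println("messy parsed: " + (parseOrNull(messy) != null));
        Document doc = parseOrNull(clean);
        System.out.println("clean root element: " + doc.getDocumentElement().getTagName());
    }
}
```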

You could have a look at this list for other alternatives.

Another choice could be to use hpricot through JRuby.

Answer by Kico Lobo

Well, there aren't as many good HTML parsers in Java as you might need, but here are some alternatives: http://java-source.net/open-source/html-parsers

Very few of them support JavaScript. Actually, I think you'll have to do this part on your own using Rhino (http://www.mozilla.org/rhino/).

Answer by Pascal Thivent

I think that HTML Cleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.

Answer by Pascal Thivent

You probably want to look at doing something like running Mozilla in headless mode. Here is a link to get you started; I am sure you can use Google to find more information.

Answer by Cesar

Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds to me like exactly what you want. You create XML script files that tell the scraper how to extract the information you need and from where. The provided GUI is very useful for quickly testing the scripts.

Check out the project's samples page to see if it's a good fit for what you are trying to do.

Answer by Ms2ger

Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.
