HTML/XML parsers for Java

Question

Asked by Shayan

What HTML parsers have the following features:

Fast
Thread-safe
Reliable and bug-free
Parses HTML and XML
Handles erroneous HTML
Has a DOM implementation
Supports HTML4, JavaScript, and CSS tags
Relatively simple, object-oriented API

Which parser do you think is best?

Thank you.

Answer 1

Accepted answer by Shayan

Apache Tika is the best choice. Apache has recently extracted many sub-projects out of its existing projects and made them public. Tika is one of them; it was previously a component of Apache Lucene. Because of Apache's support and reputation, and the widely used parent project Lucene, it must be a very good choice. Furthermore, it is open source.

A brief introduction from the Apache Tika web site:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And the supported formats are:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

Answer 2

Answered by Valentin Rocher

The best known are NekoHTML and JTidy.

NekoHTML is based on Xerces and provides a simple, adaptable SAXParser, which implements the Java SE XMLReader interface.
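
Because the parser is exposed as an XMLReader, code written against that interface does not need to know which parser sits behind it. The following is only a minimal sketch of that idea: the class name `XmlReaderSketch` and its helper methods are made up for illustration, and the JDK's own strict reader stands in for NekoHTML so the example runs without extra libraries. Unlike an HTML-aware reader, the stand-in rejects malformed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XmlReaderSketch {

    /** Collects element names in document order through a SAX ContentHandler. */
    public static List<String> elementNames(XMLReader reader, String markup) throws Exception {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName); // called once per opening tag
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    /** Assumption: the JDK's built-in reader stands in for an HTML-aware XMLReader. */
    public static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p></body></html>";
        System.out.println(elementNames(jdkReader(), xhtml));
    }
}
```

Since `elementNames` accepts any XMLReader, passing a different implementation should require no change to the handler code.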

JTidy is more intended for reformatting your HTML code into something XML-valid, but it is still very useful as an XML parser, producing a DOM tree if needed.
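
To give an idea of what working with such a DOM tree looks like, here is a rough sketch that pulls link targets out of an `org.w3c.dom.Document`. It assumes the tree follows the standard W3C DOM interfaces; the class `DomWalkSketch` and its methods are hypothetical names, and the JDK's strict `DocumentBuilder` is used as a stand-in for a tidying parser, so the input here has to be well-formed already.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalkSketch {

    /** Assumption: the JDK's strict parser stands in for a parser that tidies HTML first. */
    public static Document parse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    /** Returns the href attribute of every <a> element, in document order. */
    public static List<String> linkTargets(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            hrefs.add(anchor.getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        System.out.println(linkTargets(parse(xhtml)));
    }
}
```

Only `parse` depends on the choice of parser; `linkTargets` works on any DOM document, which is the practical benefit of a parser that can hand back a DOM tree.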

You could have a look at this list for other alternatives.

Another choice could be to use hpricot through JRuby.

Answer 3

Answered by Kico Lobo

Well:

There aren't as many good HTML parsers in Java as you might need, but here are some alternatives: http://java-source.net/open-source/html-parsers

Very few of them support JavaScript. Actually, I think you'll have to handle that part on your own using Rhino (http://www.mozilla.org/rhino/).

Answer 4

Answered by Pascal Thivent

I think that HTML Cleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.

Answer 5

Answered by Pascal Thivent

You probably want to look at doing something like running Mozilla in headless mode. Here is a link to get you started; I am sure you can use Google to find more information.

Answer 6

Answered by Cesar

Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds to me like exactly what you want. You create XML script files that tell the scraper what information to extract and where to find it. The provided GUI is very useful for quickly testing the scripts.

Check out the project's samples page to see if it's a good fit for what you are trying to do.

Answer 7

Answered by Ms2ger

Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.
