HTML/XML Parser for Java
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/2129375/
Asked by Shayan
What HTML parsers have the following features:
- Fast
- Thread-safe
- Reliable and bug-free
- Parses HTML and XML
- Handles erroneous HTML
- Has a DOM implementation
- Supports HTML4, JavaScript, and CSS tags
- Relatively simple, object-oriented API
Which parser do you think is best?
Thank you.
Accepted answer by Shayan
Apache Tika is the best choice. Apache has recently extracted many sub-projects from its existing projects and released them on their own. Tika is one of them; it was previously a component of Apache Lucene. Given Apache's support and reputation, and the widely used parent project Lucene, it should be a very good choice. Furthermore, it is open source.
A brief introduction from the Apache Tika web site:
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
And the supported formats are:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
Answer by Valentin Rocher
The best known are NekoHTML and JTidy.
NekoHTML is based on Xerces and provides a simple, adaptable SAXParser, which implements the Java SE XMLReader interface.
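Because the parser is exposed through the standard XMLReader interface, handler code written against SAX does not need to know which parser sits behind it. Here is a minimal sketch of that idea using only the JDK. The names SaxLinkExtractor, LinkHandler and jdkReader() are made up for this illustration, and the JDK's built-in reader only accepts well-formed markup. Passing NekoHTML's own parser in place of jdkReader() is the assumed way to cope with sloppy HTML; check its documentation for the exact class and for how it reports element-name case, which is why the tag comparison below ignores case.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: collects href values from <a> elements through the
// standard SAX XMLReader interface.
class SaxLinkExtractor {

    // Records the href attribute of every <a> start tag it sees.
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Ignore case, since some HTML parsers may normalise tag names.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    // The reader is a parameter, so another XMLReader implementation can be
    // passed in without touching the handler.
    static List<String> extractLinks(XMLReader reader, String markup) {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | SAXException e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
        return handler.links;
    }

    // Assumption for this sketch: the JDK's built-in reader, which rejects
    // anything that is not well-formed XML.
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><a href=\"http://example.com/\">one</a>"
                + "<p><a href=\"/two\">two</a></p></body></html>";
        System.out.println(extractLinks(jdkReader(), xhtml));
    }
}
```

SAX pushes events at the handler as it reads, so nothing is kept in memory beyond what the handler chooses to store, which suits large pages where only a few values are needed.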
JTidy is geared more toward reformatting your HTML code into valid XML, but it is still very useful as an XML parser and can produce a DOM tree if needed.
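Once a parser hands back a standard org.w3c.dom.Document, the rest of the code can stay on the plain W3C DOM API. The sketch below shows that consuming side with the JDK alone. DomHeadingReader, textOf() and parseWellFormed() are hypothetical names for this example, and parseWellFormed() stands in for the tidy-up step: it assumes the input is already well-formed, which is exactly the guarantee a tool like JTidy would be used to provide for real-world HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative only: reads element text out of a W3C DOM tree.
class DomHeadingReader {

    // Returns the trimmed text of every element with the given tag name,
    // in document order. Works on any org.w3c.dom.Document.
    static List<String> textOf(Document doc, String tagName) {
        List<String> texts = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            texts.add(element.getTextContent().trim());
        }
        return texts;
    }

    // Assumption for this sketch: the markup is already well-formed, so the
    // JDK's own DocumentBuilder can build the tree.
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not build a DOM tree", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><h1>Parsers</h1><p>intro</p><h1>More <em>parsers</em></h1></body></html>");
        System.out.println(textOf(doc, "h1"));
    }
}
```

A DOM tree costs more memory than SAX events, but it allows random access and repeated queries over the same page.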
You could have a look at this list for other alternatives.
Another choice could be to use hpricot through JRuby.
Answer by Kico Lobo
Well:
There aren't as many good HTML parsers in Java as you might hope, but here are some alternatives: http://java-source.net/open-source/html-parsers
Very few of them support JavaScript. Actually, I think you'll have to handle that part on your own using Rhino (http://www.mozilla.org/rhino/).
Answer by Pascal Thivent
I think that HTML Cleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.
Answer by Cesar
Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds to me like exactly what you want. You create XML script files that tell the scraper how to extract the information you need and from where. The provided GUI is very useful for quickly testing the scripts.
Check out the project's samples page to see if it's a good fit for what you are trying to do.
Answer by Ms2ger
Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.