Original question: http://stackoverflow.com/questions/238036/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this page): Stack Overflow
Java HTML Parsing
Asked by Richard Walton
I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags which use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help but feel there is a better solution out there.
Is there a nice way to hand a class a line of HTML and get some convenient methods such as:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Accepted answer by user31586
Several years ago I used JTidy for the same purpose:
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
Answer by Yuval
Answer by PhiLho
Answer by dave
The HTMLParser project (http://htmlparser.sourceforge.net/) is one option. It seems to handle malformed HTML fairly well. The following snippet should do what you need:
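import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.util.NodeList;

// htmlInput is the page to parse; keep only div elements carrying the target class
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);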
Answer by Fernando Miguélez
The main problem, as the preceding comments point out, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have XML (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original markup.
Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    // Downloads the page and lets HtmlCleaner turn it into a well-formed tag tree.
    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    // Returns every div, at any depth, whose class attribute contains CSSClassname.
    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();
        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            // A class attribute may hold several space-separated names,
            // so compare token by token instead of against the whole string.
            if (classType != null
                && Arrays.asList(classType.trim().split("\\s+")).contains(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }
        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));
            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
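As a rough sketch of the SAX route mentioned in this answer: once the page has already been converted to well-formed XHTML, a small handler built only on the JDK's javax.xml.parsers and org.xml.sax classes could collect the text of the matching divs. The class name DivTextExtractor, the extract helper, and the sample markup are made up for illustration, and the input is assumed to carry no DOCTYPE that would send the parser off to fetch a DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every div that uses a given CSS class.
public class DivTextExtractor extends DefaultHandler {
    private final String cssClass;
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    // Number of open divs counted from the matching div; 0 means "outside a match".
    private int depth = 0;

    DivTextExtractor(String cssClass) {
        this.cssClass = cssClass;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equalsIgnoreCase(qName)) {
            return;
        }
        if (depth > 0) {
            depth++; // nested div inside a match: track it so we find the right end tag
            return;
        }
        String classAttr = atts.getValue("class");
        // The class attribute is a space-separated list, so compare token by token.
        if (classAttr != null
                && Arrays.asList(classAttr.trim().split("\\s+")).contains(cssClass)) {
            depth = 1;
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && "div".equalsIgnoreCase(qName)) {
            depth--;
            if (depth == 0) {
                texts.add(current.toString().trim());
            }
        }
    }

    // Parses a well-formed XHTML string and returns the text of each matching div.
    public static List<String> extract(String xhtml, String cssClass) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DivTextExtractor handler = new DivTextExtractor(cssClass);
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"tags first\">java <a href=\"/q/1\">html</a></div>"
                + "<div class=\"other\">skipped</div>"
                + "<div class=\"tags\">parsing</div>"
                + "</body></html>";
        System.out.println(extract(xhtml, "tags"));
    }
}
```

Run as is, the main method prints [java html, parsing]. Divs nested inside a matching div are folded into the outer div's text rather than reported separately, which may or may not be what a real scraper wants.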
Answer by alex
HTMLUnit might help. It does a lot more than parsing, too.
Answer by FolksLord
Jericho: http://jericho.htmlparser.net/docs/index.html
Easy to use, supports malformed HTML, and comes with a lot of examples.
Answer by rajsite
Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean up malformed HTML and lets you parse HTML in Java using a jQuery-like tag selector syntax.
Answer by Mike Samuel
The nu.validator project is an excellent, high-performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
Answer by Vincent Massol
You can also use XWiki HTML Cleaner.
It uses HtmlCleaner and extends it to generate valid XHTML 1.1 content.