Java HTML Parsing

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/238036/

Time: 2020-08-11 11:52:19  Source: igfitidea

Tags: java, html, parsing, web-scraping

Asked by Richard Walton

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML - This works, but I can't help but feel there is a better solution out there.

Is there any nice way where I could give a class a line of HTML and have some nice methods like:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

Accepted answer by user31586

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."

Answer by Yuval

If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.
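
As a rough illustration of the SAX route, here is a minimal sketch that uses only the JDK's built-in parser. The class name DivTextSaxSketch, the method textOfDivs, and the sample markup are made up for this example, and it assumes the input is already well-formed XHTML with no external DTD to fetch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library: collects the text of matching divs.
final class DivTextSaxSketch {

    /** Returns the trimmed text of every div whose class attribute lists cssClass. */
    static List<String> textOfDivs(String xhtml, String cssClass) throws Exception {
        List<String> results = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // div nesting depth inside a matching div; 0 means outside
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!"div".equals(qName)) {
                    return;
                }
                if (depth > 0) { // nested div: its text is folded into the outer match
                    depth++;
                    return;
                }
                String classes = atts.getValue("class");
                // The class attribute may hold several space-separated names.
                if (classes != null
                        && Arrays.asList(classes.trim().split("\\s+")).contains(cssClass)) {
                    depth = 1;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("div".equals(qName) && depth > 0 && --depth == 0) {
                    results.add(text.toString().trim());
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"classname\">first <a href=\"/a\">link</a></div>"
                + "<div class=\"other\">skipped</div>"
                + "<div class=\"classname extra\">second</div>"
                + "</body></html>";
        System.out.println(textOfDivs(xhtml, "classname"));
    }
}
```

Because SAX streams the document, nothing but the collected text is held in memory, which is why it suits read-only extraction.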

Answer by PhiLho

You might be interested in TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers work only on well-formed XHTML.
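
To see why a plain XML parser is not enough for typical HTML, the following small sketch (JDK only; the class name WellFormednessCheck and the sample strings are invented for illustration) feeds markup to the standard SAX parser and reports whether it is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether markup is well-formed enough for an XML parser.
final class WellFormednessCheck {

    /** Returns true if the JDK's XML parser accepts the markup, false if it rejects it. */
    static boolean parsesAsXml(String markup) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as a SAXException (SAXParseException).
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // XHTML style: attribute quoted, <br/> self-closed.
        System.out.println(parsesAsXml("<div class=\"a\"><br/>text</div>"));
        // Common HTML style: unquoted attribute, unclosed <br>.
        System.out.println(parsesAsXml("<div class=a><br>text</div>"));
    }
}
```

Markup of the second kind is common on real pages, which is the gap a tolerant parser such as TagSoup is meant to fill.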

Answer by dave

The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

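import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.util.NodeList;
// htmlInput is assumed to be a String naming the page to parse (for example its URL)
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);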

Answer by Fernando Miguélez

The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML code (XHTML) there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.

Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();

        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            // The class attribute may list several space-separated names, so match on tokens
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null
                && java.util.Arrays.asList(classType.trim().split("\\s+")).contains(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}
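
For the tree-based route mentioned above, a minimal sketch with the JDK's own DOM and XPath support might look like the following. It assumes the page has already been cleaned into well-formed XHTML without a default namespace; the class name DivLinkDomSketch, the method linksInDivs, and the sample markup are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists link targets found inside divs that use a given class.
final class DivLinkDomSketch {

    /** Returns the href of every link nested in a div whose class attribute lists cssClass. */
    static List<String> linksInDivs(String xhtml, String cssClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pass the class name as an XPath variable rather than splicing it into the expression.
        xpath.setXPathVariableResolver(name -> " " + cssClass + " ");
        // Padding both sides with spaces turns the substring test into a whole-token match.
        String expr = "//div[contains(concat(' ', normalize-space(@class), ' '), $cls)]//a/@href";

        NodeList hrefs = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            links.add(hrefs.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"tags first\"><a href=\"/questions/1\">one</a></div>"
                + "<div class=\"other\"><a href=\"/skip\">skip</a></div>"
                + "<div class=\"tags\"><p><a href=\"/questions/2\">two</a></p></div>"
                + "</body></html>";
        System.out.println(linksInDivs(xhtml, "tags"));
    }
}
```

Unlike a SAX handler, this keeps the whole document in memory, which is the price of being able to navigate and modify the tree.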

Answer by alex

HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/

Answer by FolksLord

Jericho: http://jericho.htmlparser.net/docs/index.html

Easy to use, supports malformed HTML, and comes with a lot of examples.

Answer by rajsite

Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean malformed HTML and allows HTML parsing in Java using jQuery-like tag selector syntax.

http://jsoup.org/

Answer by Mike Samuel

The nu.validator project is an excellent, high-performance HTML parser that doesn't cut corners correctness-wise.

The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)

Answer by Vincent Massol

You can also use XWiki HTML Cleaner:

It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.
