Java: how to parse only the text from HTML

Question

Asked by Jesvin

How can I parse only the text from a web page using jsoup in Java?

Answer 1

Answered by Ryan Berger

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

Answer 2

Answered by camickr

Using classes that are part of the JDK:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly,
        // so tell it to ignore any charset directive in the page.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.

        Reader rd = getReader(args[0]);

        // Parse the HTML.

        kit.read(rd, doc, 0);

        //  The HTML text is now stored in the document

        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
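
The program above reads from a URL or a file. When the HTML is already in memory, the same HTMLEditorKit approach can probably be fed from a StringReader instead. The following is a minimal sketch under that assumption; the class name HtmlStringText and the helper extractText are hypothetical and not part of the original answer, and the exact whitespace in the result depends on how the Swing parser lays out the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

class HtmlStringText {

    // Hypothetical helper: parses an in-memory HTML string with
    // HTMLEditorKit and returns the document text, trimmed.
    static String extractText(String html) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Same property as in the answer: ignore charset directives in the markup.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
            return doc.getText(0, doc.getLength()).trim();
        } catch (IOException | BadLocationException e) {
            // Wrap checked exceptions so callers do not have to declare them.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>An <a href='http://example.com/'>"
                + "<b>example</b></a> link.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Wrapping the checked exceptions keeps the call site short; code that needs to distinguish I/O failures from parse failures would likely prefer to declare them instead.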

Answer 3

Answered by jjnguy

Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to go about it, so use it at your own risk.

public static String getPlainText(String html) {
    String htmlBody = html.replaceAll("<hr>", ""); // one-off for horizontal rule lines
    // Keep the text between a pair of tags ($1) and drop the tags themselves.
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    plainTextBody = plainTextBody.replaceAll("<br ?/>", "");
    // decodeHtml is a helper that this answer does not show.
    return decodeHtml(plainTextBody);
}

This was originally used in my API wrapper for the Stack Overflow API, so it was only tested against a small subset of HTML tags.
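
The method above calls decodeHtml, which the answer does not show. As a rough idea of what such a helper might do, here is a minimal sketch that assumes it only needs to decode a handful of common entities. The class name HtmlEntities, the entity table, and the decoding order are all assumptions, not the author's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HtmlEntities {

    // Assumed entity table; real pages may use many more named
    // or numeric entities than the few listed here.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&#39;", "'");
        // Mapped to a plain space here for simplicity.
        ENTITIES.put("&nbsp;", " ");
        // Decode &amp; last so "&amp;lt;" becomes "&lt;" rather than "<".
        ENTITIES.put("&amp;", "&");
    }

    // Hypothetical stand-in for the decodeHtml helper used in the answer.
    static String decodeHtml(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(decodeHtml("1 &lt; 2 &amp;&amp; 3 &gt; 2")); // prints "1 < 2 && 3 > 2"
    }
}
```

A LinkedHashMap is used so the entities are replaced in insertion order, which is what keeps the ampersand rule from decoding text twice.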
