How to Parse Only Text from HTML (Java)
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me) on Stack Overflow.
Original source: http://stackoverflow.com/questions/3507353/
Asked by Jesvin
How can I parse only the text from a web page using jsoup in Java?
Answered by Ryan Berger
From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."
Answered by camickr
Using classes that are part of the JDK:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.
        Reader rd = getReader(args[0]);

        // Parse the HTML.
        kit.read(rd, doc, 0);

        // The HTML text is now stored in the document.
        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
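When the HTML is already in memory rather than behind a URL or in a file, the same HTMLEditorKit approach can be fed from a StringReader. The following is only a minimal sketch: the class name HtmlStringText and the method textOf are hypothetical names chosen for illustration, not part of the answer above. The result is trimmed because the parsed document may carry leading or trailing line breaks.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class (name chosen for this sketch only).
public class HtmlStringText {

    // Parses an HTML string with the JDK's Swing HTML parser and
    // returns the document text with surrounding whitespace trimmed.
    public static String textOf(String html) {
        try {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();
            // Same workaround as in the answer: ignore charset directives.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);
            return doc.getText(0, doc.getLength()).trim();
        } catch (IOException | BadLocationException e) {
            // Wrapped so that callers do not need a throws clause.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(textOf(html));
    }
}
```

Exact whitespace in the output depends on how the Swing parser lays out block elements, so treat the result as approximate plain text rather than a faithful rendering.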
Answered by jjnguy
Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to go about doing it. So, use at your own risk.
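public static String getPlainText(String html) {
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines
    // Keep the text captured between two adjacent tags and drop the tags themselves.
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    plainTextBody = plainTextBody.replaceAll("<br ?/>", "");
    // decodeHtml is a helper from the author's own code and is not shown in the answer.
    return decodeHtml(plainTextBody);
}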
This was originally used in my API wrapper for the Stack Overflow API, so it was only tested against a small subset of HTML tags.
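The answer calls decodeHtml without showing it. As a rough idea of what such a helper might do, here is a hypothetical stand-in that decodes a handful of common HTML entities. The class name HtmlEntities, the entity list, and the implementation are all assumptions made for illustration; the author's actual helper may behave differently, and a real project would more likely rely on a library for this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the decodeHtml helper that the answer does not show.
public class HtmlEntities {

    // Insertion order matters: "&amp;" is decoded last so that input such as
    // "&amp;lt;" becomes "&lt;" instead of being decoded twice into "<".
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&#39;", "'");
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&amp;", "&");
    }

    // Replaces each known entity with its literal character.
    public static String decodeHtml(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: Tom & Jerry <3 "cheese"
        System.out.println(decodeHtml("Tom &amp; Jerry &lt;3 &quot;cheese&quot;"));
    }
}
```

This sketch covers only six entities and ignores other named and numeric character references, which is one more reason to prefer a real HTML parser when the input is not tightly controlled.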

