在 Java 中将 HTML 转换为纯文本
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/3911385/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Convert HTML to plain text in Java
提问by brayne
I need to convert HTML to plain text. My only requirement of formatting is to retain new lines in the plain text. New lines should be displayed not only in the case of <br>
but other tags, e.g. <tr/>
, </p>
leads to a new line too.
我需要将 HTML 转换为纯文本。我对格式化的唯一要求是在纯文本中保留新行。新行不仅应该在 的情况下显示,<br>
而且其他标签也应该显示,例如<tr/>
,</p>
也会导致新行。
Sample HTML pages for testing are:
用于测试的示例 HTML 页面是:
- http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
- http://www.javadb.com/write-to-file-using-bufferedwriter
- http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
- http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
请注意,这些只是随机 URL。
I have tried out various libraries (JSoup, Javax.swing, Apache utils) mentioned in the answers to this StackOverflow questionto convert HTML to plain text.
我已经尝试了这个 StackOverflow 问题的答案中提到的各种库(JSoup、Javax.swing、Apache utils)来将 HTML 转换为纯文本。
Example using JSoup:
使用 JSoup 的示例:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
HTMLEditorKit 示例:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}
回答by camickr
I would guess you could use the ParserCallback.
我猜你可以使用 ParserCallback。
You would need to add code to support the tags that require special handling. There are:
您需要添加代码来支持需要特殊处理的标签。有:
- handleStartTag
- handleEndTag
- handleSimpleTag
- 句柄起始标签
- 句柄结束标签
- 处理简单标签
callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
回调应该允许您检查要监视的标签,然后将换行符附加到缓冲区。
回答by Suresh Kumar
回答by mschonaker
回答by PhiLho
Building on your example, with a hint from html to plain text?message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}
回答by Sam Barnum
Have your parser append text content and newlines to a StringBuilder.
让您的解析器将文本内容和换行符附加到 StringBuilder。
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
@Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
@Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
@Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);
回答by John Camerin
JSoup is not FreeMarker (or any other customer/non-HTML tag) compatible. Consider this as the most pure solution for converting Html to plain text.
JSoup 与 FreeMarker(或任何其他客户/非 HTML 标签)不兼容。将此视为将 Html 转换为纯文本的最纯粹的解决方案。
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();