Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3911385/

Time: 2020-08-14 06:42:59  Source: igfitidea

Convert HTML to plain text in Java

java parsing plaintext jsoup htmleditorkit

Asked by brayne

I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be produced not only for <br> but for other tags too, e.g. <tr/> and </p> should also lead to a new line.

Sample HTML pages for testing are:

Note that these are only random URLs.

I have tried out various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.

Example using JSoup:

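import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // assumption: JUnit 4; the original does not say which version

public class JSoupTest {

 @Test
 public void SimpleParse() {
  try {
   // Fetch and parse the page, then print its text content
   Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
   System.out.print(doc.text());

  } catch (IOException e) {
   e.printStackTrace();
  }
 }
}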

Example with HTMLEditorKit:

import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 StringBuffer s;

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuffer();
   ParserDelegator delegator = new ParserDelegator();
   // the third parameter is TRUE to ignore charset directive
   delegator.parse(in, this, Boolean.TRUE);
 }

 public void handleText(char[] text, int pos) {
   s.append(text);
 }

 public String getText() {
   return s.toString();
 }

 public static void main (String[] args) {
   try {
     // the HTML to convert
    URL  url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
    URLConnection conn = url.openConnection();
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    String inputLine;
    String finalContents = "";
    while ((inputLine = reader.readLine()) != null) {
     finalContents += "\n" + inputLine.replace("<br", "\n<br");
    }
    BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
    writer.write(finalContents);
    writer.close();

     FileReader in = new FileReader("samples/testHtml.html");
     Html2Text parser = new Html2Text();
     parser.parse(in);
     in.close();
     System.out.println(parser.getText());
   }
   catch (Exception e) {
     e.printStackTrace();
   }
 }
}



Answered by camickr

I would guess you could use the ParserCallback.

You would need to add code to support the tags that require special handling. There are:

  1. handleStartTag
  2. handleEndTag
  3. handleSimpleTag

callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
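
As a rough sketch of that idea (not camickr's own code), the callback below overrides handleEndTag and handleSimpleTag and appends a newline for a few tags. The class name NewlineCallback, the toText helper, and the choice of tags (P, DIV, TR, BR) are assumptions made for illustration; adjust the set to the tags you care about.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects text and adds newlines for selected tags.
public class NewlineCallback extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // Block-level tags: break the line when the element closes.
        if (t == HTML.Tag.P || t == HTML.Tag.DIV || t == HTML.Tag.TR) {
            text.append('\n');
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // <br> has no end tag, so it is reported as a simple tag.
        if (t == HTML.Tag.BR) {
            text.append('\n');
        }
    }

    // Hypothetical helper: parse an HTML string and return the collected text.
    public static String toText(String html) throws IOException {
        NewlineCallback callback = new NewlineCallback();
        // The third argument (true) tells the parser to ignore charset directives.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(toText("<p>Hello</p><p>World<br>again</p>"));
    }
}
```

handleStartTag could be overridden the same way if you prefer the newline to appear before an element rather than after it.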

Answered by Suresh Kumar

You can use XSLT for this purpose. Take a look at this link, which addresses a similar problem.
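
For illustration only, here is a minimal sketch of what an XSLT-based conversion might look like with the JDK's built-in javax.xml.transform API. It assumes the input is well-formed XHTML without a namespace declaration; the class name XsltHtmlToText, the toText helper, and the stylesheet itself are made up for this example and are not taken from the linked page.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class: converts well-formed XHTML to text via XSLT.
public class XsltHtmlToText {

    // Text output; emit a line break after the content of br, p and tr elements.
    // All other elements fall through to XSLT's built-in rules, which copy text.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='br|p|tr'>"
        + "<xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String toText(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xhtml = "<html><body><p>Hello</p><p>World<br/>again</p></body></html>";
        // Each of Hello, World and again should end up on its own line.
        System.out.print(toText(xhtml));
    }
}
```

Real-world HTML is rarely well-formed XML, so it would need to be cleaned up first before a stylesheet like this can be applied.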

Hope it is helpful.

Answered by mschonaker

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
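
A minimal sketch of the SAX approach, assuming the input is already well-formed XHTML (the JTidy clean-up step is not shown) and has no DOCTYPE that would make the parser try to fetch a DTD. The class name SaxHtmlToText, the toText helper, and the list of line-breaking tags are hypothetical choices for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects character data and adds newlines
// when selected elements end.
public class SaxHtmlToText extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // With the default (non-namespace-aware) parser, qName holds the tag name.
        if (qName.equalsIgnoreCase("br")
                || qName.equalsIgnoreCase("p")
                || qName.equalsIgnoreCase("tr")) {
            text.append('\n');
        }
    }

    public static String toText(String xhtml) throws Exception {
        SaxHtmlToText handler = new SaxHtmlToText();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toText("<html><body><p>Hello</p><p>World<br/>again</p></body></html>"));
    }
}
```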

Answered by PhiLho

Building on your example, with a hint from the "html to plain text?" message:

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
  public void SimpleParse()
  {
    try
    {
      Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
      // Trick for better formatting
      doc.body().wrap("<pre></pre>");
      String text = doc.text();
      // Converting nbsp entities
      text = text.replaceAll("\u00A0", " ");
      System.out.print(text);
    }
    catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  public static void main(String args[])
  {
    TestJsoup tjs = new TestJsoup();
    tjs.SimpleParse();
  }
}

Answered by Sam Barnum

Have your parser append text content and newlines to a StringBuilder.

final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    public boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        String s = new String(data);
        sb.append(s.trim());
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        handleStartTag(t, a, pos);
    }
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

Answered by John Camerin

JSoup is not compatible with FreeMarker (or any other custom/non-HTML) tags. Consider this the purest solution for converting HTML to plain text:

http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();