Converting HTML to plain text in Java

Question

Asked by brayne

I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be produced not only for <br> but also for other tags, e.g. <tr/> and </p>.

Sample HTML pages for testing are:

Note that these are only random URLs.

I have tried out various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.

Example using JSoup:

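import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // assuming JUnit 4

public class JSoupTest {

 @Test
 public void SimpleParse() {
  try {
   Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
   // doc.text() normalizes whitespace, so the line breaks are lost here
   System.out.print(doc.text());

  } catch (IOException e) {
   e.printStackTrace();
  }
 }
}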

Example with HTMLEditorKit:

import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 StringBuffer s;

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuffer();
   ParserDelegator delegator = new ParserDelegator();
   // the third parameter is TRUE to ignore charset directive
   delegator.parse(in, this, Boolean.TRUE);
 }

 public void handleText(char[] text, int pos) {
   s.append(text);
 }

 public String getText() {
   return s.toString();
 }

 public static void main (String[] args) {
   try {
     // the HTML to convert
    URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    String inputLine;
    StringBuilder finalContents = new StringBuilder();
    // put every <br on its own line before writing the page to a local file
    while ((inputLine = reader.readLine()) != null) {
     finalContents.append("\n").append(inputLine.replace("<br", "\n<br"));
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
    writer.write(finalContents.toString());
    writer.close();

     FileReader in = new FileReader("samples/testHtml.html");
     Html2Text parser = new Html2Text();
     parser.parse(in);
     in.close();
     System.out.println(parser.getText());
   }
   catch (Exception e) {
     e.printStackTrace();
   }
 }
}

Answer 1

Answered by camickr

I would guess you could use the ParserCallback.

You would need to add code to support the tags that require special handling. There are:

handleStartTag
handleEndTag
handleSimpleTag

callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
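
As a rough sketch of that idea (the class name NewlineAwareHtml2Text and the choice of tags are assumptions made for illustration, not part of the answer), a callback could append a newline whenever a </p> or </tr> is closed or a <br> is seen:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here only for illustration.
public class NewlineAwareHtml2Text extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] text, int pos) {
        out.append(text);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // Closing a paragraph or a table row ends the current line.
        if (t == HTML.Tag.P || t == HTML.Tag.TR) {
            out.append('\n');
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // <br> has no end tag, so the parser reports it as a simple tag.
        if (t == HTML.Tag.BR) {
            out.append('\n');
        }
    }

    public static String convert(String html) {
        NewlineAwareHtml2Text callback = new NewlineAwareHtml2Text();
        try {
            // true = ignore any charset directive in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }
}
```

Other block-level tags (DIV, LI, headings) could be added to the same checks. The Swing parser may also insert implied tags for loosely structured HTML, so the exact number of newlines can differ from page to page.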

Answer 2

Answered by Suresh Kumar

You can use XSLT for this purpose. Take a look at this link, which addresses a similar problem.
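
A minimal sketch of the XSLT approach, using only the JDK's javax.xml.transform API. The stylesheet and the class name XsltHtml2Text are assumptions made for illustration (they do not come from the linked page), and the input is assumed to be well-formed XHTML without a namespace or DOCTYPE:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, named here only for illustration.
public class XsltHtml2Text {
    // Text output; <br> becomes a newline, <p> and <tr> end with a newline.
    // All other elements fall through to XSLT's built-in rules, which copy text nodes.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='br'><xsl:text>&#10;</xsl:text></xsl:template>"
      + "<xsl:template match='p|tr'>"
      + "<xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    public static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the document declares the XHTML namespace, the match patterns would need a namespace prefix, and real-world HTML usually has to be tidied into XHTML first.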

Hope it is helpful.

Answer 3

Answered by mschonaker

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
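
A minimal sketch of the SAX half of this suggestion, using the JDK's built-in parser. The class name SaxHtml2Text and the list of tags are assumptions made for illustration. The input is assumed to already be well-formed XHTML without a DOCTYPE (for example after a JTidy pass, which is not shown here):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named here only for illustration.
public class SaxHtml2Text extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default factory is not namespace-aware, so qName is the raw tag name.
        if (qName.equalsIgnoreCase("br")
                || qName.equalsIgnoreCase("p")
                || qName.equalsIgnoreCase("tr")) {
            out.append('\n');
        }
    }

    public static String convert(String xhtml) {
        SaxHtml2Text handler = new SaxHtml2Text();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }
}
```

HTML entities such as &nbsp; are not defined in plain XML, so a document that uses them would need the DTD available or the entities replaced beforehand.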

Answer 4

Answered by PhiLho

Building on your example, with a hint from the "html to plain text?" message:

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
  public void SimpleParse()
  {
    try
    {
      Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
      // Trick for better formatting
      doc.body().wrap("<pre></pre>");
      String text = doc.text();
      // Converting nbsp entities
      text = text.replaceAll("\u00A0", " ");
      System.out.print(text);
    }
    catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  public static void main(String args[])
  {
    TestJsoup tjs = new TestJsoup();
    tjs.SimpleParse();
  }
}

Answer 5

Answered by Sam Barnum

Have your parser append text content and newlines to a StringBuilder.

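import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// html is assumed to be a String holding the HTML source
final StringBuilder sb = new StringBuilder();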
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    public boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        String s = new String(data);
        sb.append(s.trim());
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        handleStartTag(t, a, pos);
    }
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

Answer 6

Answered by John Camerin

JSoup is not FreeMarker (or any other custom/non-HTML tag) compatible. Consider this the purest solution for converting HTML to plain text.

http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
