Java: How to convert HTML text to plain text?

Note: this page is based on a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/3607965/

Time: 2020-08-14 02:35:12  Source: igfitidea

How to convert HTML text to plain text?

Tags: java, html

Asked by MGSenthil

Friends, I have to parse the description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

Answered by ankitjaininfo

Use an HTML parser like htmlCleaner.

For a detailed answer, see: How to remove HTML tag in Java
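
To illustrate the parser-based approach without an extra dependency, here is a minimal sketch that swaps in the HTML parser bundled with the JDK (javax.swing.text.html) instead of htmlCleaner. The class and method names are made up for illustration, and the built-in parser targets older HTML, so treat this as a rough starting point rather than a replacement for a dedicated library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; the class and method names are assumptions for illustration.
class SwingHtmlToText {

    /** Collects the text chunks reported by the JDK's HTML parser. */
    static String toPlainText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space; this can split a word that is
                // only partly wrapped in an inline tag, e.g. "<b>H</b>ello".
                out.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace into single spaces.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Hello <b>plain</b> text</p>"));
    }
}
```

Only handleText is overridden here; the same callback also exposes handleStartTag and handleEndTag if line breaks should be emitted for block elements.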

Answered by Jon Freedman

I'd recommend parsing the raw HTML through jTidy, which should give you output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.
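
As a rough sketch of the XPath half of that suggestion: the code below assumes jTidy (not shown) has already turned the raw HTML into well-formed markup, and then uses only the JDK's own javax.xml.xpath API. The class name, method name, and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes the input is already well-formed (e.g. tidied) markup.
class XPathTextExtractor {

    /** Returns the trimmed text content of every node matched by the XPath expression. */
    static List<String> select(String wellFormedMarkup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates the text of all descendant nodes.
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><p>First <b>paragraph</b>.</p><p>Second.</p></body></html>";
        // Prints [First paragraph., Second.]
        System.out.println(select(markup, "//p"));
    }
}
```

Real tidied XHTML usually carries a namespace and a DOCTYPE, so the expressions and parser settings would likely need adjusting for that.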

Answered by Sean Patrick Floyd

Just getting rid of HTML tags is simple:

// Replace each run of one or more HTML tags (with optional
// whitespace in between) with a single space character.
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

But unfortunately the requirements are never that simple:

Usually, <p> and <div> elements need separate handling, and there may be CDATA blocks with > characters (e.g. JavaScript) that break the regex, and so on.
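
One way the extra handling could look, as a minimal sketch rather than a robust solution: drop script and style blocks first (so a > inside JavaScript cannot confuse the tag pattern), turn <p>, <div> and <br> into line breaks, then strip whatever tags remain. The class name and the choice of block-level tags are assumptions for illustration, and HTML entities such as &amp; are left undecoded.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and the set of block-level tags are assumptions.
class RegexHtmlStripper {
    // Whole <script>...</script> and <style>...</style> blocks, content included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Tags that should become a line break instead of a space.
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)<br\\s*/?>|</?(p|div)\\b[^>]*>");
    // Any remaining tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = BLOCK_BREAK.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll(" ");
        text = text.replaceAll("[ \\t]+", " ");   // collapse spaces and tabs
        text = text.replaceAll(" ?\\n ?", "\n");  // drop spaces around line breaks
        text = text.replaceAll("\\n{2,}", "\n");  // collapse repeated line breaks
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello <b>world</b></p>"
                + "<script>if (a > b) {}</script><p>Bye</p></div>";
        // Prints "Hello world" and "Bye" on separate lines.
        System.out.println(strip(html));
    }
}
```

Comments, CDATA sections outside script blocks, and attribute values containing > would still trip this up, which is the usual argument for a real parser.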

Answered by Kandha

You can use this single line to remove the HTML tags and display the result as plain text.

htmlString = htmlString.replaceAll("\\<.*?\\>", "");
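
One caveat worth checking with this pattern: by default `.` does not match line terminators, so a tag that is broken across lines is left in place. The sketch below (class and method names are made up for illustration) compares the plain pattern with a variant that enables DOTALL mode via `(?s)`.

```java
// Hypothetical demo class; names are assumptions for illustration.
class MultiLineTagDemo {

    /** The one-liner as is: '.' stops at line breaks. */
    static String stripSingleLine(String html) {
        return html.replaceAll("<.*?>", "");
    }

    /** Same pattern with (?s), so '.' also matches line breaks. */
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"x\">link</a>";
        // The opening tag spans two lines, so only </a> is removed here.
        System.out.println(stripSingleLine(html));
        // Prints "link".
        System.out.println(stripDotAll(html));
    }
}
```

Using `[^>]*` instead of `.*?` has the same effect, since a negated character class matches line breaks too.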

Answered by Ganesan Palanisamy

If you want the text rendered the way a browser would display it, use:

import net.htmlparser.jericho.Source;
import java.net.URL;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        // Default input; pass a file path or URL as the first argument to override it.
        String sourceUrlString = "data/test.html";
        if (args.length == 0) {
            System.err.println("Using default argument of \"" + sourceUrlString + '"');
        } else {
            sourceUrlString = args[0];
        }
        // Treat an argument without a scheme as a local file path.
        if (sourceUrlString.indexOf(':') == -1) {
            sourceUrlString = "file:" + sourceUrlString;
        }
        Source source = new Source(new URL(sourceUrlString));
        // Render the document as text, laid out roughly as a browser would show it.
        String renderedText = source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this also helps to render tables the way a browser would.

Thanks, Ganesh

Answered by John Camerin

I needed a plain-text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried htmlCleaner (sourceforge), but that left the HTML header and style content in place (with the tags removed). http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

Setting maxLineLength to Integer.MAX_VALUE ensures lines are not artificially wrapped at 80 characters. setNewLine(null) uses the same newline character(s) as the source.

Answered by Ranjit

Yes, Jsoup will be the better option. Just do the following to convert the whole HTML text to plain text.

String plainText = Jsoup.parse(your_html_text).text();

Answered by Ruslanas

I use HTMLUtil.textFromHTML(value) from:

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>