Java: How to convert HTML text to plain text?
Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/3607965/
Asked by MGSenthil
Friends, I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?
Answer by ankitjaininfo
Use an HTML parser such as HtmlCleaner.
For a detailed answer, see: How to remove HTML tag in Java
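The answer above does not show any HtmlCleaner code, so the following is only a stand-in for the same parser-based idea. It is a minimal sketch that swaps in the HTML parser bundled with the JDK (javax.swing.text.html) instead of HtmlCleaner; the class name ParserTextExtractor and the method toText are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text nodes reported by the JDK's HTML parser.
class ParserTextExtractor {
    static String toText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Append a space after every text chunk. This keeps words from
                // neighbouring elements apart, but it can also split a word that
                // is interrupted by an inline tag (for example "wor<b>ld</b>").
                out.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse whitespace runs left over from the appended separators.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling ParserTextExtractor.toText(html) returns the text content with the markup dropped. A dedicated library such as HtmlCleaner or Jericho is likely to cope better with real-world pages; this sketch only shows the shape of a parser-based solution.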
Answer by Jon Freedman
Answer by Sean Patrick Floyd
Just getting rid of HTML tags is simple:
// replace every run of one or more HTML tags (with optional
// whitespace in between) with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");
But unfortunately the requirements are never that simple:
Usually, <p> and <div> elements need separate handling, there may be CDATA blocks containing > characters (e.g. JavaScript) that break the regex, and so on.
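To make those caveats concrete, here is a minimal regex-only sketch that drops <script> and <style> elements together with their content and turns block-level tags into line breaks before stripping the remaining tags. The class name HtmlToTextSketch, the method toPlainText, and the list of tags treated as block-level are assumptions made for illustration, not part of the original answer, and regexes will still fail on sufficiently messy HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the extra handling mentioned above.
class HtmlToTextSketch {
    // <script> and <style> elements, including everything between the tags.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Tags assumed to start a new line when rendered (an illustrative subset).
    private static final Pattern BLOCK_TAG =
            Pattern.compile("(?i)</?(p|div|br|li|tr|h[1-6])\\b[^>]*>");
    // Any tag that is still left.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]*>");

    static String toPlainText(String html) {
        // Remove scripts and styles first so a '>' inside them cannot confuse later steps.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        // Block-level tags become line breaks; inline tags simply disappear.
        text = BLOCK_TAG.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Collapse runs of spaces/tabs, then runs of (possibly space-padded) line breaks.
        text = text.replaceAll("[ \\t]+", " ");
        text = text.replaceAll(" ?\\n[ \\n]*", "\n");
        return text.trim();
    }
}
```

For example, HtmlToTextSketch.toPlainText("<div><p>Hello <b>world</b></p><script>if (a > b) { x(); }</script><p>Bye</p></div>") yields two lines, "Hello world" and "Bye", with the script body removed.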
Answer by Kandha
You can use this single line to remove the HTML tags and display the result as plain text.
htmlString = htmlString.replaceAll("\\<.*?\\>", "");
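Two limitations of the one-liner are worth keeping in mind: '.' does not match line breaks by default, so a tag that spans several lines is left in place, and character entities such as &amp; are not decoded. The sketch below addresses both in a minimal way; the class name StripTagsSketch, the method stripTags, and the short entity list are assumptions for illustration only.

```java
// Hypothetical helper: the one-liner above plus multi-line tags and a few entities.
class StripTagsSketch {
    static String stripTags(String html) {
        // (?s) lets '.' match line breaks, so tags that span lines are removed too.
        String text = html.replaceAll("(?s)<.*?>", "");
        // Decode a handful of common entities. "&amp;" is handled last so that
        // "&amp;lt;" ends up as the literal text "&lt;" instead of "<".
        // Note that "&nbsp;" is mapped to an ordinary space here.
        return text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }
}
```

A full solution would decode all named and numeric entities, which is one more reason the library-based answers on this page are usually the safer choice.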
Answer by Ganesan Palanisamy
If you want the text rendered the way a browser would display it, use:
import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString="data/test.html";
        if (args.length==0)
            System.err.println("Using default argument of \""+sourceUrlString+'"');
        else
            sourceUrlString=args[0];
        if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
        Source source=new Source(new URL(sourceUrlString));
        String renderedText=source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}
I hope this also helps with rendering tables the way a browser would.
Thanks, Ganesh
Answer by John Camerin
I needed a plain text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried HtmlCleaner (SourceForge), but that left the HTML header and style content in the output (with only the tags removed). http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
Setting maxLineLength to Integer.MAX_VALUE ensures lines are not artificially wrapped at 80 characters.
setNewLine(null) uses the same newline character(s) as the source.
Answer by Ranjit
Answer by Ruslanas
I use HTMLUtil.textFromHTML(value) from:
<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>