Open source Java library for HTML to text conversion

Question

Asked by David Rabinowitz

Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text: it should strip all the tags, convert entities (&amp;, &nbsp;, etc.) and handle <br> and tables properly.

More Info

I have the HTML as a string, so there's no need to fetch it from the web. Also, what I'm looking for is a method like this:

String convertHtmlToPlainText(String html)

Answer 1

Accepted answer by Chris R

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but scroll down the homepage a bit and there's a link to it.

Answer 2

Answered by Ahmed Ashour

Try HtmlUnit; it even shows the page as it is after processing JavaScript / Ajax.

Answer 3

Answered by Pkunk

The bliki engine can do this in two steps. See info.bliki.wiki / Home

How to convert HTML to Mediawiki text -- Mediawiki text is already a fairly plain text format, but you can convert it further
How to convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); // get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );

Jsoup can do this more simply:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will be no newlines at all.
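
If the line breaks matter, one possible workaround that needs no extra library is the HTML parser that ships with the JDK (javax.swing.text.html). The sketch below is only an illustration, not part of bliki or Jsoup: the class name HtmlToText is hypothetical, the choice of tags that produce a newline (br, p, div, tr) is an assumption, and it assumes a JDK that includes the java.desktop module. Table handling is crude: each row simply ends with a newline.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects text nodes and emits a newline for a few tags.
final class HtmlToText {

    static String convertHtmlToPlainText(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Text arrives here without the surrounding markup.
                out.append(data);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <br> is an empty element, so it is reported as a simple tag.
                if (tag == HTML.Tag.BR) {
                    out.append('\n');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Assumption: these block-level tags should end a line.
                if (tag == HTML.Tag.P || tag == HTML.Tag.DIV || tag == HTML.Tag.TR) {
                    out.append('\n');
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the HTML.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convertHtmlToPlainText(
                "<p>Tom &amp; Jerry</p><p>Second<br>line</p>"));
    }
}
```

Whitespace in the result may still need tidying, since this sketch passes through whatever the parser reports.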

Answer 4

Answered by Rich Seller

I use TagSoup; it is available for several languages and does a really good job with HTML found "in the wild". It produces a cleaned-up version of the HTML, or XML, which you can then process with a DOM/SAX parser.
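
As a rough sketch of that second step, here is how text could be pulled out of already well-formed markup with the JDK's own SAX parser. It does not call TagSoup itself: the input is assumed to be well-formed XML such as a cleaner would produce, the class name XhtmlTextExtractor is hypothetical, and ending a line after p and br is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts character data from well-formed (X)HTML.
final class XhtmlTextExtractor {

    static String extractText(String xhtml) {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node; appending handles that.
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: paragraphs and line breaks each end a line.
                if ("p".equalsIgnoreCase(qName) || "br".equalsIgnoreCase(qName)) {
                    out.append('\n');
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractText(
                "<html><body><p>Tom &amp; Jerry</p><p>Second<br/>line</p></body></html>"));
    }
}
```

Only the five predefined XML entities (such as &amp;) are resolved this way; HTML-specific named entities would have to be converted by the cleaning step beforehand.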

Answer 5

Answered by firefly2442

I've used Apache Commons Lang to go the other way, but it looks like it can do what you need via StringEscapeUtils.
