Java: Removing HTML entities while preserving line breaks with JSoup

Question

Asked by joshschreuder

I have been using JSoup to parse lyrics and it has been great until now, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which retains line breaks like this:

Gl&oacute;andi augu, silfurn&aacute;tt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;

But this has the unfortunate side effect, as you can see, of retaining HTML entities and tags.

However, if I use Node.text(), I get a better-looking result, free of tags and entities:

Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

This has another unfortunate side effect: the line breaks are removed and everything is compressed into a single line.
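The two behaviours described above can be reproduced with a small sketch. The class and method names (HtmlVersusText, asHtml, asText) are made up for illustration, and the code assumes a jsoup jar is on the classpath; only Jsoup.parse(), body(), html() and text() come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class HtmlVersusText {
    // Returns the inner HTML of the parsed fragment: <br> tags and entities are kept.
    static String asHtml(String fragment) {
        Element body = Jsoup.parse(fragment).body();
        return body.html();
    }

    // Returns the normalised text: tags are dropped and entities decoded,
    // but whitespace (including newlines) is collapsed to single spaces.
    static String asText(String fragment) {
        return Jsoup.parse(fragment).body().text();
    }
}
```

Calling asText() on a fragment such as "a &amp;amp; b\n&lt;br /&gt;c" should give one line with the entity decoded, while asHtml() keeps the markup.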

Simply replacing <br /> in the node before calling Node.text() yields the same result. It seems the method itself compresses the text onto a single line, ignoring newlines.

Is it possible to have the best of both worlds, with tags and entities handled correctly while preserving the line breaks? Or is there another way of decoding entities and removing tags without having to replace them manually?

Answer 1

Accepted answer by qwerty

(disclaimer) I haven't used this API ... but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Line breaks could be inserted when special tags like <br> are encountered.

The TextNode.getWholeText() call also looks useful.
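A minimal sketch of the traversal idea in this answer, not a tested recipe from the answerer. It assumes a jsoup version that provides the static NodeTraversor.traverse(NodeVisitor, Node) method; the class and method names (LineBreakText, htmlToText) are made up for illustration. It uses TextNode.text() rather than getWholeText(), so newlines in the HTML source do not leak into the output and only <br> tags produce line breaks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, not part of jsoup.
class LineBreakText {
    // Walks every descendant of <body>, appending decoded text for text nodes
    // and a newline for each <br> element.
    static String htmlToText(String html) {
        StringBuilder sb = new StringBuilder();
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns entity-decoded, whitespace-normalised text
                    sb.append(((TextNode) node).text());
                } else if ("br".equals(node.nodeName())) {
                    sb.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        }, Jsoup.parse(html).body());
        // Drop the spaces that normalisation leaves around each inserted newline
        return sb.toString().replaceAll("[ \\t]*\\n[ \\t]*", "\n").trim();
    }
}
```

With the lyrics from the question, each <br /> would become one newline and the empty line between the two verses would be kept.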

Answer 2

Answer by petrumo

Based on another answer from Stack Overflow, I added a few fixes and came up with:

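    // Swap <br> tags and literal newlines for a placeholder so that text() cannot collapse them
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Turn each placeholder (and the space text() may leave after it) back into a newline
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();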

Hope this helps.
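The placeholder technique from this answer can also be sketched with jsoup's selector API instead of a regex. This is a variant under stated assumptions, not the answerer's code: the class and method names (BrMarker, toText) are made up, and the "br2nl" marker is assumed not to occur in the real text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup.
class BrMarker {
    // Assumed placeholder; it must not appear in the document's own text.
    private static final String MARKER = "br2nl";

    static String toText(String html) {
        Document doc = Jsoup.parse(html);
        // Insert the marker as a text node after every <br> element
        doc.select("br").after(MARKER);
        // text() strips tags and decodes entities; the marker survives as plain text
        String flat = doc.body().text();
        // Replace each marker, plus any single space next to it, with a newline
        return flat.replaceAll(" ?" + MARKER + " ?", "\n").trim();
    }
}
```

Unlike the snippet above, this variant leaves newlines in the HTML source alone, so only <br> tags produce line breaks in the result.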
