Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/5348455/

Time: 2020-10-30 10:43:18  Source: igfitidea

Removing HTML entities while preserving line breaks with JSoup

java html parsing jsoup

Asked by joshschreuder

I have been using JSoup to parse lyrics, and it has been great until now, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which retains line breaks like so:

    Glóandi augu, silfurnátt
    <br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
    <br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
    <br />
    <br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
    <br />Kolni&eth;ur svart, hvergi bjart n&eacute;

But this has the unfortunate side effect, as you can see, of retaining HTML entities and tags.

However, if I use Node.text(), I get a better-looking result, free of tags and entities:

    Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

This has another unfortunate side effect: it removes the line breaks and compresses everything into a single line.

Simply replacing <br /> in the node before calling Node.text() yields the same result; the method itself appears to compress the text onto a single line, ignoring newlines.

Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks? Or is there another way of decoding entities and removing tags without having to replace them manually?

Accepted answer by qwerty

(Disclaimer) I haven't used this API, but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Line breaks could be inserted when special tags like <br> are encountered.

The TextNode.getWholeText() call also looks useful.
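The idea of emitting text piece by piece and adding a newline whenever a <br> is met can be sketched without jsoup at all. The following is only an illustration under stated assumptions: a regular expression stands in for a real node walk, the class and method names (BreakPreservingText, toText, decode) are hypothetical, and the entity table covers just the names that occur in the sample lyrics. With jsoup, the same logic would presumably sit in a traversal over the parsed nodes (for example through its NodeVisitor interface) rather than in a regex.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BreakPreservingText {
    // Hypothetical, deliberately tiny entity table: only the names used in the sample lyrics.
    private static final Map<String, String> ENTITIES = Map.of(
            "oacute", "\u00f3", "Oacute", "\u00d3", "aacute", "\u00e1", "eacute", "\u00e9",
            "iacute", "\u00ed", "ouml", "\u00f6", "eth", "\u00f0", "amp", "&");

    // Matches either one tag (group 1 = tag name) or a run of text up to the next tag (group 2).
    private static final Pattern TOKEN = Pattern.compile("<\\s*/?\\s*(\\w+)[^>]*>|([^<]+)");
    private static final Pattern ENTITY = Pattern.compile("&(\\w+);");

    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                // A tag: only <br> produces output, every other tag is dropped.
                if (m.group(1).equalsIgnoreCase("br")) {
                    out.append('\n');
                }
            } else {
                // Text: collapse whitespace (including source newlines), then decode entities.
                out.append(decode(m.group(2).replaceAll("\\s+", " ")));
            }
        }
        // Remove the spaces left around each inserted newline.
        return out.toString().replaceAll(" *\n *", "\n").trim();
    }

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Unknown entity names are left untouched.
            String replacement = ENTITIES.getOrDefault(m.group(1), m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Calling BreakPreservingText.toText(...) on the lyrics fragment from the question should give one line per <br>, with an empty line where two <br> tags follow each other. The sketch does not handle comments, scripts, or numeric entities, which is one reason a real parser is the better tool for arbitrary HTML.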

Answer by petrumo

Based on another answer from Stack Overflow, I added a few fixes and came up with:

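    // Replace <br> tags and raw newlines with a placeholder so they survive text()
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Convert each placeholder (plus the space that may follow it) back to a newline
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();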

Hope this helps
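Why a placeholder is needed, rather than replacing <br> with "\n" directly, can be shown with a small jsoup-free sketch. It rests on one assumption: the hypothetical flatten() method below stands in for the whitespace collapsing the question observes in Node.text(), and it is not jsoup's actual implementation. Unlike the snippet above, this variant marks only <br> tags and leaves raw newlines to be collapsed.

```java
class PlaceholderRoundTrip {
    // Marker text borrowed from the answer above; any string unlikely to occur in the content works.
    static final String MARK = "br2nl";

    // Hypothetical stand-in for a text extractor: drops tags and collapses
    // every whitespace run (newlines included) into a single space.
    static String flatten(String html) {
        return html.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
    }

    // Naive attempt: turn <br> into "\n" first. The newline is whitespace, so flatten() erases it.
    static String naive(String html) {
        return flatten(html.replaceAll("(?i)<br[^>]*>", "\n"));
    }

    // Placeholder attempt: the marker is ordinary text, so it survives flatten()
    // and can be turned into a newline afterwards.
    static String withPlaceholder(String html) {
        String flat = flatten(html.replaceAll("(?i)<br[^>]*>", MARK));
        return flat.replaceAll(" ?" + MARK + " ?", "\n").trim();
    }
}
```

For an input such as "one\n<br />two<BR>three", naive() ends up with everything on a single line, while withPlaceholder() returns three lines. The price of the trick is that the marker must never occur in the real text.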
