How to remove HTML tags in Java

Notice: This page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1699313/

Time: 2020-08-12 21:42:03  Source: igfitidea

How to remove HTML tag in Java

java html regex

Asked by freddiefujiwara

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

Accepted answer by tangens

You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

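// htmlCleaner is an org.htmlcleaner.HtmlCleaner instance; stream holds the HTML input
TagNode root = htmlCleaner.clean( stream );
// @id selects the id attribute of the div
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// Check the first result, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}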

Answer by Moishe Lettvin

No. Regular expressions cannot, by definition, parse HTML.

You could use a regex such as s/<[^>]*\>// or something similarly naive, but it's going to be insufficient, especially if you're interested in removing the contents of tags.

As another poster said, use an actual HTML parser.
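To illustrate where the naive pattern falls short, here is a minimal Java sketch. It applies the same pattern globally with String.replaceAll; the class and method names (NaiveTagStripDemo, stripTags) are hypothetical and only serve the illustration.

```java
class NaiveTagStripDemo {
    // Naive approach: delete anything that looks like <...>.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup works as expected: prints "Hello world"
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
        // The tags are removed, but the script body is left behind:
        // prints "alert(1)text"
        System.out.println(stripTags("<script>alert(1)</script>text"));
    }
}
```

The second and third cases show the two failure modes mentioned above: markup the pattern misreads, and tag contents it never touches.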

Answer by Andrey Adamovich

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

It removes only the tags, not other HTML constructs. For anything more complex, you should use a parser.

EDIT: To avoid problems with HTML comments you can do the following:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
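A small sketch of why removing comments first can matter: a comment that contains a '>' confuses the tag pattern on its own. The class and method names are hypothetical, and the sketch uses the (?s) flag so that comments spanning several lines are also matched.

```java
class CommentThenTagStrip {
    // Tag pattern only: a '>' inside a comment ends the match early.
    static String tagsOnly(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Remove comments first, then the remaining tags.
    static String commentsThenTags(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p>price<!-- if a > b --> list</p>";
        // Comment debris survives: prints "price b --> list"
        System.out.println(tagsOnly(html));
        // Comment removed cleanly: prints "price list"
        System.out.println(commentsThenTags(html));
    }
}
```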

Answer by BalusC

Alternatively, if your intent is to display user-controlled input back to the client, then you can also just replace all < by &lt; and all > by &gt;. This way the HTML won't be interpreted as-is by the client's application (the web browser).

If you're using JSP as view technology, then you can use JSTL's c:out for this. By default it escapes the HTML special characters. So for example

<c:out value="<script>alert('XSS');</script>" />

will NOT display the alert, but just show the actual string as is.
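Outside of JSP, the same replacement can be sketched in plain Java. The helper below is hypothetical (HtmlEscapeDemo and escapeHtml are illustrative names, not part of any library). Besides < and >, it also escapes & and the double quote, which is an assumption beyond what the answer spells out.

```java
class HtmlEscapeDemo {
    // Hypothetical helper: replaces markup characters with entities so the
    // browser shows the text instead of interpreting it as HTML.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;script&gt;alert('XSS');&lt;/script&gt;
        System.out.println(escapeHtml("<script>alert('XSS');</script>"));
    }
}
```

Walking the string once avoids the double-escaping that chained replace calls can cause if & is not handled first.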

Answer by Kandha

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("<.*?>", "");

Answer by Simon

There is JSoup, a Java library made for HTML manipulation. Look at the clean() method and the Whitelist object. It is an easy-to-use solution!

Answer by Saeid Zebardast

You don't need any HTML parser. The code below removes all HTML comments:

htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
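The (?s) prefix in the pattern turns on DOTALL mode, so that . also matches line terminators. A minimal sketch of the difference on a comment that spans two lines (the class and method names are hypothetical):

```java
class DotallCommentDemo {
    // Without (?s), '.' stops at line breaks, so multi-line comments survive.
    static String withoutDotall(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    // With (?s), '.' also matches line breaks, so the whole comment is removed.
    static String withDotall(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "before<!-- line one\nline two -->after";
        // The input comes back unchanged.
        System.out.println(withoutDotall(html));
        // prints "beforeafter"
        System.out.println(withDotall(html));
    }
}
```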
