Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/11229831/

Regular expression to remove HTML tags from a string

Tags: html, regex

Asked by danny

Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

Answered by Roddy of the Frozen Peas

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are Java, but the regex will be similar -- if not identical -- for other languages.

String target = someString.replaceAll("<[^>]*>", "");

This assumes that the non-HTML text does not contain any < or > characters and that the input string is correctly structured.
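As a minimal sketch of the one-liner above applied to the input from the question, the following wraps it in a small class. The class name TagStripper and the method name stripTags are made up for illustration, and the same assumptions apply: no stray < or > in the text, and well-formed input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example only.
final class TagStripper {
    // Matches "<", then any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every tag-like span. Assumes the text itself contains no < or >.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Input taken from the question.
        System.out.println(stripTags("<td class=\"played\">0</td>")); // prints 0
    }
}
```

Precompiling the Pattern is only a convenience when the method is called many times; the behavior is the same as calling replaceAll on the string directly.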

If you know you are dealing with a specific tag -- for example, you know the text contains only <td> tags -- you could do something like this:

String target = someString.replaceAll("(?i)</?td[^>]*>", "");
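For illustration, here is a sketch that generalizes the single-tag idea to any tag name and strips both the opening and the closing form. The class SingleTagStripper and the method stripTag are hypothetical names, and the \b word boundary is an addition not present in the answer; it is there so that a name such as "t" does not also match <td>.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are assumptions for this example only.
final class SingleTagStripper {
    // Removes the opening and closing forms of one tag name,
    // leaving other markup and the enclosed text in place.
    static String stripTag(String html, String tagName) {
        // (?i) ignores case, /? also matches the closing tag,
        // and \b stops the name from matching a longer tag name.
        String regex = "(?i)</?" + Pattern.quote(tagName) + "\\b[^>]*>";
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(stripTag("<tr><td class=\"played\">0</td></tr>", "td")); // prints <tr>0</tr>
    }
}
```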

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, then collapses runs of whitespace into one space, and finally trims any whitespace from the ends.
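If the individual values are needed separately rather than joined by spaces, one alternative (not part of the original answer) is to capture the text between each pair of tags with a group. The sketch below uses the hypothetical names CellExtractor and cellTexts, and it assumes the <td> elements are not nested and each one is properly closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are assumptions for this example only.
final class CellExtractor {
    // (?is): ignore case, and let "." match line breaks too.
    // Group 1 lazily captures everything between <td ...> and the next </td>.
    private static final Pattern CELL =
            Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td>");

    // Returns the trimmed content of each <td> element, in document order.
    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // prints [Something, Another Thing]
        System.out.println(cellTexts("<td>Something</td><td>Another Thing</td>"));
    }
}
```

The lazy quantifier .*? matters here: a greedy .* would run from the first <td> to the last </td> and return a single match.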

Answered by Joey

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is, that may well fail.
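To make the failure modes concrete, here is a small sketch with two inputs that break the trivial pattern. The inputs and the class name RegexPitfalls are invented for illustration.

```java
// Hypothetical demo class; the name and inputs are assumptions for this example only.
final class RegexPitfalls {
    // The trivial approach: delete anything that looks like <...>.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A ">" inside an attribute value ends the match too early,
        // so part of the attribute leaks into the result: " b\">link"
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));

        // Comparison operators in plain text look like one big tag
        // and are swallowed together with the text between them: "1  2"
        System.out.println(stripTags("1 < 2 and 3 > 2"));
    }
}
```

Cases like these are why the answers here recommend a real HTML parser whenever the input is not tightly controlled.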

Answered by mihaisimi

You could do it with jsoup (http://jsoup.org/):

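import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist
Whitelist whitelist = Whitelist.none(); // allow no tags at all, keep text only
String cleanStr = Jsoup.clean(yourText, whitelist);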