Regular expression to remove HTML tags from a string

Question

Asked by danny

Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

Answer 1

Answered by Roddy of the Frozen Peas

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are in Java, but the regex will be similar, if not identical, in other languages.

String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > characters and that your input string is correctly structured.
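
As a minimal sketch of the expression above applied to the input from the question: the class name `TagStripper` and the method name `stripTags` are made up for illustration, and the same assumptions apply (no stray < or > in the text, well-formed input).

```java
final class TagStripper {
    // Removes anything that looks like a tag: '<', then any run of
    // characters other than '>', then the closing '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The sample input from the question.
        String input = "<td class=\"played\">0</td>";
        System.out.println(stripTags(input)); // prints 0
    }
}
```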

If you know the tags are of a specific kind, for example you know the text contains only <td> tags, you could do something like this:

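// "</?" also matches the closing </td>; "\\b" keeps the pattern from matching longer tag names.
String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");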

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

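// The backslash must be doubled inside a Java string literal ("\\s+"); "</?" also matches closing tags.
String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();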

This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.
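
The three steps can be sketched as a small self-contained program. The class name `CellTextJoiner` and the method name `stripAndNormalize` are hypothetical, and the pattern here is written to match both opening and closing <td> tags, which is an assumption about the intended behaviour.

```java
final class CellTextJoiner {
    static String stripAndNormalize(String html) {
        return html
                // 1. Replace each opening or closing <td> tag with a single space.
                .replaceAll("(?i)</?td\\b[^>]*>", " ")
                // 2. Collapse any run of whitespace into one space.
                .replaceAll("\\s+", " ")
                // 3. Drop the leading and trailing space left by the outer tags.
                .trim();
    }

    public static void main(String[] args) {
        String input = "<td>Something</td><td>Another Thing</td>";
        System.out.println(stripAndNormalize(input)); // prints Something Another Thing
    }
}
```

Note that the cell boundary is lost: "Something" and "Another Thing" are now separated only by a space, so this suits display text rather than recovering individual cell values.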

Answer 2

Answered by Joey

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is, that may well fail.
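
One illustration of such a failure, as a sketch rather than an exhaustive list: a > inside an attribute value ends the match early, so part of the tag leaks into the output. The class name `RegexEdgeCase` and the sample input are invented for this example.

```java
final class RegexEdgeCase {
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the title attribute is taken as the end of the tag.
        String input = "<a title=\"a > b\">text</a>";
        // Only <a title="a > and </a> are removed, so the output is:  b">text
        System.out.println(stripTags(input));
    }
}
```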

Answer 3

Answered by mihaisimi

You could do it with jsoup (http://jsoup.org/):

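// Whitelist.none() permits no tags at all, so only the text content is kept.
// Newer jsoup releases name this class Safelist (Safelist.none()) instead of Whitelist.
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);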
