Java regex to strip HTML tags

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original StackOverflow URL: http://stackoverflow.com/questions/4075742/

Date: 2020-08-14 11:19:36  Source: igfitidea

Regex to strip HTML tags

Tags: java, html, regex

Asked by ADIT

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I'd like to use regex to remove the HTML tags so that the output is:

some text
another text

Can anyone suggest how to do this with regex?

Accepted answer by Prabhakaran

You can use an HTML parser such as Jericho HTML Parser.

You can download it from http://jericho.htmlparser.net/docs/index.html

Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.

The presence of badly formatted HTML does not interfere with the parsing.

Answer by aioobe

Since you asked, here's a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

(Ideone.com demo)
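
The linked demo is not reproduced on this page, so here is a minimal, self-contained sketch of the same one-liner applied to the input from the question. The class name QuickStripDemo and the method name strip are made up for illustration.

```java
class QuickStripDemo {

    // Removes anything that looks like a tag:
    // '<', then any run of characters other than '>', then '>'.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String input = "<font size=\"5\"><p>some text</p>\n<p> another text</p></font>";
        System.out.println(strip(input));
    }
}
```

Only the tags are removed. Everything between them, including the newline and the space before "another", is left untouched.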

Using regexps to deal with HTML is a pretty bad idea, though. The above hack won't deal with input such as:

  • <tag attribute=">">Hello</tag>
  • <script>if (a < b) alert('Hello>');</script>

etc.
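
To make the failure concrete, the sketch below runs the same replaceAll call on the two inputs listed above. The class and method names are hypothetical. Only the regex comes from the answer.

```java
class RegexStripPitfalls {

    // The quick-and-dirty pattern from the answer above.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the attribute value ends the first match too early,
        // so the rest of the attribute leaks into the output.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));
        // prints: ">Hello

        // The '<' in the script body is treated as the start of a tag,
        // and the script source is not removed as a whole.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
        // prints: if (a ');
    }
}
```

In both cases the regex has no notion of quoted attribute values or of script content, which is exactly what a real parser keeps track of.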

A better approach is to use a parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().

Answer by BalusC

Use an HTML parser. Here's a Jsoup example.

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:

some text another text

Or if you want to preserve newlines:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

Jsoup offers further advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign, but if you know the HTML structure in detail beforehand, it is still doable.

Answer by Fabiano Francesconi

If you use Jericho, then you just have to use something like this:

public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}

Of course you can do the same even with an Element:

for (Element link : links) {
  System.out.println(link.getTextExtractor().toString());
}

Answer by Alexis Dufrenoy

Starting from aioobe's code, I tried something more daring:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);

The code to strip every HTML tag would look like this:

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

    static {
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }

}

I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use a for-each loop and StringBuilder instead.
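
For readers who are not tied to Java 1.4, here is one possible modernized sketch. It assumes Java 8 or later. It uses String.join instead of the manual loop and Pattern.CASE_INSENSITIVE instead of listing each tag in both lower and upper case, which are my substitutions and not part of the original answer. The class name and the shortened tag list are hypothetical.

```java
import java.util.regex.Pattern;

class HtmlSanitizerModern {

    // Shortened, illustrative tag list; the full list from the answer works the same way.
    private static final String[] TAGS = {"font", "p", "br"};

    // Compiled once. CASE_INSENSITIVE replaces the lower/upper duplication of each tag name.
    private static final Pattern PATTERN = Pattern.compile(
            "</?(" + String.join("|", TAGS) + ").*?/?>",
            Pattern.CASE_INSENSITIVE);

    static String sanitize(String input) {
        return PATTERN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(
                sanitize("<FONT size=\"5\"><p>some text</p><br/> <P>another text</p></font>"));
        // prints: some text another text
    }
}
```

Apart from case handling, the pattern has the same shape as the original, so it shares the same limitations.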

Advantages:

  • You can generate lists of tags you want to strip, which means you can keep those you want
  • You avoid stripping stuff that isn't an HTML tag
  • You keep the whitespace

Drawbacks:

  • You have to list every HTML tag you want to strip from your string, which can be a lot, for example if you want to strip everything.

If you see any other drawbacks, I would really be glad to know them.
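
A few further drawbacks show up when only some tags are listed. The sketch below uses the same pattern shape, reduced to p and font, and compares it with a stricter variant. The variant (a \b word boundary, [^>]* instead of .*?, and CASE_INSENSITIVE) is an assumption of mine, not something from the answer, and all names are hypothetical.

```java
import java.util.regex.Pattern;

class TagListPitfalls {

    // Same pattern shape as the answer above, reduced to two tags for brevity.
    static String stripListed(String input) {
        return input.replaceAll("</?(p|P|font|FONT){1}.*?/?>", "");
    }

    // Assumed variant: \b stops "p" from also matching "pre", [^>]* lets a tag
    // span several lines, and CASE_INSENSITIVE covers mixed-case tag names.
    private static final Pattern STRICTER =
            Pattern.compile("</?(p|font)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String stripStricter(String input) {
        return STRICTER.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {
            "<pre>code</pre>",           // "p" is a prefix of "pre"
            "<p\nclass=\"x\">text</p>",  // '.' does not match a newline by default
            "<Font size=\"5\">x</Font>"  // neither all-lowercase nor all-uppercase
        };
        for (String s : samples) {
            System.out.println("[" + stripListed(s) + "] vs [" + stripStricter(s) + "]");
        }
    }
}
```

With the original shape, the pre tags are stripped although only p was listed, the opening tag that spans two lines survives, and the mixed-case Font tags survive. The stricter variant handles these three samples, but it is still a regex and still trips over a '>' inside an attribute value.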
