Stripping HTML tags in Java

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/832620/

Time: 2020-08-11 19:59:53  Source: igfitidea

java html

Asked by Todd

Is there an existing Java library which provides a method to strip all HTML tags from a String? I'm looking for something equivalent to the strip_tags function in PHP.

I know that I can use a regex as described in this Stackoverflow question; however, I was curious whether there is already a stripTags() method floating around somewhere in the Apache Commons library that can be used.

Accepted answer by Todd

After having this question open for almost a week, I can say with some certainty that there is no method available in the Java API or Apache libraries which strips HTML tags from a String. You would either have to use an HTML parser as described in the previous answers, or write a simple regular expression to strip out the tags.

Answer by Charlie Martin

There may be some, but the most robust approach is to use an actual HTML parser. There's one here, and if the HTML is reasonably well formed, you can also use SAX or another XML parser.
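
As a rough illustration of the "another XML parser" option, here is a minimal sketch that uses the JDK's built-in DOM parser to pull the text out of well-formed markup. The class name XmlTextExtractor and the synthetic <root> wrapper element are assumptions made for this example, not part of any answer above. It only works when the input is well-formed XML: unclosed tags or HTML-only entities such as &nbsp; will make the parser reject the input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: extracts text from markup that is well-formed XML.
public class XmlTextExtractor {

    public static String textOf(String wellFormedMarkup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // XML needs exactly one root element, so wrap the fragment (assumption).
            String wrapped = "<root>" + wellFormedMarkup + "</root>";
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            // getTextContent() concatenates the text of all descendant nodes.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Calling XmlTextExtractor.textOf("<p>Hello <b>world</b></p>") returns "Hello world". For real-world HTML, which is often not well formed, a dedicated HTML parser is the safer choice.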

Answer by Solomon Duskis

I've used nekoHtml to do that. It can strip all tags, but it can just as easily keep or strip a subset of tags.

Answer by Jason Fritcher

Whatever you do, make sure you normalize the data before you start trying to strip tags. I recently attended a web app security workshop that covered XSS filter evasion. One would normally think that searching for < or &lt; or its hex equivalent would be sufficient. I was blown away after seeing a slide with 70 ways that < can be encoded to beat filters.

Update:

Below is the presentation I was referring to; see slide 26 for the 70 ways to encode <.

Filter Evasion: Houdini on the Wire
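
To make the normalization advice a little more concrete, here is a minimal sketch of one possible normalization step: decoding decimal and hexadecimal character references (such as &#60; or &#x3c;) so that an encoded < becomes visible before any tag stripping runs. The class name EntityNormalizer and the choice to handle only numeric references are assumptions for this example. It covers only a small part of the encodings the presentation lists, so treat it as an illustration rather than a complete defence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes numeric character references only.
public class EntityNormalizer {

    // Matches &#60; &#x3c; &#X3C; and the same forms without the trailing ';'.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));?");

    public static String decodeNumericReferences(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 16)   // hexadecimal form
                : Integer.parseInt(m.group(2), 10);  // decimal form
            // Leave references to invalid code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, EntityNormalizer.decodeNumericReferences("&#60;script&#x3E;") returns "<script>", which a later stripping step can then recognize as a tag.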

Answer by Arthur

Wicket uses the following method to escape HTML, located in org.apache.wicket.util.string.Strings:

public static CharSequence escapeMarkup(final String s, final boolean escapeSpaces,
    final boolean convertToHtmlUnicodeEscapes)
{
    if (s == null)
    {
        return null;
    }
    else
    {
        int len = s.length();
        final AppendingStringBuffer buffer = new AppendingStringBuffer((int)(len * 1.1));

        for (int i = 0; i < len; i++)
        {
            final char c = s.charAt(i);

            switch (c)
            {
                case '\t' :
                    if (escapeSpaces)
                    {
                        // Assumption is four space tabs (sorry, but that's
                        // just how it is!)
                        buffer.append("&nbsp;&nbsp;&nbsp;&nbsp;");
                    }
                    else
                    {
                        buffer.append(c);
                    }
                    break;

                case ' ' :
                    if (escapeSpaces)
                    {
                        buffer.append("&nbsp;");
                    }
                    else
                    {
                        buffer.append(c);
                    }
                    break;

                case '<' :
                    buffer.append("&lt;");
                    break;

                case '>' :
                    buffer.append("&gt;");
                    break;

                case '&' :

                    buffer.append("&amp;");
                    break;

                case '"' :
                    buffer.append("&quot;");
                    break;

                case '\'' :
                    buffer.append("&#039;");
                    break;

                default :

                    if (convertToHtmlUnicodeEscapes)
                    {
                        int ci = 0xffff & c;
                        if (ci < 160)
                        {
                            // nothing special only 7 Bit
                            buffer.append(c);
                        }
                        else
                        {
                            // Not 7 Bit use the unicode system
                            buffer.append("&#");
                            buffer.append(new Integer(ci).toString());
                            buffer.append(';');
                        }
                    }
                    else
                    {
                        buffer.append(c);
                    }

                    break;
            }
        }

        return buffer;
    }
}

Answer by Jakob Alexander Eichler

This is what I found on Google. It worked fine for me.

String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
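
A small variation on the regex approach, sketched under the assumption that a precompiled pattern and a helper class (here called RegexTagStripper, a made-up name) are acceptable. It uses [^>]* instead of .*? because . does not match line breaks unless Pattern.DOTALL is set, so a tag whose attributes span several lines would otherwise be left in place.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like <...>.
public class RegexTagStripper {

    // "[^>]*" also matches line breaks inside a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

RegexTagStripper.stripTags("<p>Hello <b>world</b></p>") returns "Hello world". Like any regex-only approach, it is still fooled by a > inside a quoted attribute value and by a bare < in ordinary text, so it is best kept for input you already trust.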

Answer by Lou

Hi, I know this thread is old, but it still came out on top on Google, and I was looking for a quick fix to the same problem. I couldn't find anything useful, so I came up with this code snippet; hope it helps someone. It just loops over the string and skips all the tags. Plain and simple.

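public static String stripTags(String inp)
{
    boolean intag = false;
    StringBuilder outp = new StringBuilder();

    for (int i = 0; i < inp.length(); ++i)
    {
        char c = inp.charAt(i);
        if (!intag && c == '<')
        {
            // Entering a tag: skip everything up to the closing '>'
            intag = true;
            continue;
        }
        if (intag && c == '>')
        {
            // Leaving the tag
            intag = false;
            continue;
        }
        if (!intag)
        {
            outp.append(c);
        }
    }
    return outp.toString();
}

// stripTags("<H1>Some <b>HTML</b> <span style=blablabla>text</span>") returns "Some HTML text"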

Answer by jebbie

Use JSoup. It is well documented and available on Maven, and after a day spent with several libraries it is, for me, the best one I can imagine. In my opinion, a job like parsing HTML into plain text should be possible in one line of code; otherwise the library has failed somehow. So here is the JSoup one-liner. In Markdown4J and Markdownj something like this is not possible, and in htmlCleaner it is a pain, taking about 50 lines of code.

String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));

And what you get is real plain text (not just the HTML source code as a String, as in other libraries); it really does a great job at that. The quality is more or less the same as Markdownify for PHP.

Answer by michaeldd

I know that this question is quite old, but I have been looking for this too, and it seems that it is still not easy to find a good and simple solution in Java.

Today I came across this small function library. It actually attempts to imitate the PHP strip_tags function.

http://jmelo.lyncode.com/java-strip_tags-php-function/

It works like this (copied from their site):

    import static com.lyncode.jtwig.functions.util.HtmlUtils.stripTags;

    public class StripTagsExample {
      public static void main(String... args) {
        String result = stripTags("<!-- <a href='test'></a>--><a>Test</a>", "");
        // Produced result: Test
      }
    }
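
PHP's strip_tags also accepts a list of tags to keep. The following is a minimal, standard-library-only sketch of that idea; it is not the implementation of the library above, and the class name AllowListTagStripper and its signature are invented for this example. It removes comments first, then drops every tag whose name is not in the allow list.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: strips tags except those named in an allow list.
public class AllowListTagStripper {

    // DOTALL lets a comment span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Group 1 captures the tag name of an opening or closing tag.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    public static String stripTags(String html, Set<String> allowedTags) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        Matcher m = TAG.matcher(withoutComments);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Keep allowed tags verbatim, drop everything else.
            String replacement = allowedTags.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With an empty allow list, the input from the example above, "<!-- <a href='test'></a>--><a>Test</a>", comes back as "Test". Note that allowed tags are kept together with all their attributes, so this sketch is not a sanitizer for untrusted input.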

Answer by Mitja Gustin

With a pure iterative approach and no regex:

public String stripTags(final String html) {

    final StringBuilder sbText = new StringBuilder(1000);
    final StringBuilder sbHtml = new StringBuilder(1000);

    boolean isText = true;

    for (char ch : html.toCharArray()) {
        if (isText) { // outside html
            if (ch != '<') {
                sbText.append(ch);
                continue;
            } else {   // switch mode             
                isText = false;      
                sbHtml.append(ch); 
                continue;
            }
        }else { // inside html
            if (ch != '>') {
                sbHtml.append(ch);
                continue;
            } else {      // switch mode    
                isText = true;     
                sbHtml.append(ch); 
                continue;
            }
        }
    }

    return sbText.toString();
}
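
One limitation of the loop above is that a > inside a quoted attribute value ends the tag early. The following is a minimal sketch of the same iterative idea with one extra piece of state that tracks quotes; the class name QuoteAwareTagStripper is an assumption for this example. It still treats any < in ordinary text as the start of a tag, so it is not a substitute for a real parser.

```java
// Hypothetical variant of the iterative approach that respects quoted attribute values.
public class QuoteAwareTagStripper {

    public static String stripTags(String html) {
        StringBuilder text = new StringBuilder(html.length());
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted attribute value"

        for (int i = 0; i < html.length(); i++) {
            char ch = html.charAt(i);
            if (!inTag) {
                if (ch == '<') {
                    inTag = true;        // start of a tag: stop copying
                } else {
                    text.append(ch);     // ordinary text: copy it
                }
            } else if (quote != 0) {
                if (ch == quote) {
                    quote = 0;           // closing quote of the attribute value
                }
            } else if (ch == '"' || ch == '\'') {
                quote = ch;              // opening quote inside the tag
            } else if (ch == '>') {
                inTag = false;           // end of the tag
            }
        }
        return text.toString();
    }
}
```

For example, QuoteAwareTagStripper.stripTags("<a title=\"x > y\">link</a> text") returns "link text", whereas a loop without quote tracking would leave part of the attribute value in the output.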