C#: How do I filter out all HTML tags except those on a specific whitelist?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/307013/

Time: 2020-08-03 22:31:14  Source: igfitidea

How do I filter all HTML tags except a certain whitelist?

Tags: c#, html, vb.net, regex

Asked by richardtallent

This is for .NET. IgnoreCase is set and MultiLine is NOT set.

Usually I'm decent at regex, maybe I'm running low on caffeine...

Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:

u, i, b, h3, h4, br, a, img

Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.

I want to:

  1. Strip all starting and ending HTML tags other than those listed above.
  2. Remove attributes from the remaining tags, except anchors can have an href.

My search pattern (replaced with an empty string) so far:

<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>

This seems to be stripping all but the start and end tags I want, but there are three problems:

  1. Having to include the end tag version of each allowed tag is ugly.
  2. The attributes survive. Can they be removed in the same single replacement?
  3. Tags starting with the allowed tag names slip through. E.g., "<abbrev>" and "<iframe>".

The following suggested pattern does not strip out tags that have no attributes.

</?(?!i|b|h3|h4|a|img)\b[^>]*>

As mentioned below, ">" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.

Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):

static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Some small tweaks I think could still be made to this answer:

  1. I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".

  2. I think this would break if there are multiple whitespace characters between attributes (example: heavily-formatted HTML with line breaks and tabs between attributes).

Edit 2009-07-23: Here's the final solution I went with (in VB.NET):

 Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
 Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
      ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
 html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)

The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.
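
One way around this caveat is to match every tag and decide what to emit in a callback, rather than deleting matches outright. The following is a minimal sketch under stated assumptions, not the solution used above: the class name `HrefPreservingSanitizer`, the `Allowed` set and both patterns are hypothetical, and, like the question, it assumes ">" never appears inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefPreservingSanitizer
{
    // Hypothetical whitelist for illustration (the tags listed in the question).
    private static readonly HashSet<string> Allowed = new HashSet<string>(
        new[] { "u", "i", "b", "h3", "h4", "br", "a", "img" },
        StringComparer.OrdinalIgnoreCase);

    // Any tag: optional slash, tag name, then everything up to the next '>'.
    private static readonly Regex AnyTag = new Regex(
        @"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");

    // A quoted href attribute (not data-href and the like) in the attribute text.
    private static readonly Regex Href = new Regex(
        @"(?<![\w-])href\s*=\s*(""[^""]*""|'[^']*')", RegexOptions.IgnoreCase);

    public static string Sanitize(string html)
    {
        return AnyTag.Replace(html, m =>
        {
            string slash = m.Groups[1].Value;
            string name = m.Groups[2].Value.ToLowerInvariant();

            if (!Allowed.Contains(name))
                return "";                       // drop tags not on the whitelist
            if (slash == "/")
                return "</" + name + ">";        // end tags carry no attributes
            if (name == "a")
            {
                Match href = Href.Match(m.Groups[3].Value);
                if (href.Success)
                    return "<a href=" + href.Groups[1].Value + ">";
            }
            return "<" + name + ">";             // every other attribute is dropped
        });
    }
}
```

Calling `HrefPreservingSanitizer.Sanitize` on an anchor with both `href` and `onclick` attributes should keep only the `href`. Note that the sketch does not inspect the href value itself, so a `javascript:` URL would pass through unchanged.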

Accepted answer by Jason Kelley

Here's a function I wrote for this task:

static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}

Edit: For some reason I posted a correction to my previous answer as a separate answer, so I am consolidating them here.

I will explain the regex a bit, because it is a little long.

The first part matches an open bracket and 0 or 1 slashes (in case it's a close tag).

Next you see an if-then construct with a lookahead: (?(?=SomeTag)then|else). I am checking whether the next part of the string is one of the acceptable tags. You can see that I concatenate the regex string with the acceptable variable, which holds the acceptable tag names separated by a vertical bar so that any of the terms will match. If it is a match, I put in the word "notag", because no tag would match that, and if the tag is acceptable I want to leave it alone. Otherwise I move on to the else part, where I match any tag name: [a-zA-Z0-9]+
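
To see the conditional in isolation, here is a small sketch that strips attribute-less tags only. The class name `ConditionalDemo` and the two-tag whitelist `b|i` are assumptions made for illustration; the attribute handling from the full pattern is left out.

```csharp
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // If the text after "<" or "</" starts with an acceptable name, the "then"
    // branch demands the literal "notag", which cannot match there, so the
    // tag is left alone. Otherwise the "else" branch consumes the tag name.
    private const string Pattern = @"</?(?(?=b|i)notag|[a-zA-Z0-9]+)>";

    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, "");
    }
}
```

With this pattern, `ConditionalDemo.Strip("<b>x</b><div>y</div><em>z</em>")` leaves the `<b>` pair in place and removes the `<div>` and `<em>` tags.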

Next, I want to match 0 or more attributes, which I assume are in the form attribute="value". So now I group the part representing an attribute, but I use ?: to prevent this group from being captured, for speed: (?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?))

Here I begin with the whitespace character that would be between the tag and attribute names, then match an attribute name: [a-zA-Z0-9\-]+

Next I match an equals sign, and then either kind of quote. I group the quote so it will be captured, and I can use the backreference \1 later to match the same type of quote. In between these two quotes, you can see I use the period to match anything; however, I use the lazy version *? instead of the greedy version * so that it will only match up to the next quote that would end this value.
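
The difference between the lazy and greedy quantifiers is easy to check on its own. This sketch is a simplification of the answer's pattern: the quote is mandatory here rather than optional, and the names `QuoteDemo`, `LazyValue` and `GreedyValue` are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // Lazy: stops at the first quote of the same kind as the opening one.
    public static string LazyValue(string input)
    {
        return Regex.Match(input, @"([""']).*?\1").Value;
    }

    // Greedy: runs on to the last quote of that kind in the input.
    public static string GreedyValue(string input)
    {
        return Regex.Match(input, @"([""']).*\1").Value;
    }
}
```

For the input `a="1" b="2"`, the lazy version matches only `"1"`, while the greedy version swallows `"1" b="2"`. Because of the backreference, an apostrophe inside a double-quoted value, as in `title="it's"`, does not end the match early.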

Next we put a * after closing the groups with parentheses so that it will match multiple attribute/value combinations (or none). Last, we match optional whitespace with \s*, and 0 or 1 ending slashes in the tag for XML-style self-closing tags.

You can see I'm replacing the tags with sausage, because I'm hungry, but you could replace them with empty string too to just clear them out.

Answer by Sherm Pendley

Attributes are the major problem with using regexes to try to work with HTML. Consider the sheer number of potential attributes, and the fact that most of them are optional, and also the fact that they can appear in any order, and the fact that ">" is a legal character in quoted attribute values. When you start trying to take all of that into account, the regex you'd need to deal with it all will quickly become unmanageable.

What I would do instead is use an event-based HTML parser, or one that gives you a DOM tree that you can walk through.

Answer by CMS

This is a good working example on html tag filtering:

Sanitize HTML

Answer by Jan Goyvaerts

The reason that adding the word boundary \b didn't work is that you didn't put it inside the lookahead. Thus, \b will be attempted after < where it will always match if the < starts an HTML tag.

Put it inside the lookahead like this:

<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>

This also shows how you can put the / before the list of tags, rather than with each tag.
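
As a quick check, the pattern can be wrapped in a small helper. This is only a sketch: the class name `WhitelistDemo` is hypothetical, and IgnoreCase is set because the question says it is.

```csharp
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Pattern from this answer: the optional slash and the word boundary
    // both sit inside the negative lookahead.
    private const string Pattern = @"<(?!/?(i|b|h3|h4|a|img)\b)[^>]+>";

    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Given `<b>bold</b> <iframe src=x></iframe> <abbrev>ok</abbrev> <i>it</i>`, the `<iframe>` and `<abbrev>` tags are removed while `<b>` and `<i>` stay. Attributes on the allowed tags still survive with this pattern, so the second problem from the question remains open.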

Answer by Jason Kelley

I think I originally intended to make the values optional, but didn't follow through, as I can see that I added a ? after the equals sign and grouped the value portion of the match. Let's add a ? after that group (marked with a caret) to make it optional in the match as well. I'm not at my compiler right now, but see if this works:

@"</?(?(?=" + acceptable + @")notag|[a-z,A-Z,0-9]+)(?:\s[a-z,A-Z,0-9,\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
                                                                                              ^

Answer by richardtallent

I just noticed the current solution allows tags that start with any of the acceptable tags. Thus, if "b" is an acceptable tag, "blink" is too. Not a huge deal, but something to consider if you are strict about how you filter HTML. You certainly wouldn't want to allow "s" as an acceptable tag, as it would allow "script".
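
One possible fix, borrowing the word-boundary idea from Jan Goyvaerts's answer, is to put \b inside the conditional's lookahead. The sketch below is an assumption-laden illustration rather than a tested replacement for the full pattern: the class name `PrefixDemo` and the `b|i` whitelist are hypothetical, and the attribute part is reduced to `[^>]*`.

```csharp
using System.Text.RegularExpressions;

public static class PrefixDemo
{
    // Original style of lookahead: "blink" passes because it starts with "b".
    private const string Loose = @"</?(?(?=b|i)notag|[a-zA-Z0-9]+)[^>]*>";

    // With \b the acceptable name must end right where the tag name ends.
    private const string Strict = @"</?(?(?=(?:b|i)\b)notag|[a-zA-Z0-9]+)[^>]*>";

    public static string StripLoose(string html)
    {
        return Regex.Replace(html, Loose, "");
    }

    public static string StripStrict(string html)
    {
        return Regex.Replace(html, Strict, "");
    }
}
```

For `<b>a</b><blink>c</blink>`, `StripLoose` leaves the `<blink>` tags in place, while `StripStrict` removes them and keeps only the `<b>` pair.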

Answer by Chirag

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    /// <summary>
    /// Strips HTML tags from the text, ignoring the specified tags
    /// </summary>
    /// <param name="text">the text from which html is to be removed</param>
    /// <param name="isRemoveScript">specify if you want to remove scripts</param>
    /// <param name="ignorableTags">specify the tags that are to be ignored while stripping</param>
    /// <returns>Stripped Text</returns>
    public static string StripHtml(string text, bool isRemoveScript, params string[] ignorableTags)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.Replace("&lt;", "<");
            text = text.Replace("&gt;", ">");
            string ignorePattern = null;

            if (isRemoveScript)
            {
                text = Regex.Replace(text, "<script[^<]*</script>", string.Empty, RegexOptions.IgnoreCase);
            }
            if (!ignorableTags.Contains("style"))
            {
                text = Regex.Replace(text, "<style[^<]*</style>", string.Empty, RegexOptions.IgnoreCase);
            }
            foreach (string tag in ignorableTags)
            {
                //the character b spoils the regex so replace it with strong
                if (tag.Equals("b"))
                {
                    text = text.Replace("<b>", "<strong>");
                    text = text.Replace("</b>", "</strong>");
                    if (ignorableTags.Contains("strong"))
                    {
                        ignorePattern = string.Format("{0}(?!strong)(?!/strong)", ignorePattern);
                    }
                }
                else
                {
                    //Create ignore pattern for the tags to ignore
                    ignorePattern = string.Format("{0}(?!{1})(?!/{1})", ignorePattern, tag);
                }

            }
            //finally add the ignore pattern into regex <[^<]*> which is used to match all html tags
            ignorePattern = string.Format(@"<{0}[^<]*>", ignorePattern);
            text = Regex.Replace(text, ignorePattern, "", RegexOptions.IgnoreCase);
        }

        return text;
    }