Disclaimer: this page is a side-by-side Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/204646/

Time: 2020-08-03 17:56:07  Source: igfitidea

How to validate that a string doesn't contain HTML using C#

Tags: c#, html, validation

Asked by Ben Mills

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

Accepted answer by Ben Mills

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

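// Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
public static bool ContainsXHTML(this string input)
{
    try
    {
        // Wrap the input so that plain text parses as the content of one element.
        XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
        // Plain text yields text nodes only (or no nodes at all when the input is empty).
        return x.Nodes().Any(n => n.NodeType != XmlNodeType.Text);
    }
    catch (XmlException)
    {
        // Input that is not well-formed XML is treated as containing markup.
        return true;
    }
}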

One problem I found was that plain-text ampersands and less-than characters cause an XmlException, which wrongly indicates that the field contains HTML. To fix this, the input string first needs to have its ampersands and less-than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:

// Requires: using System.Text.RegularExpressions;
public static string ConvertXHTMLEntities(this string input)
{
    // Convert bare ampersands to the ampersand entity, leaving existing "&amp;" untouched.
    string output = Regex.Replace(input, "&(?!amp;)", "&amp;");

    // Convert less than to the less than entity, but only when a space follows,
    // so that real tags such as <b> are not escaped.
    output = output.Replace("< ", "&lt; ");
    return output;
}

Now I can take a user submitted string and check that it doesn't contain HTML using the following code:

bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();

I'm not sure this is bulletproof, but I think it's good enough for my situation.
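
As a rough illustration of how the two steps behave together on sample inputs, the sketch below folds them into a single helper. The class and method names (`PlainTextCheck`, `LooksLikeMarkup`) are hypothetical and not part of the answer, and the ampersand step is written with a regex lookahead, which is an assumption about an equivalent way to skip existing `&amp;` entities.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class PlainTextCheck
{
    public static bool LooksLikeMarkup(string input)
    {
        // Escape bare ampersands, leaving existing "&amp;" entities alone.
        string escaped = Regex.Replace(input, "&(?!amp;)", "&amp;");
        // Escape "<" only when a space follows, so real tags stay intact.
        escaped = escaped.Replace("< ", "&lt; ");
        try
        {
            XElement wrapper = XElement.Parse("<wrapper>" + escaped + "</wrapper>");
            // Any node that is not a text node counts as markup.
            return wrapper.Nodes().Any(n => n.NodeType != XmlNodeType.Text);
        }
        catch (XmlException)
        {
            // Unparseable input is reported as markup, which can be a false positive.
            return true;
        }
    }
}
```

With this sketch, `PlainTextCheck.LooksLikeMarkup("a < b & c")` returns false and `PlainTextCheck.LooksLikeMarkup("Hello <b>world</b>")` returns true. An input such as `"a<b"` also returns true, because a less-than sign with no following space is read as the start of a tag and fails to parse.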

Answer by ICR

The following will match any matching pair of tags, e.g. <b>this</b>:

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

The following will match any single tag, e.g. <b> (it doesn't have to be closed):

Regex tagRegex = new Regex(@"<[^>]+>");

You can then use it like so:

bool hasTags = tagRegex.IsMatch(myString);
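
To make the difference between the two patterns concrete, here is a small sketch that holds both side by side. The class name `TagPatterns` and the field names are hypothetical, and the paired pattern assumes the closing tag is meant to be matched with a `\1` backreference to the captured tag name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder class for illustration only.
public static class TagPatterns
{
    // Matches an opening tag followed later by a closing tag of the same name.
    public static readonly Regex PairedTag =
        new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

    // Matches any single "<...>" run, whether or not it is ever closed.
    public static readonly Regex AnyTag = new Regex(@"<[^>]+>");
}
```

`TagPatterns.PairedTag.IsMatch("<b>this</b>")` is true, while `"<b>this"` only matches `AnyTag`. Note that `AnyTag` also matches plain text such as `"1 < 2 and 3 > 2"`, so the looser pattern trades missed tags for false positives. Since `.` does not match a newline by default, the paired pattern may need `RegexOptions.Singleline` if tags can span several lines.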

Answer by Josef

Here you go:

using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
  return Regex.IsMatch(checkString, "<(.|\n)*?>");
}

That is the simplest way, since text enclosed in angle brackets is unlikely to occur naturally.

Answer by DOK

Angle brackets may not be your only challenge. Other character sequences can also be used for injection attacks, such as the common double hyphen "--", which is used in SQL injection, and there are others.

On an ASP.NET page, if validateRequest is set to true in machine.config, web.config, or the page directive, and an HTML tag or one of various other potential script-injection attacks is detected, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client". You probably want to avoid this and provide a more elegant, less scary UI experience.

You could test for both the opening and closing angle brackets < and > using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.

You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
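
The last two suggestions could be sketched roughly as follows. The class and method names are hypothetical, and `System.Net.WebUtility.HtmlEncode` is used here as an assumption so the example runs outside ASP.NET; the answer does not say which HtmlEncode overload it has in mind.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LenientInput
{
    // True when a "<" is later followed by ">" with at least one character between them.
    public static bool HasBracketedText(string text)
    {
        return Regex.IsMatch(text, "<[^>]+>");
    }

    // Encodes the text so that angle brackets display literally in HTML output.
    public static string EncodeForHtml(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

Here `LenientInput.HasBracketedText("5 < 6")` is false, so the value would be accepted, and `LenientInput.EncodeForHtml("5 < 6")` yields `5 &lt; 6`.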

Answer by J c

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.

In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:

bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));

Answer by Mark

Beware when using the HttpUtility.HtmlEncode method mentioned above. If the text you are checking contains special characters such as & or double quotes, but no HTML, the encoded string will still differ from the original and the check will wrongly report HTML. Maybe that's why J c wrote "...depending on how strict you want the check to be..."
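
A minimal sketch of that pitfall, assuming `System.Net.WebUtility.HtmlEncode` as a stand-in for `HttpUtility.HtmlEncode` so the example needs no reference to System.Web (the class and method names are hypothetical):

```csharp
using System.Net;

// Hypothetical helper for illustration only.
public static class EncodeComparison
{
    // Reports "HTML" whenever encoding changes the string at all.
    public static bool DiffersWhenEncoded(string text)
    {
        return text != WebUtility.HtmlEncode(text);
    }
}
```

`EncodeComparison.DiffersWhenEncoded("Tom & Jerry")` returns true even though the text contains no tags, because the ampersand is encoded as `&amp;`. Text without any special characters, such as `"plain text"`, returns false.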

Answer by kns98

This also checks for self-closing tags such as < br />, with optional whitespace. The list does not contain the newer HTML5 tags.

using System.Linq;
using System.Text.RegularExpressions;

internal static class HtmlExts
{
    // Matches one bare tag such as <br>, < br > or <br/>. Tags that carry
    // attributes (e.g. <a href="...">) and closing tags are not matched.
    public static bool containsHtmlTag(this string text, string tag)
    {
        var pattern = @"<\s*" + Regex.Escape(tag) + @"\s*\/?>";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    // tags is a '|'-separated list of tag names to look for.
    public static bool containsHtmlTags(this string text, string tags)
    {
        return tags.Split('|').Any(tag => text.containsHtmlTag(tag));
    }

    public static bool containsHtmlTags(this string text)
    {
        return
            text.containsHtmlTags(
                "a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
    }
}