Remove HTML tags from a string, including &nbsp;, in C#

Question

Asked by rampuriyaaa

How can I remove all HTML tags, including &nbsp;, using a regex in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

Answer 1

Accepted answer by Ravi Thapliyal

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, make a second pass with a regex that collapses runs of whitespace into a single space:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

Answer 2

Answered by Jonesopolis

This pattern:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

With the string from the question, x is then "hello".
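One difference between this pattern and the `<[^>]+>` pattern from the accepted answer is worth seeing in code: by default `.` does not match a newline in .NET, so `<.+?>` skips a tag that spans several lines, while `[^>]` does not have that limit. The sketch below is only an illustration; the class and method names are made up for this example and are not from any answer here.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to compare the two tag patterns.
public static class TagPatternDemo
{
    // Pattern from this answer: '.' stops at '\n' unless Singleline is set.
    public static string StripWithLazyDot(string html)
    {
        return Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();
    }

    // Same pattern, but RegexOptions.Singleline lets '.' match '\n' too.
    public static string StripWithSingleline(string html)
    {
        return Regex.Replace(html, @"(<.+?>|&nbsp;)", "", RegexOptions.Singleline).Trim();
    }

    // Pattern from the accepted answer: [^>] matches newlines as well.
    public static string StripWithNegatedClass(string html)
    {
        return Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();
    }
}
```

For an input such as "<div\nclass=\"x\">hello</div>", the first method leaves the opening tag in place, while the other two return "hello". For single-line tags like those in the question, all three behave the same.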

Answer 3

Answered by David S.

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

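        // Requires: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;
        // Matches any tag-like span such as <div> or </div>.
        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        // Any character NOT listed in this class is replaced with a space; add characters that should be kept.
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\?=|%!() -]", RegexOptions.Compiled);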

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

Answer 4

Answered by MRP

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

Answer 5

Answered by Don Rolling

I took @Ravi Thapliyal's code and turned it into a method. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}

Answer 6

Answered by Ehsan88

Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer

Answer 7

Answered by Ananth Ram

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

Answer 8

Answered by nivs1978

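Well-formed HTML is, in its basic form, just XML. You could parse your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also deals with special characters like &lt; in one go. It only works when the markup is well-formed XML, though: an unclosed tag such as <br>, or an entity that XML does not define such as &nbsp;, makes the parser throw.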
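A minimal sketch of the XmlDocument approach might look like the following. The class name, method name, and the wrapping `<root>` element are assumptions made for this illustration; only `XmlDocument.LoadXml`, `DocumentElement`, and `InnerText` come from the framework.

```csharp
using System.Xml;

// Hypothetical helper illustrating the XmlDocument idea from this answer.
public static class XmlTextExtractor
{
    // Wraps the fragment in a single root element (an assumption of this
    // sketch, since XML needs exactly one root) and returns its text content.
    // Throws XmlException if the fragment is not well-formed XML.
    public static string ExtractText(string xhtmlFragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");
        return doc.DocumentElement?.InnerText ?? string.Empty;
    }
}
```

Calling `XmlTextExtractor.ExtractText("<div>hello</div><div>a &lt; b</div>")` returns the concatenated text with `&lt;` decoded. The string from the question would throw as written, because it contains `<br>` without a closing tag and the `&nbsp;` entity.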

Answer 9

Answered by Sabique A Khan

I used @RaviThapliyal's and @Don Rolling's code with one small modification. Their code replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.

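public static string FormatString(string value) {
    // Step 1: remove anything that looks like a tag.
    var step1 = Regex.Replace(value, @"<[^>]+>", "");
    // Step 2: turn each non-breaking space entity into a normal space.
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    // Step 3: collapse runs of whitespace into a single space.
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    // Trim last, so spaces produced by step 2 at either end are removed too.
    return step3.Trim();
}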

I wrote &nbsp without the semicolon in the text above because Stack Overflow was rendering the entity.
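The same idea (tags out, entities to real characters, whitespace collapsed) can be sketched without listing entities one by one. This variant is not from any answer above; it swaps in `System.Net.WebUtility.HtmlDecode` to decode entities, and relies on the fact that both `\s` in .NET regexes and `string.Trim()` treat the decoded non-breaking space (U+00A0) as whitespace. The class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; a sketch, not a full HTML-to-text converter.
public static class HtmlText
{
    public static string Strip(string html)
    {
        // Replace tags with a space so words in adjacent elements stay separated.
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");
        // Decode entities such as &nbsp; and &amp; into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse any run of whitespace (including U+00A0) and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string is a design choice: it keeps "a" and "b" apart in `<div>a</div><div>b</div>`, at the cost of also splitting text around inline tags such as `<b>`.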
