C#: How can I strip HTML tags from a string in ASP.NET?

Question

Asked by daniel

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

Example:

<ul><li>Hello</li></ul>

Output:

"Hello"

I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

Answer 1

Accepted answer by Tomalak

If it is just stripping all HTML tags from a string, this works ~~reliably~~ with regex as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.

Note:

There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: the result will never contain anything that could be used to do cross-site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex: use a proper parser if you must get it right under all circumstances.
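
The steps described in this answer can be put together roughly as follows. This is a minimal sketch, not the answerer's own code: the class name `HtmlTagStripper`, the method name `StripTags`, and the `decodeEntities` flag are assumptions made for illustration. The tag pattern is the one quoted above; the whitespace pattern is written as `\s+`, which matches the same characters as `[\s\r\n]+`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTagStripper
{
    // Hypothetical helper name and signature, chosen for this sketch.
    public static string StripTags(string html, bool decodeEntities = false)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Remove anything that looks like a tag, including an
        //    unterminated tag at the very end of the input.
        string text = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);

        // 2. Collapse runs of whitespace into one space and trim the ends.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // 3. Optionally turn entities such as &amp; back into characters.
        //    After decoding, the text may contain '<' again, so it must be
        //    HTML-encoded before being written into a page.
        return decodeEntities ? WebUtility.HtmlDecode(text) : text;
    }
}
```

Called as `HtmlTagStripper.StripTags("<ul><li>Hello</li></ul>")`, the sketch returns `Hello`. The limitation from the note still applies: for `<a title="x>y">link</a>` the first `>` ends the match early and the result is `y">link`.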

Answer 2

Answered by user95144

Regex.Replace(htmlText, "<.*?>", string.Empty);

Answer 3

Answered by Andrei Rînea

I have written a pretty fast method in C# which beats the hell out of the Regex. It is hosted in an article on CodeProject.

Its advantages, besides better performance, include the ability to replace named and numbered HTML entities (such as &amp; and &#203;), to replace comment blocks, and more.

Please read the related article on CodeProject.

Thank you.

Answer 4

Answered by Andrei Rînea

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);
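
The `(.|\n)` in this pattern matters because, by default, `.` in a .NET regex does not match a line feed, so the shorter pattern `<.*?>` misses tags that span more than one line. The sketch below compares the two patterns and the equivalent `RegexOptions.Singleline` form; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // '.' does not match '\n' by default, so a tag broken across lines survives.
    public static string StripDotOnly(string html) =>
        Regex.Replace(html, "<.*?>", string.Empty);

    // (.|\n) lets the match cross line breaks.
    public static string StripAlternation(string html) =>
        Regex.Replace(html, @"<(.|\n)*?>", string.Empty);

    // RegexOptions.Singleline makes '.' match '\n' as well.
    public static string StripSingleline(string html) =>
        Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);
}
```

For the input `"<a\nhref='x'>Hi</a>"`, `StripDotOnly` removes only the closing tag and leaves the opening tag in place, while the other two return `Hi`.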

Answer 5

Answered by Serapth

Go download HTMLAgilityPack, now! ;) Download link

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of all nodes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .NET libraries out there.

Here is a sample:

// resultsStream is an open Stream holding the HTML; it is not defined in this snippet.
string htmlContents = new System.IO.StreamReader(resultsStream, Encoding.UTF8, true).ReadToEnd();

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);

// Concatenate the text of each top-level node; InnerText includes the text of all descendants.
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
    output += node.InnerText;
}
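
Two details are easy to miss with this approach: `InnerText` also returns the contents of `<script>` and `<style>` elements, and it leaves entities such as `&amp;` encoded. The following is a minimal sketch of one way to handle both, assuming the HtmlAgilityPack NuGet package is referenced. The class name `AgilityPackText` and the method name `ToPlainText` are hypothetical and not part of the library.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

public static class AgilityPackText
{
    // Hypothetical helper; not from the answer and not part of HtmlAgilityPack.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Collect <script> and <style> elements into a list first, so that
        // removing them does not disturb the enumeration of the tree.
        var hidden = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in hidden)
        {
            node.Remove();
        }

        // InnerText keeps entities encoded; DeEntitize converts them to characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Because a real parser does the work, a `>` inside an attribute value does not cut a tag short the way it does with the regex patterns in the other answers.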

Answer 6

Answered by Michael Tipton

I've posted this on the ASP.NET forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:

System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

Answer 7

Answered by meramez

protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\n)*?>", string.Empty);
}    

Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

Answer 8

Answered by Bucket

For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on valid HTML that is not well-formed XML (an unclosed <br>, for example), so always add a catch with a regex fallback. Note that this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.

// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions; using System.Xml;
public static string RemoveHTMLTags(string content)
{
    string cleaned;
    try
    {
        StringBuilder textOnly = new StringBuilder();
        // Wrap the content in a single root element so that fragments parse as XML.
        using (var reader = XmlReader.Create(new StringReader("<xml>" + content + "</xml>")))
        {
            while (reader.Read())
            {
                // Keep text nodes only; elements and their attributes are skipped.
                if (reader.NodeType == XmlNodeType.Text)
                    textOnly.Append(reader.ReadContentAsString());
            }
        }
        cleaned = textOnly.ToString();
    }
    catch (XmlException)
    {
        // The markup is not well-formed XML (a tag is probably not closed). Fall back to regex cleaning.
        Regex tagRemove = new Regex(@"<[^>]*(>|$)");
        Regex compressSpaces = new Regex(@"[\s\r\n]+");
        string textOnly = tagRemove.Replace(content, string.Empty);
        cleaned = compressSpaces.Replace(textOnly, " ");
    }

    return cleaned;
}

Answer 9

Answered by Annie

For those who are complaining about Michael Tipton's solution not working, here is the .NET 4+ way of doing it:

// Requires: using System.IO; using System.Xml; using System.Xml.XPath;
// As an extension method, this must be declared inside a static class.
public static string StripTags(this string markup)
{
    try
    {
        StringReader sr = new StringReader(markup);
        XPathDocument doc;
        using (XmlReader xr = XmlReader.Create(sr,
                           new XmlReaderSettings()
                           {
                               ConformanceLevel = ConformanceLevel.Fragment
                               // for multiple roots
                           }))
        {
            doc = new XPathDocument(xr);
        }

        return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
                                            // XmlDocument or JavaScript's innerText
    }
    catch
    {
        // Null input or markup that is not well-formed XML yields an empty string.
        return string.Empty;
    }
}

Answer 10

Answered by user3638478

Simply use string.StripHTML();
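
System.String has no built-in StripHTML() method, so a call like this presumably relies on an extension method defined somewhere in the project. The sketch below shows what such a method might look like. Its name comes from the answer, but the body is an assumption: a simple character scan that avoids regex, as the question asks, and that shares the attribute-value limitation noted in the accepted answer.

```csharp
using System.Text;

public static class StringHtmlExtensions
{
    // Hypothetical extension method; the answer does not show its implementation.
    public static string StripHTML(this string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        var sb = new StringBuilder(input.Length);
        bool insideTag = false;

        foreach (char c in input)
        {
            if (c == '<')
            {
                insideTag = true;        // a tag (or something that looks like one) starts
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;       // the tag ends; nothing is emitted for it
            }
            else if (!insideTag)
            {
                sb.Append(c);            // ordinary text outside any tag
            }
        }

        return sb.ToString();
    }
}
```

With this class in scope, `"<ul><li>Hello</li></ul>".StripHTML()` returns `Hello`. Whitespace is left untouched and entities are not decoded.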
