Using C# regular expressions to remove HTML tags

Question

Asked by

How do I use C# regular expression to replace/remove all HTML tags, including the angle brackets? Can someone please help me with the code?

Answer 1

Answered by Ryan Emerle

Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

Source

Answer 2

Answered by Daniel Brückner

As often stated before, you should not use regular expressions to process XML or HTML documents. They do not perform very well with HTML and XML documents, because there is no way to express nested structures in a general way.

You could use the following.

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This will work for most cases, but there will be cases (for example CDATA containing angle brackets) where this will not work as expected.
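To make the failure cases concrete, here is a minimal sketch that runs the pattern above against three inputs. The class and method names (NaiveStripDemo, Strip) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern itself comes from the answer above.
public static class NaiveStripDemo
{
    // Removes everything from a '<' up to the next '>'.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", String.Empty);
    }

    public static void Main()
    {
        // Ordinary markup is handled as expected.
        Console.WriteLine(Strip("<p>Hello <b>world</b></p>"));    // Hello world

        // A '>' inside a quoted attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        Console.WriteLine(Strip("<p title=\"a > b\">text</p>"));  //  b">text

        // A CDATA section containing angle brackets is cut at its first '>'.
        Console.WriteLine(Strip("<![CDATA[1 < 2 > 0]]>"));        //  0]]>
    }
}
```

The second and third calls show the kind of input where the result is not what most callers would want.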

Answer 3

Answered by JasonTrue

The correct answer is don't do that, use the HTML Agility Pack.

Edited to add:

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even the most imperfectly formed, capricious bits of HTML:

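// Requires the HtmlAgilityPack package and these namespaces:
// System.Linq, System.Text, System.Web, HtmlAgilityPack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
// SelectNodes returns null when nothing matches, so guard against that.
var nodes = doc.DocumentNode.SelectNodes("//body//text()");
var text = nodes == null ? Enumerable.Empty<string>() : nodes.Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
    output.AppendLine(line);
}
// Decode entities such as &amp; into their literal characters.
string textOnly = HttpUtility.HtmlDecode(output.ToString());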

There are very few defensible cases for using a regular expression for parsing HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part way there with a RegEx, but you'll need to do manual verifications.

Html Agility Pack can provide you a robust solution that will reduce the need to manually fix up the aberrations that can result from naively treating HTML as a context-free grammar.

A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.

Answer 4

Answered by Alan Moore

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:

remove the <!DOCTYPE declaration or <?xml prolog if they exist
remove all SGML comments
remove the entire HEAD element
remove all SCRIPT and STYLE elements
do Grabthar-knows-what with FORM and TABLE elements
remove the remaining tags
remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone

That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.

But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.
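As a rough illustration of how the quoted-string alternatives behave, the sketch below applies the regex above with Regex.Replace. The class name QuoteAwareStripDemo and the method Strip are hypothetical; only the pattern is taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class wrapping the regex from the answer above.
public static class QuoteAwareStripDemo
{
    static readonly Regex TagRegex =
        new Regex(@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    public static string Strip(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a double-quoted attribute value no longer ends the tag.
        Console.WriteLine(Strip("<p title=\"a > b\">text</p>"));  // text

        // The same holds for single-quoted values.
        Console.WriteLine(Strip("<a href='x>y'>link</a>"));       // link

        // The pattern requires a word character after '<' or '</',
        // so comments are left in place.
        Console.WriteLine(Strip("<!-- note -->kept"));            // <!-- note -->kept
    }
}
```

The last call shows one of the ways the regex "isn't perfect": comments, DOCTYPE declarations and CDATA markers would need separate handling.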

In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)

This regex isn't perfect, of course, but it's probably as good as you'll ever need.

Answer 5

Answered by Swaroop

Use this:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Answer 6

Answered by CountZero

I would like to echo Jason's response, though sometimes you need to naively parse some HTML and pull out the text content.

I needed to do this with some Html which had been created by a rich text editor, always fun and games.

In this case you may need to remove the content of some tags as well as just the tags themselves.

In my case, <xml> and <style> tags were thrown into this mix. Someone may find my (very slightly) less naive implementation a useful starting point.

    /// <summary>
    /// Removes all HTML tags from a string and leaves only plain text.
    /// Also removes the content of xml and style elements, since the aim is to get text content, not markup or metadata.
    /// </summary>
    /// <param name="input">The HTML string to strip.</param>
    /// <returns>The input with tags removed.</returns>
    public static string HtmlStrip(this string input)
    {
        input = Regex.Replace(input, "<style>(.|\n)*?</style>", string.Empty); // remove all <style></style> tags and anything in between
        input = Regex.Replace(input, @"<xml>(.|\n)*?</xml>", string.Empty); // remove all <xml></xml> tags and anything in between
        return Regex.Replace(input, @"<(.|\n)*?>", string.Empty); // remove any remaining tags but not their content: "<p>bob<span> johnson</span></p>" becomes "bob johnson"
    }

Answer 7

Answered by zzzzBov

@JasonTrue is correct that stripping HTML tags should not be done via regular expressions.

It's quite simple to strip HTML tags using HtmlAgilityPack:

public string StripTags(string input) {
    var doc = new HtmlDocument();
    doc.LoadHtml(input ?? "");
    return doc.DocumentNode.InnerText;
}

Answer 8

Answered by Owidat

Try the regular expression method at this URL: http://www.dotnetperls.com/remove-html-tags

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}

/// <summary>
/// Compiled regular expression for performance.
/// </summary>
static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

/// <summary>
/// Remove HTML from string with compiled Regex.
/// </summary>
public static string StripTagsRegexCompiled(string source)
{
    return _htmlRegex.Replace(source, string.Empty);
}
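
One thing that may be worth checking with the "<.*?>" pattern: by default "." does not match a newline in .NET, so a tag that is split across lines is not removed. The sketch below compares the default behavior with RegexOptions.Singleline; the class and method names are hypothetical and not from the linked article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing "<.*?>" with and without Singleline.
public static class DotNewlineDemo
{
    // The pattern as used above: '.' stops at a newline.
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern, but '.' is allowed to match '\n' as well.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<a\nhref='x'>link</a>";

        // The opening tag spans two lines, so only "</a>" is removed here.
        Console.WriteLine(StripDefault(html) == "<a\nhref='x'>link");  // True

        // With Singleline the whole opening tag is matched.
        Console.WriteLine(StripSingleline(html));                      // link
    }
}
```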

Answer 9

Answered by GRUNGER

Add .+? to <[^>]*> and try this regex (based on this):

<[^>].+?>

c# .net regex demo
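
Before relying on this variant, it may help to compare it with the simpler <[^>]*> on a few inputs. In the sketch below (class and method names are hypothetical), the extra .+? forces at least two characters between '<' and '>', which appears to swallow the text between one-letter tags such as <b> or <p>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing the two patterns side by side.
public static class LazyDotDemo
{
    public static string StripOriginal(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }

    public static string StripModified(string html)
    {
        return Regex.Replace(html, @"<[^>].+?>", string.Empty);
    }

    public static void Main()
    {
        // Longer tag names behave the same with both patterns.
        Console.WriteLine(StripOriginal("<div class='a'>hi</div>"));  // hi
        Console.WriteLine(StripModified("<div class='a'>hi</div>"));  // hi

        // With a one-letter tag, ".+?" has to consume the closing '>' and
        // keeps going to the next one, so the text is removed as well.
        Console.WriteLine(StripOriginal("<b>x</b>"));                 // x
        Console.WriteLine(StripModified("<b>x</b>").Length);          // 0
    }
}
```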

Answer 10

Answered by AnisNoorAli

Use this method to remove tags:

public string From_To(string text, string from, string to)
{
    if (text == null)
        return null;
    // from and to are used as regex fragments (for example "<" and ">");
    // wrap them in Regex.Escape if they are meant as literal text.
    string pattern = from + ".*?" + to;
    Regex rx = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    MatchCollection matches = rx.Matches(text);
    // Remove every non-empty match from the text.
    return matches.Count <= 0 ? text : matches.Cast<Match>().Where(match => !string.IsNullOrEmpty(match.Value)).Aggregate(text, (current, match) => current.Replace(match.Value, ""));
}
