Remove HTML tags from string including &nbsp; in C#
Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/19523913/
Asked by rampuriyaaa
How can I remove all HTML tags, including &nbsp;, using a regex in C#? My string looks like this:
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
Accepted answer by Ravi Thapliyal
If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
Ideally, you should make another pass with a regex that collapses runs of whitespace:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
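The two passes above only handle the &nbsp; entity. As a minimal sketch (not part of the original answer), the same idea can be extended to other entities by decoding them with WebUtility.HtmlDecode after the tags are stripped. The class and method names here are hypothetical, and replacing tags with a space instead of an empty string is an assumption made so that text from adjacent elements stays separated.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class PlainTextSketch
{
    // Anything that looks like a tag: '<', one or more non-'>' characters, then '>'.
    private static readonly Regex Tags = new Regex(@"<[^>]+>", RegexOptions.Compiled);

    // A run of whitespace; in .NET, \s also matches U+00A0 (the decoded &nbsp;).
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    public static string ToPlainText(string html)
    {
        // Replace tags with a space so text from adjacent elements does not run together.
        string noTags = Tags.Replace(html, " ");
        // Decode entities only after the tags are gone, so "&lt;b&gt;" stays literal text.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse whitespace runs (including non-breaking spaces) and trim the ends.
        return Whitespace.Replace(decoded, " ").Trim();
    }
}
```

For an input shaped like the one in the question, PlainTextSketch.ToPlainText("<div>hello</div><div><br></div><div>&nbsp;</div>") returns "hello". Like any regex-based approach, this does not understand comments, scripts, or a '>' inside an attribute value.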
Answer by Jonesopolis
This:
(<.+?>|&nbsp;)
will match any tag or &nbsp;
string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();
then x = hello
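One difference between <.+?> and the <[^>]+> pattern used in other answers is worth knowing: in .NET, '.' does not match '\n' unless RegexOptions.Singleline is set, so a tag that spans a line break is missed. The following is a small sketch to illustrate that; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison helpers; not part of the original answer.
public static class TagPatternSketch
{
    // '.' does not match '\n' by default, so a tag that spans lines is not removed.
    public static string StripWithDot(string html) =>
        Regex.Replace(html, @"<.+?>", "");

    // RegexOptions.Singleline lets '.' match '\n' as well.
    public static string StripWithDotSingleline(string html) =>
        Regex.Replace(html, @"<.+?>", "", RegexOptions.Singleline);

    // A negated class matches any character except '>', newlines included.
    public static string StripWithNegatedClass(string html) =>
        Regex.Replace(html, @"<[^>]+>", "");
}
```

With the input "<a\nhref=\"x\">link</a>", StripWithDot leaves the opening tag in place and removes only </a>, while the other two methods return "link".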
Answer by David S.
I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility

// The original answer shows only the members; HtmlCleaner is a placeholder class name.
public static class HtmlCleaner
{
    private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

    // Add characters that should not be removed to this regex.
    private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\?=|%!() -]", RegexOptions.Compiled);

    public static String UnHtml(String html)
    {
        // Decode URL escapes such as %20; note that this also turns '+' into a space.
        html = HttpUtility.UrlDecode(html);
        html = HttpUtility.HtmlDecode(html);

        // Drop comments, scripts and styles together with their contents.
        html = RemoveTag(html, "<!--", "-->");
        html = RemoveTag(html, "<script", "</script>");
        html = RemoveTag(html, "<style", "</style>");

        // Replace matches of these regexes with a space.
        html = _tags_.Replace(html, " ");
        html = _notOkCharacter_.Replace(html, " ");
        html = SingleSpacedTrim(html);

        return html;
    }

    // Removes every startTag...endTag span; stops at the first span that has no end tag.
    private static String RemoveTag(String html, String startTag, String endTag)
    {
        Boolean bAgain;
        do
        {
            bAgain = false;
            Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
            if (startTagPos < 0)
                continue;
            Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
            if (endTagPos <= startTagPos)
                continue;
            html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
            bAgain = true;
        } while (bAgain);
        return html;
    }

    // Collapses each run of blanks (CR, LF, tab, space) into one space and trims the ends.
    private static String SingleSpacedTrim(String inString)
    {
        StringBuilder sb = new StringBuilder();
        Boolean inBlanks = false;
        foreach (Char c in inString)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                case ' ':
                    if (!inBlanks)
                    {
                        inBlanks = true;
                        sb.Append(' ');
                    }
                    continue;
                default:
                    inBlanks = false;
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString().Trim();
    }
}
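The RemoveTag helper above deletes comments, scripts and styles with an IndexOf loop. As a rough sketch of an alternative (an assumption, not the author's method), the same three removals can be expressed as a single regex. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical regex-based variant of the three RemoveTag calls.
public static class BlockRemovalSketch
{
    // Comments, <script> blocks and <style> blocks, including their contents.
    // Singleline lets '.' cross line breaks; IgnoreCase covers <SCRIPT> and similar.
    private static readonly Regex Blocks = new Regex(
        @"<!--.*?-->|<script\b[^>]*>.*?</script\s*>|<style\b[^>]*>.*?</style\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string RemoveBlocks(string html)
    {
        return Blocks.Replace(html, string.Empty);
    }
}
```

As with RemoveTag, a block that is never closed is left in place. The \b after the tag name keeps the pattern from treating a tag such as <scripted> as the start of a script block.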
Answer by MRP
var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
Answer by Don Rolling
I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.
public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
Answer by Ehsan88
Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer
Answer by Ananth Ram
(<([^>]+)>|&nbsp;)
You can test it here: https://regex101.com/r/kB0rQ4/1
Answer by nivs1978
HTML is, in its basic form, just XML. You could parse your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also deals with special characters like &lt; all in one go.
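A minimal sketch of the XmlDocument approach follows, with one important caveat: it only works when the markup is well-formed XML. An unclosed <br> or an entity such as &nbsp; (which XML does not define) makes LoadXml throw an XmlException, so the string in the question would be rejected as written. The class name, method name and the "root" wrapper element are assumptions for illustration.

```csharp
using System.Xml;

// Hypothetical helper; not part of the original answer.
public static class XmlTextSketch
{
    // Returns the concatenated text of a fragment, or null when it is not well-formed XML.
    public static string TryExtractText(string fragment)
    {
        var doc = new XmlDocument();
        try
        {
            // Wrap in a single root element because an XML document needs exactly one.
            doc.LoadXml("<root>" + fragment + "</root>");
        }
        catch (XmlException)
        {
            return null;
        }
        // InnerText concatenates all text nodes and decodes entities such as &lt;.
        return doc.DocumentElement.InnerText;
    }
}
```

For example, TryExtractText("<div>hello &lt;world&gt;</div><div><br/></div>") returns "hello <world>", while TryExtractText("<div>hello</div><div><br></div>") returns null because <br> is never closed.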
Answer by Sabique A Khan
I used @RaviThapliyal's and @Don Rolling's code with a small modification. That code replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.
public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    // Trim again so a space produced from a trailing &nbsp; is not left at the end.
    return step3.Trim();
}
I wrote &nbsp without the semicolon in the text above because Stack Overflow was rendering the entity.