Original question: http://stackoverflow.com/questions/1038431/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to clean HTML tags using C#
Asked by guaike
For example:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>title</title>
</head>
<body>
<a href="aaa.asp?id=1"> I want to get this text </a>
<div>
<h1>this is my want!!</h1>
<b>this is my want!!!</b>
</div>
</body>
</html>
and the result is:
I want to get this text
this is my want!!
this is my want!!!
Accepted answer by Marc Gravell
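// Requires the Html Agility Pack (using HtmlAgilityPack;); html holds the page source.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectSingleNode returns null when the document has no <body> element.
string s = doc.DocumentNode.SelectSingleNode("//body").InnerText;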
Answered by Ólafur Waage
I would recommend using something like HTMLTidy.
Here's a tutorial on it to get you started.
Answered by rahul
Why do you want to do this on the server side?
For that you have to give the container element runat="server" and then take the innerText of the element.
You can do the same in JavaScript without giving the element runat="server".
Answered by Andrew Marsh
If you just want to remove the html tags then use a regular expression that deletes anything between "<" and ">".
Answered by diegodsp
Use this function...
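// Requires: using System.Text.RegularExpressions;
public string Strip(string text)
{
    // Lazily match from each '<' to the next '>' (including across newlines) and delete it.
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}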
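A regex like this removes the tags but leaves the surrounding whitespace and any HTML entities (such as &amp;) in place. As a minimal sketch, assuming the input is a simple fragment like the body of the sample in the question, the helper below strips the tags, decodes entities with WebUtility.HtmlDecode, and returns the non-empty lines. The names HtmlText and ToPlainLines are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: tag stripping plus light cleanup of what is left.
    public static string[] ToPlainLines(string html)
    {
        // Same idea as Strip: drop everything from '<' to the next '>'.
        string noTags = Regex.Replace(html, @"<[^>]*>", string.Empty);

        // Decode entities only after the tags are gone, so that text such as
        // "&lt;b&gt;" is not mistaken for a real tag.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Split into lines, trim each one (this also drops '\r'), skip empty lines.
        return decoded
            .Split('\n')
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToArray();
    }
}
```

Called on the contents of the body element from the question, this should return the three lines of text the asker wanted. It is still a regex approach, so it shares the usual limits: comments, script blocks, and a '>' inside an attribute value are not handled.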
Answered by James Lawruk
You can start with the simple function below. Disclaimer: this code is suitable for basic HTML, but it will not handle all valid HTML situations and edge cases. Tags within quotes are one example. The advantage of this code is that you can easily follow the execution in a debugger, and it can be easily modified to fit the edge cases specific to you.
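// Requires: using System.Text;
public static string RemoveTags(string html)
{
    // StringBuilder avoids allocating a new string for every appended character.
    var result = new StringBuilder(html.Length);
    bool insideTag = false;
    foreach (char c in html)
    {
        if (c == '<')
            insideTag = true;     // a tag starts here
        if (!insideTag)
            result.Append(c);     // keep only characters outside tags
        if (c == '>')
            insideTag = false;    // the tag ends here
    }
    return result.ToString();
}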
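The quoted-text edge case mentioned in the disclaimer shows up with input such as <a title="a > b">link</a>: RemoveTags treats the '>' inside the attribute value as the end of the tag and leaks part of the attribute into the output. One possible modification, sketched here under the assumption that attribute values are delimited by matching single or double quotes, is to track the open quote while inside a tag. The names TagStripper and RemoveTagsQuoteAware are hypothetical.

```csharp
using System.Text;

public static class TagStripper
{
    // Hypothetical variant of RemoveTags that ignores '>' inside quoted attribute values.
    public static string RemoveTagsQuoteAware(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;
        char quote = '\0'; // quote character currently open inside a tag, or '\0' if none

        foreach (char c in html)
        {
            if (!insideTag)
            {
                if (c == '<')
                    insideTag = true;   // a tag starts here
                else
                    result.Append(c);   // plain text outside any tag
            }
            else if (quote != '\0')
            {
                // Inside a quoted attribute value: only the matching quote ends it.
                if (c == quote)
                    quote = '\0';
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;              // an attribute value opens
            }
            else if (c == '>')
            {
                insideTag = false;      // the tag ends here
            }
        }

        return result.ToString();
    }
}
```

This keeps the character-by-character structure, so it remains easy to step through in a debugger. It is still not a full HTML parser: comments, script blocks, and malformed tags with unbalanced quotes would need further handling or a real parser.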