C# HTML Agility Pack - removing unwanted tags without removing content?
Original source: http://stackoverflow.com/questions/12787449/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
HTML agility pack - removing unwanted tags without removing content?
Asked by Mathias Lykkegaard Lorenzen
I've seen a few related questions out here, but they don’t exactly talk about the same problem I am facing.
I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.
So for instance, in my scenario, I would like to preserve the tags "b", "i" and "u".
And for an input like:
    <p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>
The resulting HTML should be:
    my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>
I tried using HtmlNode's Remove method, but it removes my content too. Any suggestions?
Accepted answer by Mathias Lykkegaard Lorenzen
I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.
It removes all tags except strong, em, u, and raw text nodes.
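    // Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    internal static string RemoveUnwantedTags(string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(data);

        var acceptableTags = new String[] { "strong", "em", "u" };

        // Start with the top-level elements and text nodes.
        // SelectNodes returns null when nothing matches (e.g. comment-only input).
        var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
        if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

        var nodes = new Queue<HtmlNode>(topLevelNodes);
        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
            {
                // Move the children up in front of the unwanted node and
                // queue them so that they are checked as well.
                var childNodes = node.SelectNodes("./*|./text()");
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        nodes.Enqueue(child);
                        parentNode.InsertBefore(child, node);
                    }
                }

                // Remove the unwanted element itself.
                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }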
Answer by Oded
Before removing a node, get its parent and its InnerText, then remove the node and re-assign the InnerText to the parent.
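    var parent = node.ParentNode;
    // InnerText returns the node's text with all nested markup stripped.
    var innerText = node.InnerText;
    node.Remove();
    // The text is appended at the end of the parent, not at the node's old position.
    parent.AppendChild(doc.CreateTextNode(innerText));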
Answer by Nathan Phillips
Try the following; you might find it a bit neater than the other proposed solutions:
    // Requires: using HtmlAgilityPack;
    // Extension methods must be declared inside a static class.
    public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
    {
        HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
        if (nodes == null)
            return 0; // SelectNodes returns null when nothing matches
        foreach (HtmlNode node in nodes)
            node.RemoveButKeepChildren();
        return nodes.Count;
    }

    public static void RemoveButKeepChildren(this HtmlNode node)
    {
        // Move each child in front of the node, then remove the node itself.
        foreach (HtmlNode child in node.ChildNodes)
            node.ParentNode.InsertBefore(child, node);
        node.Remove();
    }

    public static bool TestYourSpecificExample()
    {
        string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        document.DocumentNode.RemoveNodesButKeepChildren("//div");
        document.DocumentNode.RemoveNodesButKeepChildren("//p");
        return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
    }
Answer by theyetiman
How to recursively remove a given list of unwanted HTML tags from an HTML string
I took @mathias's answer and improved his extension method so that you can supply the list of tags to remove as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it recurses properly:
    // Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    // Extension methods must be declared inside a static class.
    public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
    {
        if (String.IsNullOrEmpty(html))
        {
            return html;
        }

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");
        if (tryGetNodes == null || !tryGetNodes.Any())
        {
            return html;
        }

        var nodes = new Queue<HtmlNode>(tryGetNodes);
        while (nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;

            // Always queue the children, so that tags nested inside
            // wanted tags are checked too.
            var childNodes = node.SelectNodes("./*|./text()");
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    nodes.Enqueue(child);
                }
            }

            if (unwantedTags.Any(tag => tag == node.Name))
            {
                // Move the children up in front of the unwanted node,
                // then remove the node itself.
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        parentNode.InsertBefore(child, node);
                    }
                }
                parentNode.RemoveChild(node);
            }
        }

        return document.DocumentNode.InnerHtml;
    }
Answer by Dilip0165
If you do not want to use the HTML Agility Pack and still want to remove unwanted HTML tags, you can do it as shown below.
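    // Requires: using System; using System.Text.RegularExpressions; using System.Web;
    public static string RemoveHtmlTags(string strHtml)
    {
        // Strip every tag; note that this keeps none of them (not even <b>, <i> or <u>).
        string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);
        // Decode HTML entities such as &amp;.
        strText = HttpUtility.HtmlDecode(strText);
        // Collapse runs of whitespace into a single space.
        strText = Regex.Replace(strText, @"\s+", " ");
        return strText;
    }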
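The regex above strips every tag, so it does not keep b, i and u as the question asks. As a rough sketch only (not part of the answer above), the same idea could be limited to an allow-list. The names TagStripper and StripTagsExcept are made up for illustration, and the pattern is an assumption that only copes with simple tags: it leaves comments alone, can be confused by a > inside an attribute value, and keeps allowed tags together with all their attributes, so it should not be treated as a sanitizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any of the answers.
public static class TagStripper
{
    // Matches an opening or closing tag and captures the tag name in group 1.
    private static readonly Regex TagPattern =
        new Regex(@"</?\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", RegexOptions.Compiled);

    // Removes every tag whose name is not in allowedTags; text content is left untouched.
    public static string StripTagsExcept(string html, ISet<string> allowedTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        return TagPattern.Replace(html, match =>
            allowedTags.Contains(match.Groups[1].Value) ? match.Value : string.Empty);
    }
}

public static class Demo
{
    public static void Main()
    {
        // A case-insensitive set, so that <B> and <b> are treated alike.
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "u" };
        string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";

        // Prints: my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>
        Console.WriteLine(TagStripper.StripTagsExcept(html, allowed));
    }
}
```

For malformed or untrusted markup, a real parser such as the HTML Agility Pack is likely to be the more robust choice.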

