C# HTML Agility Pack - remove unwanted tags without removing content?

Question

Asked by Mathias Lykkegaard Lorenzen

I've seen a few related questions out here, but they don’t exactly talk about the same problem I am facing.

I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.

So for instance, in my scenario, I would like to preserve the tags "b", "i" and "u".

And for an input like:

<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

The resulting HTML should be:

my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes my content too. Any suggestions?

Answer 1

Accepted answer by Mathias Lykkegaard Lorenzen

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.

It removes all tags except strong, em, u, and raw text nodes.

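// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    var acceptableTags = new[] { "strong", "em", "u" };

    // SelectNodes returns null when nothing matches, so guard before building the queue.
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        // Note: children of an accepted tag are never queued, so unwanted tags
        // nested inside e.g. <strong> are left in place.
        if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
        {
            var childNodes = node.SelectNodes("./*|./text()");

            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    // Queue the child for inspection and move it up in front of the unwanted node.
                    nodes.Enqueue(child);
                    parentNode.InsertBefore(child, node);
                }
            }

            // The children now sit beside the node, so dropping it keeps the content.
            parentNode.RemoveChild(node);
        }
    }

    return document.DocumentNode.InnerHtml;
}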

Answer 2

Answered by Oded

Before removing a node, get its parent and the node's InnerText, then remove the node and re-assign the InnerText to the parent.

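// Assumes `doc` is the HtmlDocument that owns `node`.
var parent = node.ParentNode;
// Text of the node being removed; any markup nested inside it is lost.
var innerText = node.InnerText;
// Swap the node for a plain text node in place, so the text keeps its position.
parent.ReplaceChild(doc.CreateTextNode(innerText), node);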

Answer 3

Answered by Nathan Phillips

Try the following; you might find it a bit neater than the other proposed solutions:

public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
{
    HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
    if (nodes == null)
        return 0;
    foreach (HtmlNode node in nodes)
        node.RemoveButKeepChildren();
    return nodes.Count;
}

public static void RemoveButKeepChildren(this HtmlNode node)
{
    foreach (HtmlNode child in node.ChildNodes)
        node.ParentNode.InsertBefore(child, node);
    node.Remove();
}

public static bool TestYourSpecificExample()
{
    string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);
    document.DocumentNode.RemoveNodesButKeepChildren("//div");
    document.DocumentNode.RemoveNodesButKeepChildren("//p");
    return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
}
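
As a possible alternative to hand-written helpers like the ones above, HtmlAgilityPack's HtmlNode has a RemoveChild overload that takes a keepGrandChildren flag. The sketch below assumes that overload is available in the version you use. The class HtmlWhitelist and the method KeepOnly are hypothetical names made up for illustration, and the code is a minimal sketch rather than a tested implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelist
{
    // Hypothetical helper (not from any answer above): unwraps every element
    // whose tag name is not in allowedTags, keeping its children in place.
    public static string KeepOnly(string html, params string[] allowedTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Take a snapshot first, because the tree is modified inside the loop.
        var unwanted = document.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !allowed.Contains(n.Name))
            .ToList();

        foreach (var node in unwanted)
        {
            // Assumed overload: keepGrandChildren = true moves the node's
            // children up to its parent before the node itself is removed.
            node.ParentNode.RemoveChild(node, true);
        }

        return document.DocumentNode.InnerHtml;
    }
}
```

For the example in the question, the call would be HtmlWhitelist.KeepOnly(html, "b", "i", "u"). Because the whole tree is scanned, unwanted tags nested inside allowed ones are unwrapped as well.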

Answer 4

Answered by theyetiman

How to recursively remove a given list of unwanted HTML tags from an HTML string

I took @mathias's answer and improved his method, turning it into an extension method that takes the tags to remove as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it works properly on nested tags:

// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
// As an extension method, this must be declared inside a static class.
public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
{
    if (String.IsNullOrEmpty(html))
    {
        return html;
    }

    var document = new HtmlDocument();
    document.LoadHtml(html);

    // SelectNodes returns null when nothing matches.
    HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");

    if (tryGetNodes == null || !tryGetNodes.Any())
    {
        return html;
    }

    var nodes = new Queue<HtmlNode>(tryGetNodes);

    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        var childNodes = node.SelectNodes("./*|./text()");

        // Always queue the children, so tags nested inside wanted tags are visited too.
        if (childNodes != null)
        {
            foreach (var child in childNodes)
            {
                nodes.Enqueue(child);
            }
        }

        if (unwantedTags.Any(tag => tag == node.Name))
        {
            // Move the children up in front of the node, then drop the node itself.
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    parentNode.InsertBefore(child, node);
                }
            }

            parentNode.RemoveChild(node);
        }
    }

    return document.DocumentNode.InnerHtml;
}

Answer 5

Answered by Dilip0165

If you do not want to use HTML Agility Pack and still want to remove unwanted HTML tags, then you can do it as shown below.

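// Requires: using System; using System.Text.RegularExpressions; using System.Web;
// Note: this strips every tag and returns plain text; it does not keep selected tags.
public static string RemoveHtmlTags(string strHtml)
{
    // Remove anything that looks like a tag, including tags that span lines.
    string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);
    // Turn entities such as &amp; back into characters.
    strText = HttpUtility.HtmlDecode(strText);
    // Collapse runs of whitespace into a single space.
    strText = Regex.Replace(strText, @"\s+", " ");
    return strText;
}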
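
The regex above strips every tag, whereas the question asks to keep a few. A rough sketch of how the same regex idea might be adapted is to use a negative lookahead that skips an allow-list of tag names. The class TagStripper and the method StripTagsExcept are hypothetical names invented for this example. Entities are deliberately not decoded, since the result is still HTML. A regex is not an HTML parser, so this is only reasonable for simple, trusted markup (a ">" inside an attribute value, comments, or script blocks would confuse it).

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes every tag whose name is not in allowedTags
    // and leaves the text between the tags untouched.
    public static string StripTagsExcept(string html, params string[] allowedTags)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string pattern;
        if (allowedTags == null || allowedTags.Length == 0)
        {
            // Nothing to keep: match any tag.
            pattern = "<[^>]*>";
        }
        else
        {
            // Escape each name so it is matched literally inside the pattern.
            string names = string.Join("|", Array.ConvertAll(allowedTags, t => Regex.Escape(t)));

            // Match a tag unless "<" or "</" is followed by an allowed name
            // that ends right there (next char is whitespace, "/" or ">").
            pattern = @"<(?!/?(?:" + names + @")(?=[\s/>]))[^>]*>";
        }

        return Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Called as TagStripper.StripTagsExcept(html, "b", "i", "u") on the markup from the question, this should leave: my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>. The trailing lookahead is there so that allowing "b" does not accidentally keep <br> or <body>.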
