C# Html Agility Pack: get all elements by class

Notice: this page is a translated mirror of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/13771083/

Time: 2020-08-10 09:45:48  Source: igfitidea

Html Agility Pack get all elements by class

Tags: c#, html, html-agility-pack

Asked by Adam

I am taking a stab at html agility pack and having trouble finding the right way to go about this.

For example:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class"));

However, you can obviously add classes to a lot more than divs, so I tried this:

var allLinksWithDivAndClass = _doc.DocumentNode.SelectNodes("//*[@class=\"float\"]");

But that doesn't handle the cases where you add multiple classes and "float" is just one of them, like this:

class="className float anotherclassName"

Is there a way to handle all of this? I basically want to select all nodes that have a class attribute whose value contains float.

The answer has been documented on my blog with a full explanation at: Html Agility Pack Get All Elements by Class

Accepted answer by Dai

(Updated 2018-03-17)

The problem:

The problem, as you've spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect).

The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. A word-boundary is either the start (or end) of a string (or line), whitespace, certain punctuation, etc. In most regular-expression dialects this is \b. So the regex you want is simply: \bfloat\b.
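To make the difference concrete, here is a minimal sketch comparing a plain substring check with the \bfloat\b regex. It is written as C# 9+ top-level statements, and the helper names (SubstringMatch, WordBoundaryMatch, TokenMatch) are illustrative only, not part of any library. One caveat worth keeping in mind: \b also treats a hyphen as a boundary, so a class such as "float-left" still matches \bfloat\b. The third helper is one possible stricter variant, which requires the name to be delimited by whitespace or the ends of the string.

```csharp
using System;
using System.Text.RegularExpressions;

// Plain substring search: also hits "unfloating".
bool SubstringMatch(string classAttr) => classAttr.Contains("float");

// Regex word boundaries: rejects "unfloating", but '-' still counts as a boundary.
bool WordBoundaryMatch(string classAttr) => Regex.IsMatch(classAttr, @"\bfloat\b");

// Hypothetical stricter variant: "float" must not touch any non-whitespace character.
bool TokenMatch(string classAttr) => Regex.IsMatch(classAttr, @"(?<!\S)float(?!\S)");

foreach (string sample in new[] { "foo float bar", "unfloating", "float-left" })
{
    Console.WriteLine($"{sample,-15} substring={SubstringMatch(sample)} " +
                      $"wordBoundary={WordBoundaryMatch(sample)} token={TokenMatch(sample)}");
}
```

Only "foo float bar" passes all three checks; "unfloating" passes only the substring check, and "float-left" passes the substring and \b checks but not the whitespace-delimited one.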

A downside to using Regex instances is that they can be slow to run if you don't use the .Compiled option, and they can be slow to compile. So you should cache the regex instance. This is more difficult if the class-name you're looking for changes at runtime.
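When the class-name is only known at runtime, one way to keep the compilation cost down is to cache one compiled Regex per distinct class-name. The sketch below assumes a simple process-wide cache; regexCache and GetClassRegex are hypothetical names introduced for illustration, and the cache is unbounded, so it only suits a small, fixed set of class-names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical cache: one compiled Regex per distinct class-name.
var regexCache = new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);

// Returns the cached regex for a class-name, compiling it on first use.
Regex GetClassRegex(string className) =>
    regexCache.GetOrAdd(
        className,
        name => new Regex(@"\b" + Regex.Escape(name) + @"\b", RegexOptions.Compiled));

Regex first = GetClassRegex("float");
Regex second = GetClassRegex("float");

// The second lookup reuses the instance created by the first one.
Console.WriteLine(ReferenceEquals(first, second));
Console.WriteLine(first.IsMatch("foo float bar"));
```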

Alternatively you can search a string for words by word-boundaries without using a regex by implementing the regex as a C# string-processing function, being careful not to cause any new string or other object allocation (e.g. not using String.Split).

Approach 1: Using a regular-expression:

Suppose you just want to look for elements with a single, design-time specified class-name:

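using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program {

    // Compiled once and reused for every call.
    private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

    private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            // Drop the Name check to match elements of any type, not only <div>.
            .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
}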

If you need to choose a single class-name at runtime then you can build a regex:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    // Verbatim strings keep \b as a regex word-boundary; in a regular string "\b" is a backspace character.
    Regex regex = new Regex( @"\b" + Regex.Escape( className ) + @"\b", RegexOptions.Compiled );

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}

If you have multiple class-names and you want to match all of them, you could create an array of Regex objects and ensure they're all matching, or combine them into a single Regex using lookarounds, but this results in horrendously complicated expressions, so using a Regex[] is probably better:

using System.Linq;
using System.Text.RegularExpressions;

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {

    // One compiled regex per required class-name.
    Regex[] exprs = new Regex[ classNames.Length ];
    for( Int32 i = 0; i < exprs.Length; i++ ) {
        // Verbatim strings keep \b as a regex word-boundary; in a regular string "\b" is a backspace character.
        exprs[i] = new Regex( @"\b" + Regex.Escape( classNames[i] ) + @"\b", RegexOptions.Compiled );
    }

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            exprs.All( r =>
                r.IsMatch( e.GetAttributeValue("class", "") )
            )
        );
}
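For comparison, the single-regex alternative mentioned above can be sketched with one lookahead per required class-name. This is only an illustration of why the expression gets harder to read as the list grows; BuildAllClassesRegex is a hypothetical helper name, and the code is written as C# 9+ top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Builds e.g. ^(?=.*\bfloat\b)(?=.*\bclearfix\b): one lookahead per required class-name.
Regex BuildAllClassesRegex(params string[] classNames)
{
    string lookaheads = string.Concat(
        classNames.Select(c => @"(?=.*\b" + Regex.Escape(c) + @"\b)"));

    // Singleline lets '.' cross line breaks inside a multi-line class attribute.
    return new Regex("^" + lookaheads, RegexOptions.Compiled | RegexOptions.Singleline);
}

Regex both = BuildAllClassesRegex("float", "clearfix");

Console.WriteLine(both.IsMatch("clearfix wide float")); // both names present, in any order
Console.WriteLine(both.IsMatch("float wide"));          // "clearfix" is missing
```

Each lookahead scans the attribute value again from the start, so this form is not obviously cheaper than checking a Regex[] with All.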

Approach 2: Using non-regex string matching:

The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex may be faster in some circumstances, so always profile your code first, kids!)

The method below, CheapClassListContains, provides a fast word-boundary-checking string matching function that can be used the same way as regex.IsMatch:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            CheapClassListContains(
                e.GetAttributeValue("class", ""),
                className,
                StringComparison.Ordinal
            )
        );
}

/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
    if( String.Equals( haystack, needle, comparison ) ) return true;
    Int32 idx = 0;
    while( idx + needle.Length <= haystack.Length )
    {
        idx = haystack.IndexOf( needle, idx, comparison );
        if( idx == -1 ) return false;

        Int32 end = idx + needle.Length;

        // Needle must be enclosed in whitespace or be at the start/end of string
        Boolean validStart = idx == 0               || Char.IsWhiteSpace( haystack[idx - 1] );
        Boolean validEnd   = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
        if( validStart && validEnd ) return true;

        idx++;
    }
    return false;
}

Approach 3: Using a CSS Selector library:

HtmlAgilityPack has somewhat stagnated and doesn't support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them: namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll, so you can use it like so:

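private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {

    // Called on the root node: Fizzler exposes QuerySelectorAll as an HtmlNode extension method.
    return doc.DocumentNode.QuerySelectorAll( "div.float" );
}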

With runtime-defined classes:

使用运行时定义的类:

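private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {

    // Produces e.g. "div.float.clearfix"; every listed class must be present.
    String selector = "div." + String.Join( ".", classNames );

    // Called on the root node: Fizzler exposes QuerySelectorAll as an HtmlNode extension method.
    return doc.DocumentNode.QuerySelectorAll( selector );
}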
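One edge case in the selector-building step: with an empty list of class-names the concatenation yields "div.", which is not a valid CSS selector. A minimal sketch of a guard, using a hypothetical BuildDivSelector helper that only handles the string construction (no selector library is involved, and class-names are assumed not to need CSS escaping):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: builds "div.a.b" and falls back to "div" when no names are given.
string BuildDivSelector(IEnumerable<string> classNames)
{
    // Ignore null, empty, or whitespace-only entries.
    string[] names = classNames.Where(n => !string.IsNullOrWhiteSpace(n)).ToArray();

    // With no class-names, "div." alone would not be a valid selector.
    return names.Length == 0 ? "div" : "div." + string.Join(".", names);
}

Console.WriteLine(BuildDivSelector(new[] { "float", "clearfix" }));
Console.WriteLine(BuildDivSelector(Array.Empty<string>()));
```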

Answer by Ryan McCarty

You can solve your issue by using the contains function within your XPath query, as below:

var allElementsWithClassFloat =
   _doc.DocumentNode.SelectNodes("//*[contains(@class,'float')]");

To reuse this in a function do something similar to the following:

要在函数中重用它,请执行类似于以下操作:

string classToFind = "float";    
var allElementsWithClassFloat = 
   _doc.DocumentNode.SelectNodes(string.Format("//*[contains(@class,'{0}')]", classToFind));
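Note that contains(@class,'float') is a substring test, so it also selects an element whose class is "unfloating". A common XPath 1.0 idiom pads the normalized attribute value with spaces and searches for the padded name instead. The sketch below demonstrates the idiom with System.Xml's XmlDocument purely to stay dependency-free; that swap is an assumption of this example, and the same expression should carry over to SelectNodes in Html Agility Pack. TokenXPath is a hypothetical helper and assumes the class-name contains no quote characters.

```csharp
using System;
using System.Xml;

// XmlDocument stands in for an HTML document here to keep the sketch self-contained.
var xml = new XmlDocument();
xml.LoadXml(
    "<root>" +
    "<div class='className float anotherclassName'/>" +
    "<p class='float'/>" +
    "<span class='unfloating'/>" +
    "</root>");

// Hypothetical helper: matches className only as a whole whitespace-separated token.
string TokenXPath(string className) =>
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

int substringHits = xml.SelectNodes("//*[contains(@class,'float')]").Count;
int tokenHits = xml.SelectNodes(TokenXPath("float")).Count;

Console.WriteLine(substringHits); // 3: div, p and span
Console.WriteLine(tokenHits);     // 2: div and p
```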

Answer by nguyen.huu.duy

You can use the following script:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => 
    d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("float")
);

Answer by Hung Cao

I used this extension method a lot in my project. Hope it will help one of you guys.

public static bool HasClass(this HtmlNode node, params string[] classValueArray)
    {
        var classValue = node.GetAttributeValue("class", "");
        var classValues = classValue.Split(' ');
        return classValueArray.All(c => classValues.Contains(c));
    }
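The extension method above splits on a single space only, so class lists separated by tabs, line breaks, or repeated spaces may not tokenize as expected. A minimal variation, shown on the raw attribute string so that it runs without Html Agility Pack (HasAllClasses is a hypothetical name), splits on any whitespace instead:

```csharp
using System;
using System.Linq;

// Hypothetical helper operating on the raw class attribute value.
bool HasAllClasses(string classAttr, params string[] wanted)
{
    // A null separator array makes Split break on any whitespace character.
    string[] tokens = classAttr.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Every requested class-name must appear as a whole token.
    return wanted.All(w => tokens.Contains(w));
}

Console.WriteLine(HasAllClasses("className\tfloat  anotherclassName", "float", "className"));
Console.WriteLine(HasAllClasses("unfloating", "float"));
```

Unlike the allocation-free approach in the accepted answer, this allocates a token array per call, which trades some performance for brevity.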

Answer by hadi.sh

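public static List<HtmlNode> GetTagsWithClass(string html, List<string> @class)
{
    // Parse the markup first; the original snippet left this step commented out.
    var htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(html);

    // Compares the whole class attribute value, so an element only matches when
    // its class attribute is exactly one of the strings in @class.
    var result = htmlDocument.DocumentNode.Descendants()
        .Where(x => x.Attributes.Contains("class") && @class.Contains(x.Attributes["class"].Value)).ToList();
    return result;
}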