C#: How to use the HTML Agility Pack

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/846994/

Date: 2020-08-05 03:57:56  Source: igfitidea

How to use HTML Agility pack

Tags: c#, html, html-agility-pack

Asked by

How do I use the HTML Agility Pack?

My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.

Answer by Ash

First, install the HtmlAgilityPack NuGet package into your project.

Then, as an example:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

// filePath is a path to a file containing the html
htmlDoc.Load(filePath);

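// Use:  htmlDoc.LoadHtml(xmlString);  to load from a string (this was previously htmlDoc.LoadXML(xmlString))

// ParseErrors is an IEnumerable<HtmlParseError> containing any errors from the Load statement
// (Count() below is the LINQ extension method, so "using System.Linq;" is required)
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)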
{
    // Handle any parse errors as required

}
else
{

    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

        if (bodyNode != null)
        {
            // Do something with bodyNode
        }
    }
}

(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)

The HtmlDocument.Load() method also accepts a stream, which is very useful when integrating with other stream-oriented classes in the .NET Framework. HtmlEntity.DeEntitize() is another useful method, for processing HTML entities correctly. (thanks Matthew)
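
As a rough illustration of the two methods just mentioned, the sketch below loads a document from a MemoryStream and then decodes an entity in the extracted text. The markup, the class name and the choice of UTF-8 are assumptions made up for this example; any readable Stream should work the same way.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class StreamLoadSketch
{
    static void Main()
    {
        // Hypothetical input: in practice the stream might come from a file,
        // a network response, or any other stream-oriented class
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><body><p>Fish &amp; chips</p></body></html>");

        using (var stream = new MemoryStream(bytes))
        {
            var doc = new HtmlDocument();
            doc.Load(stream, Encoding.UTF8);

            // SelectSingleNode returns null when nothing matches
            HtmlNode paragraph = doc.DocumentNode.SelectSingleNode("//p");
            if (paragraph != null)
            {
                // DeEntitize turns entities such as &amp; back into characters
                Console.WriteLine(HtmlEntity.DeEntitize(paragraph.InnerText));
            }
        }
    }
}
```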

HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, the library provides the SelectSingleNode and SelectNodes methods, which accept XPath expressions.
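
A minimal sketch of those two XPath methods, assuming a small made-up fragment loaded from a string (the markup and class name are hypothetical). The point it tries to show is that both methods return null rather than an empty result when nothing matches, so a null check is usually needed.

```csharp
using System;
using HtmlAgilityPack;

class XPathSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup used only for this example
        doc.LoadHtml("<div id='list'><span>one</span><span>two</span></div>");

        // SelectSingleNode: the first node matching the XPath, or null
        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@id='list']");
        if (container != null)
        {
            Console.WriteLine(container.Name);
        }

        // SelectNodes: every matching node, or null if there are none
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//div[@id='list']/span");
        if (spans != null)
        {
            foreach (HtmlNode span in spans)
            {
                Console.WriteLine(span.InnerText);
            }
        }
    }
}
```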

Pay attention to the HtmlDocument.Option?????? boolean properties. These control how the Load and LoadHtml methods will process your HTML/XHTML.
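
For instance, a sketch along these lines sets two of the option properties and then writes the document back out. Which options are appropriate depends on your input; the sloppy markup and the class name here are invented for illustration, and the exact output may vary between library versions.

```csharp
using System;
using HtmlAgilityPack;

class OptionSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();

        // Parsing options should be set before the document is loaded
        doc.OptionFixNestedTags = true;

        // Output option: ask for XML-style output when the document is saved
        doc.OptionOutputAsXml = true;

        // Hypothetical, deliberately sloppy markup (unclosed <p> and <br>)
        doc.LoadHtml("<p>first<br>second");

        // Write the processed document to the console
        doc.Save(Console.Out);
    }
}
```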

There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.

Answer by rtpHarry

I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.

The next article is 95% complete; I just have to write up explanations of the last few parts of the code I have written. If you are interested, I will try to remember to post here when I publish it.

Answer by Kent Munthe Caspersen

HtmlAgilityPack uses XPath syntax, and though many argue that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp

To parse

<h2>
  <a href="">Hyman</a>
</h2>
<ul>
  <li class="tel">
    <a href="">81 75 53 60</a>
  </li>
</ul>
<h2>
  <a href="">Roy</a>
</h2>
<ul>
  <li class="tel">
    <a href="">44 52 16 87</a>
  </li>
</ul>

I did this:

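var names = new List<string>();   // needs: using System.Collections.Generic;
var phones = new List<string>();
string url = "http://website.com";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
// SelectNodes returns null (not an empty collection) when nothing matches
var nameNodes = doc.DocumentNode.SelectNodes("//h2//a");
if (nameNodes != null)
{
  foreach (HtmlNode node in nameNodes)
  {
    names.Add(node.ChildNodes[0].InnerHtml);
  }
}
var phoneNodes = doc.DocumentNode.SelectNodes("//li[@class='tel']//a");
if (phoneNodes != null)
{
  foreach (HtmlNode node in phoneNodes)
  {
    phones.Add(node.ChildNodes[0].InnerHtml);
  }
}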

Answer by ibrahim ozboluk

public string HtmlAgi(string url, string key)
{
    var webGet = new HtmlWeb();
    var doc = webGet.Load(url);

    // Look up the <meta> tag whose name attribute equals key
    HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name='{0}']", key));

    if (ourNode != null)
    {
        // The second argument is the default returned when "content" is absent
        return ourNode.GetAttributeValue("content", "");
    }
    else
    {
        return "not found";
    }
}

Answer by captainsac

The main HtmlAgilityPack-related code is as follows:

using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace GetMetaData
{
    /// <summary>
    /// Summary description for MetaDataWebService
    /// </summary>
    [WebService(Namespace = "http://tempuri.org/")]
    [WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
    [System.ComponentModel.ToolboxItem(false)]
    // The following attribute allows this Web Service to be called from script, using ASP.NET AJAX.
    [System.Web.Script.Services.ScriptService]
    public class MetaDataWebService: System.Web.Services.WebService
    {
        [WebMethod]
        [ScriptMethod(UseHttpGet = false)]
        public MetaData GetMetaData(string url)
        {
            MetaData objMetaData = new MetaData();

            //Get Title
            WebClient client = new WebClient();
            string sourceUrl = client.DownloadString(url);

            // The @ must sit directly before the opening quote of a verbatim string
            objMetaData.PageTitle = Regex.Match(sourceUrl,
                @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

            //Method to get Meta Tags
            objMetaData.MetaDescription = GetMetaDescription(url);
            return objMetaData;
        }

        private string GetMetaDescription(string url)
        {
            string description = string.Empty;

            //Get Meta Tags
            var webGet = new HtmlWeb();
            var document = webGet.Load(url);
            var metaTags = document.DocumentNode.SelectNodes("//meta");

            if (metaTags != null)
            {
                foreach(var tag in metaTags)
                {
                    if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
                    {
                        description = tag.Attributes["content"].Value;
                    }
                }
            } 
            else
            {
                description = string.Empty;
            }
            return description;
        }
    }
}

Answer by Meysam

Getting Started - HTML Agility Pack

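// Three alternative ways to load a document; use whichever one applies
// (each declares its own doc, so they cannot share one scope as written)

// From File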
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

Answer by PK-1825

Try this:

string htmlBody = ParseHtmlBody(dtViewDetails.Rows[0]["Body"].ToString());

private string ParseHtmlBody(string html)
{
    string body = string.Empty;
    try
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectSingleNode returns null when the input has no <body> element
        var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
        if (htmlBody != null)
        {
            body = htmlBody.OuterHtml;
        }
    }
    catch (Exception ex)
    {
        dalPendingOrders.LogMessage("Error in ParseHtmlBody: " + ex.Message);
    }
    return body;
}