Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18090626/

Date: 2020-08-10 11:15:18  Source: igfitidea

Import data from HTML table to DataTable in C#

Tags: c#, html, datatable, html-agility-pack

Asked by user2658455

I want to import some data from an HTML table (here is a link: http://road2paris.com/wp-content/themes/roadtoparis/api/generated_table_august.html) and display the first 16 people in a DataGridView in my Forms application. From what I've read, the best way to do this is to use the HTML Agility Pack, so I downloaded it and added it to my project. I understand that the first step is to load the content of the HTML file. This is the code I used to do so:

        string htmlCode = "";
        using (WebClient client = new WebClient())
        {
            client.Headers.Add(HttpRequestHeader.UserAgent, "AvoidError");
            htmlCode = client.DownloadString("http://road2paris.com/wp-content/themes/roadtoparis/api/generated_table_august.html");
        }
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

        doc.LoadHtml(htmlCode);

And then I got stuck. I don't know how to fill my DataTable with data from the HTML table. I've tried various solutions, but nothing seems to work properly. I'd be glad if anyone could help me with that.

Accepted answer by Sergey Berezovskiy

// Requires: using System.Data; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var headers = doc.DocumentNode.SelectNodes("//tr/th");
DataTable table = new DataTable();
foreach (HtmlNode header in headers)
    table.Columns.Add(header.InnerText); // create columns from th
// select rows with td elements
foreach (var row in doc.DocumentNode.SelectNodes("//tr[td]"))
    table.Rows.Add(row.SelectNodes("td").Select(td => td.InnerText).ToArray());

You'll need the HTML Agility Pack library to use this code.
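
The question also asks for only the first 16 people. One possible way to trim a filled DataTable before binding it is sketched below. `DataTableExtras` and `TakeFirstRows` are hypothetical names introduced for this illustration; they are not part of the answer or of any library, and the sketch uses only `System.Data`.

```csharp
using System;
using System.Data;

public static class DataTableExtras
{
    // Hypothetical helper (not from the answer): copies at most `count` rows
    // from the top of `source` into a new table with the same columns.
    public static DataTable TakeFirstRows(DataTable source, int count)
    {
        DataTable result = source.Clone();               // same schema, no rows
        int limit = Math.Min(count, source.Rows.Count);  // tolerate short tables
        for (int i = 0; i < limit; i++)
            result.ImportRow(source.Rows[i]);            // copies the row's values
        return result;
    }
}
```

Assuming your grid control is named `dataGridView1`, it could then be bound with `dataGridView1.DataSource = DataTableExtras.TakeFirstRows(table, 16);`.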

Answer by 3 Beer Minimum

Below is code that prevents duplicate column headers. When you create a DataTable, each column must have a unique name. Also, an HTML row sometimes has more cells than the table has columns, and you then have to add extra columns to the DataTable; otherwise you will lose data. This has been my solution.

public enum DuplicateHeaderReplacementStrategy
{
    AppendAlpha,
    AppendNumeric,
    Delete
}

public class HtmlServices
{
    private static readonly string[] Alpha = new[] { "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" };

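    // Makes repeated <th> captions unique within each table so they can serve as DataTable column names.
    public static HtmlDocument RenameDuplicateHeaders(HtmlDocument doc, DuplicateHeaderReplacementStrategy strategy)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode?.SelectNodes("//table");
        if (tables == null)
            return doc;

        foreach (HtmlNode table in tables)
        {
            // ".//th" finds header cells at any depth; a plain "th" only matches direct children of <table>.
            var duplicateGroups = table.SelectNodes(".//th")?
                .GroupBy(th => th.InnerText.Trim())   // group by caption text, not by node reference
                .Where(g => g.Count() > 1)
                .ToList();
            if (duplicateGroups == null)
                continue;

            foreach (var group in duplicateGroups)
            {
                var index = 0;
                // Keep the first occurrence as it is and rename only the repeats.
                foreach (HtmlNode th in group.Skip(1))
                {
                    switch (strategy)
                    {
                        case DuplicateHeaderReplacementStrategy.AppendNumeric:
                            th.InnerHtml += index++;
                            break;

                        case DuplicateHeaderReplacementStrategy.AppendAlpha:
                            // Wrap around so more than 26 repeats cannot index past the end of Alpha.
                            th.InnerHtml += Alpha[index++ % Alpha.Length];
                            break;

                        case DuplicateHeaderReplacementStrategy.Delete:
                            th.InnerHtml = string.Empty;
                            break;
                    }
                }
            }
        }
        return doc;
    }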
}


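// Place this method inside a class (for example HtmlServices above).
// Requires: using System; using System.Data; using System.Diagnostics; using System.Linq; using System.Net; using HtmlAgilityPack;
public static DataTable GetDataTableFromHtmlTable(string url, string[] htmlIds)
{
    // htmlIds is unused in the code as posted; it is kept so the signature is unchanged.
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);

    doc = HtmlServices.RenameDuplicateHeaders(doc, DuplicateHeaderReplacementStrategy.AppendNumeric);

    DataTable table = new DataTable();

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var headers = doc.DocumentNode.SelectNodes("//tr/th");
    if (headers != null)
    {
        foreach (HtmlNode header in headers)
        {
            // Column names must be unique: append the first free numeric suffix if the name is taken.
            // Columns.Contains stands in for the ColumnExists extension method, which the answer does not show.
            string name = header.InnerText;
            int columnIteration = 0;
            while (table.Columns.Contains(name))
                name = header.InnerText + columnIteration++;
            table.Columns.Add(name); // create columns from th
        }
    }

    // select rows with td elements
    var rows = doc.DocumentNode.SelectNodes("//tr[td]");
    if (rows == null)
        return table;

    foreach (var row in rows)
    {
        // StripHtmlTables() is the answer author's own extension method and is not shown here;
        // td.InnerText is the simplest substitute if you do not have it.
        var addRow = row.SelectNodes("td").Select(td => td.InnerHtml.StripHtmlTables()).ToArray();

        // A row with more cells than columns cannot be added, so add columns until it fits.
        // Numbering continues from the current column count so extra-column names do not repeat across rows.
        while (table.Columns.Count < addRow.Length)
            table.Columns.Add($"ExtraColumn {table.Columns.Count + 1}");

        try
        {
            table.Rows.Add(addRow);
        }
        catch (Exception e)
        {
            Debug.Print(e.Message);
        }
    }
    return table;
}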
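
To see why this answer guards against both duplicate headers and over-long rows, the following minimal sketch reproduces the two failures on a plain DataTable, with no HTML involved. The class and method names (`DataTablePitfalls`, `DuplicateColumnThrows`, `OverlongRowThrows`) are made up for this illustration.

```csharp
using System;
using System.Data;

public static class DataTablePitfalls
{
    // Returns true if adding a second column with the same name is rejected.
    public static bool DuplicateColumnThrows()
    {
        var table = new DataTable();
        table.Columns.Add("Name");
        try
        {
            table.Columns.Add("Name");   // same name a second time
            return false;
        }
        catch (DuplicateNameException)
        {
            return true;
        }
    }

    // Returns true if a row with more values than columns is rejected.
    public static bool OverlongRowThrows()
    {
        var table = new DataTable();
        table.Columns.Add("Name");
        try
        {
            table.Rows.Add("Alice", "extra cell");   // two values, one column
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine("Duplicate column rejected: " + DuplicateColumnThrows());
        Console.WriteLine("Over-long row rejected: " + OverlongRowThrows());
    }
}
```

In the answer's code the second failure is swallowed by the try/catch around `table.Rows.Add`, which is why a row that does not fit would otherwise be dropped without any visible error.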