C# HTML Agility Pack - Parsing Tables

Question

Asked by weismat

I want to use the HTML agility pack to parse tables from complex web pages, but I am somehow lost in the object model.

I looked at the link example, but did not find any table data that way. Can I use XPath to get the tables? After loading the data, I am basically lost as to how to get at the tables. I have done this in Perl before (with HTML::TableParser); it was a bit clumsy, but it worked.

I would also be happy if someone could just shed some light on the right order of objects to use for the parsing.

Answer 1

Accepted answer by Marc Gravell

How about something like this, using the HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}

Note that you can make it prettier with LINQ-to-Objects if you want:

var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new {Table = table.Id, CellText = cell.InnerText};

foreach(var cell in query) {
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
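
One thing to watch for with the loops above: `SelectNodes` returns null rather than an empty collection when the XPath matches nothing, so a page without tables (or a row without cells) would throw in the `foreach`. Below is a minimal sketch of a null-safe variant. The helper name `SelectOrEmpty` is made up for this illustration and is not part of the HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: turns a null result from SelectNodes into an empty sequence.
    static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode node, string xpath)
    {
        return node.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Deliberately contains no <table>, to exercise the null case.
        doc.LoadHtml("<html><body><p>no tables here</p></body></html>");

        var query = from table in SelectOrEmpty(doc.DocumentNode, "//table")
                    from row in SelectOrEmpty(table, "tr")
                    from cell in SelectOrEmpty(row, "th|td")
                    select new { Table = table.Id, CellText = cell.InnerText };

        // With no matching nodes the loop body simply never runs.
        foreach (var cell in query)
        {
            Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
        }
    }
}
```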

Answer 2

Answered by Coda

The simplest way I've found to get the XPath for a particular element is to install the Firebug extension for Firefox, go to the web page, and press F12 to bring up Firebug. Right-click the element on the page that you want to query and select "Inspect Element"; Firebug will select the element in its IDE. Then right-click the element in Firebug and choose "Copy XPath". This gives you the exact XPath query you need to get the element you want using the HTML Agility Pack.
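
As a rough sketch of how a copied XPath could then be used: the path below is a hypothetical example of what a browser tool might produce, not one taken from a real page. One caveat, stated as an assumption about typical behavior: browsers add a `<tbody>` element to their DOM even when the markup has none, while the HTML Agility Pack works on the markup as written, so a copied path may need its `tbody` step removed before it matches.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><body><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");

        // Hypothetical XPath as a browser tool might copy it (note the tbody step).
        string copied = "/html/body/table/tbody/tr[2]/td";

        // The markup above contains no <tbody>, so drop that step before querying.
        string adjusted = copied.Replace("/tbody", "");

        // SelectSingleNode returns null when nothing matches.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(adjusted);
        Console.WriteLine(cell != null ? cell.InnerText : "(no match)");
    }
}
```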

Answer 3

Answered by rk42

This line from the answer above:

HtmlDocument doc = new HtmlDocument();

This doesn't work in VS 2015 C#. You cannot construct an HtmlDocument any more.

Another MS "feature" that makes things more difficult to use. Try HtmlAgilityPack.HtmlWeb and check out this link for some sample code.
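
A minimal sketch of the `HtmlWeb` route, assuming the HtmlAgilityPack package is referenced; the URL is a placeholder. The return type is written fully qualified as `HtmlAgilityPack.HtmlDocument` so it cannot be confused with `System.Windows.Forms.HtmlDocument`, which has no public constructor. Whether that name clash is what caused the error described here is a guess, not something this thread confirms.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Placeholder URL; replace it with the page you want to parse.
        var web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("https://example.com/");

        // SelectNodes returns null when the page has no <table> elements.
        var tables = doc.DocumentNode.SelectNodes("//table");
        Console.WriteLine("Tables found: " + (tables == null ? 0 : tables.Count));
    }
}
```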

Answer 4

Answered by Shibumi Tait

In my case, there is a single table which happens to be a device list from a router. If you wish to read the table using TR/TH/TD (row, header, data) instead of a matrix as mentioned above, you can do something like the following:

List<TableRow> deviceTable = (from table in document.DocumentNode.SelectNodes(XPathQueries.SELECT_TABLE)
                              from row in table?.SelectNodes(HtmlBody.TR)
                              // keep only rows whose first child is a header cell
                              where row.FirstChild?.OriginalName == HtmlBody.T_HEADER
                              select new TableRow
                              {
                                  Header = row.SelectSingleNode(HtmlBody.T_HEADER)?.InnerText,
                                  Data = row.SelectSingleNode(HtmlBody.T_DATA)?.InnerText
                              }).ToList();

TableRow is just a simple object with Header and Data as properties. The approach takes care of null-ness and this case:

<tr>
    <td width="28%">&nbsp;</td>
</tr>

which is a row without a header. The HtmlBody object and the constants hanging off it can probably be readily deduced, but I apologize for it all the same. I come from a world where, if you have a " in your code, it should be either a constant or localizable.
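
For readers who would rather not deduce them, here is one guess at the supporting types the query above relies on. These definitions are hypothetical: only the `Header` and `Data` properties of `TableRow` are described in this answer, and the constant values are assumptions inferred from how the names are used.

```csharp
// Simple holder for one header/data pair, as described in the answer.
public class TableRow
{
    public string Header { get; set; }
    public string Data { get; set; }
}

// Assumed values: element names used as relative XPath steps.
public static class HtmlBody
{
    public const string TR = "tr";
    public const string T_HEADER = "th";
    public const string T_DATA = "td";
}

// Assumed value: selects every table in the document.
public static class XPathQueries
{
    public const string SELECT_TABLE = "//table";
}
```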

Answer 5

Answered by B. Miller

I know this is a pretty old question, but this was my solution. It helped me visualize the table so that I could create a class structure for it. It also uses the HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
var table = doc.DocumentNode.SelectSingleNode("//table");
var tableRows = table.SelectNodes("tr");
var columns = tableRows[0].SelectNodes("th/text()");
for (int i = 1; i < tableRows.Count; i++)
{
    for (int e = 0; e < columns.Count; e++)
    {
        var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
        // value is null when a row has fewer cells than the header row
        Console.Write(columns[e].InnerText + ": " + value?.InnerText + "  ");
    }
    Console.WriteLine();
}
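
Going one step further than printing, the same header-to-cell pairing can be collected into records before you commit to a class structure. The following is only a sketch: the two-column sample table and its values are invented for illustration, and it assumes the first row holds the headers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Invented sample data: first row is the header row.
        doc.LoadHtml(@"<table><tr><th>name</th><th>ip</th></tr><tr><td>laptop</td><td>10.0.0.2</td></tr></table>");

        var rows = doc.DocumentNode.SelectNodes("//table/tr");
        var headers = rows[0].SelectNodes("th").Select(h => h.InnerText.Trim()).ToList();

        var records = new List<Dictionary<string, string>>();
        foreach (var row in rows.Skip(1))
        {
            var cells = row.SelectNodes("td");
            if (cells == null) continue; // skip rows without data cells

            // Pair each header with the cell in the same position.
            var record = new Dictionary<string, string>();
            for (int i = 0; i < headers.Count && i < cells.Count; i++)
            {
                record[headers[i]] = cells[i].InnerText.Trim();
            }
            records.Add(record);
        }

        foreach (var record in records)
        {
            Console.WriteLine(string.Join(", ", record.Select(kv => kv.Key + "=" + kv.Value)));
        }
    }
}
```

Once the keys look right, each dictionary maps naturally onto a class with one property per column.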
