C# HTML Agility Pack - parsing tables
Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors on StackOverflow (not to this page).
Original question: http://stackoverflow.com/questions/655603/
HTML Agility pack - parsing tables
Asked by weismat
I want to use the HTML agility pack to parse tables from complex web pages, but I am somehow lost in the object model.
I looked at the link example, but did not find any table data that way.
Can I use XPath to get the tables? After loading the data, I am basically lost as to how to get at the tables. I have done this in Perl before (with HTML::TableParser); it was a bit clumsy, but it worked.
I would also be happy if someone could just shed some light on the right order of objects to walk when parsing.
Accepted answer by Marc Gravell
How about something like the following, using the HTML Agility Pack:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
        Console.WriteLine("Found: " + table.Id);
        foreach (HtmlNode row in table.SelectNodes("tr")) {
            Console.WriteLine("row");
            foreach (HtmlNode cell in row.SelectNodes("th|td")) {
                Console.WriteLine("cell: " + cell.InnerText);
            }
        }
    }
Note that you can make it prettier with LINQ-to-Objects if you want:
    var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
                from row in table.SelectNodes("tr").Cast<HtmlNode>()
                from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
                select new {Table = table.Id, CellText = cell.InnerText};
    foreach(var cell in query) {
        Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
    }
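One caveat when adapting the loops above to real pages: in the HTML Agility Pack versions I am aware of, SelectNodes returns null rather than an empty collection when nothing matches, so a table-free page or an empty row would throw. Below is a minimal sketch of a null-tolerant variant. The TableReader class and ReadRows method are hypothetical names introduced for illustration, and using ".//tr" (so that rows inside thead/tbody are found) is my own assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableReader
{
    // Hypothetical helper: returns every row of every table as a list of cell texts.
    public static List<List<string>> ReadRows(HtmlDocument doc)
    {
        var result = new List<List<string>>();

        // SelectNodes may return null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> tables =
            doc.DocumentNode.SelectNodes("//table") ?? Enumerable.Empty<HtmlNode>();

        foreach (HtmlNode table in tables)
        {
            // ".//tr" also finds rows nested inside <thead>/<tbody>;
            // note that it would also pick up rows of nested tables.
            IEnumerable<HtmlNode> rows =
                table.SelectNodes(".//tr") ?? Enumerable.Empty<HtmlNode>();

            foreach (HtmlNode row in rows)
            {
                IEnumerable<HtmlNode> cells =
                    row.SelectNodes("th|td") ?? Enumerable.Empty<HtmlNode>();
                result.Add(cells.Select(c => c.InnerText.Trim()).ToList());
            }
        }
        return result;
    }
}
```

Calling TableReader.ReadRows(doc) on a document without tables simply yields an empty list instead of throwing.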
Answer by Coda
The simplest way I've found to get the XPath for a particular element is to install the Firebug extension for Firefox. Go to the site or web page and press F12 to bring up Firebug, then right-click the element on the page that you want to query and select "Inspect Element". Firebug will select the element in its own view. Right-click the element in Firebug and choose "Copy XPath". This gives you the exact XPath query you need to get the element you want using the HTML Agility Pack.
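As a rough sketch of how a copied XPath might then be used, the snippet below passes it to SelectSingleNode. The CopiedXPathDemo class and TextAt method are hypothetical names, and both paths are made-up examples rather than output from a real browser session. One assumption worth flagging: browser tools report the path in the browser's DOM, which usually contains a tbody element that the raw HTML may lack, and as far as I know the HTML Agility Pack does not add it, so a copied path may need the tbody step removed.

```csharp
using System;
using HtmlAgilityPack;

static class CopiedXPathDemo
{
    // Hypothetical helper: text of the first node matching the XPath, or null if nothing matches.
    public static string TextAt(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><table id=\"foo\"><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");

        // Path as a browser tool might report it (with the browser-inserted tbody step).
        Console.WriteLine(TextAt(doc, "/html/body/table/tbody/tr[2]/td") ?? "(no match)");

        // Same path with the tbody step removed to match the raw markup.
        Console.WriteLine(TextAt(doc, "/html/body/table/tr[2]/td") ?? "(no match)");
    }
}
```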
Answer by rk42
A line from the answer above:
HtmlDocument doc = new HtmlDocument();
This doesn't work in VS 2015 C#. You cannot construct an HtmlDocument any more.
Another MS "feature" that makes things more difficult to use. Try HtmlAgilityPack.HtmlWeb and check out this link for some sample code.
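For illustration, here is a minimal sketch of loading a page through HtmlAgilityPack.HtmlWeb. HtmlWeb.Load returns an HtmlAgilityPack.HtmlDocument, so fully qualifying the type (or importing the HtmlAgilityPack namespace, as here) avoids any clash with other classes named HtmlDocument. The HtmlWebDemo class, the CountTables helper and the example.com URL are placeholders of mine, not taken from the answer or from the link it mentions.

```csharp
using System;
using HtmlAgilityPack;

static class HtmlWebDemo
{
    // Hypothetical helper: counts <table> elements, treating a null result as zero.
    public static int CountTables(HtmlDocument doc)
    {
        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
        return tables == null ? 0 : tables.Count;
    }

    static void Main()
    {
        var web = new HtmlWeb();

        // Placeholder URL; substitute the page you actually want to parse.
        HtmlDocument doc = web.Load("http://example.com/");

        Console.WriteLine(CountTables(doc) + " table(s) found");
    }
}
```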
Answer by Shibumi Tait
In my case, there is a single table which happens to be a device list from a router. If you wish to read the table using TR/TH/TD (row, header, data) instead of a matrix as mentioned above, you can do something like the following:
    List<TableRow> deviceTable = (from table in document.DocumentNode.SelectNodes(XPathQueries.SELECT_TABLE)
                                  from row in table?.SelectNodes(HtmlBody.TR)
                                  // keep only rows whose first child is a header cell
                                  where row.FirstChild.OriginalName != null && row.FirstChild.OriginalName.Equals(HtmlBody.T_HEADER)
                                  select new TableRow
                                  {
                                      Header = row.SelectSingleNode(HtmlBody.T_HEADER)?.InnerText,
                                      Data = row.SelectSingleNode(HtmlBody.T_DATA)?.InnerText
                                  }).ToList();
TableRow is just a simple object with Header and Data as properties. The approach takes care of null-ness and of this case:
<tr>
<td width="28%"> </td>
</tr>
which is a row without a header. The HtmlBody object and the constants hanging off it can probably be deduced readily, but I apologize for them all the same. I come from a world where, if you have a " in your code, it should be either a constant or localizable.
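Since the answer does not show its supporting types, here is one guess at what they might look like. Everything in this sketch is an assumption inferred from how the names are used in the query: the TableRow properties, the HtmlBody constant values and the XPathQueries.SELECT_TABLE expression are illustrative, not the author's actual definitions.

```csharp
// Assumed shape only: a plain object with the two properties the query fills in.
public class TableRow
{
    public string Header { get; set; }
    public string Data { get; set; }
}

// Assumed values: element names used as relative XPath steps.
public static class HtmlBody
{
    public const string TR = "tr";
    public const string T_HEADER = "th";
    public const string T_DATA = "td";
}

// Assumed value: selects every table in the document.
public static class XPathQueries
{
    public const string SELECT_TABLE = "//table";
}
```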
Answer by B. Miller
I know this is a pretty old question, but this is the solution that helped me visualize the table so that I could create a class structure for it. It also uses the HTML Agility Pack.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
    var table = doc.DocumentNode.SelectSingleNode("//table");
    var tableRows = table.SelectNodes("tr");
    // The first row supplies the column names.
    var columns = tableRows[0].SelectNodes("th/text()");
    for (int i = 1; i < tableRows.Count; i++)
    {
        for (int e = 0; e < columns.Count; e++)
        {
            // XPath positions are 1-based, hence e + 1.
            var value = tableRows[i].SelectSingleNode($"td[{e + 1}]");
            // value is null when the row has fewer cells than there are headers.
            Console.Write(columns[e].InnerText + ":" + value?.InnerText + " ");
        }
        Console.WriteLine();
    }
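Going one step further toward a class structure, the header/value pairing above could be collected into one dictionary per data row. This is only a sketch: the TableToRecords class and Read method are hypothetical names, and it assumes the first row holds th header cells and every later row holds td cells, as in the sample markup.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableToRecords
{
    // Hypothetical helper: treats the first row as headers and returns one dictionary per data row.
    public static List<Dictionary<string, string>> Read(HtmlNode table)
    {
        var records = new List<Dictionary<string, string>>();
        HtmlNodeCollection rows = table.SelectNodes("tr");
        if (rows == null || rows.Count == 0) return records;

        // Header texts from the first row; an empty list if it has no <th> cells.
        List<string> headers = (rows[0].SelectNodes("th") ?? Enumerable.Empty<HtmlNode>())
            .Select(h => h.InnerText.Trim())
            .ToList();

        foreach (HtmlNode row in rows.Skip(1))
        {
            List<HtmlNode> cells = (row.SelectNodes("td") ?? Enumerable.Empty<HtmlNode>()).ToList();
            var record = new Dictionary<string, string>();

            // Pair cells with headers by position; extra cells or headers are ignored.
            for (int i = 0; i < headers.Count && i < cells.Count; i++)
            {
                record[headers[i]] = cells[i].InnerText.Trim();
            }
            records.Add(record);
        }
        return records;
    }
}
```

The keys of each dictionary then suggest the properties a dedicated class for the table would need.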