Is there a LINQ to HTML or some other good .NET HTML manipulation API for C#?

Question

Asked by Doctor Jones

I have a C# WPF application that needs to consume data that is exposed on a webpage as a HTML table.

After getting inspiration from this URL, I tried using LINQ to XML to parse the HTML document, but this only works if the HTML document is extremely well formed (and doesn't have any comments or HTML entities inside it). I have managed to get a working solution using this technique, but it is far from ideal.
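
To make the limitation concrete, here is a minimal sketch of the LINQ to XML approach described above. The class and method names (`LinqToXmlHtmlDemo`, `TryReadCells`) are made up for illustration; only `XElement.Parse` and the LINQ operators come from the framework.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class LinqToXmlHtmlDemo
{
    // Tries to read the text of every <td> cell using LINQ to XML.
    // Returns false when the markup is not well-formed XML.
    public static bool TryReadCells(string markup, out string[] cells)
    {
        try
        {
            XElement root = XElement.Parse(markup);
            cells = root.Descendants("td").Select(td => td.Value).ToArray();
            return true;
        }
        catch (XmlException)
        {
            // Unclosed tags and undeclared entities such as &nbsp; end up here.
            cells = Array.Empty<string>();
            return false;
        }
    }
}
```

With `<table><tr><td>1</td><td>2</td></tr></table>` the call succeeds and yields the two cells. With `<table><tr><td>1<td>2</table>` (unclosed cells) or `<table><tr><td>&nbsp;</td></tr></table>` (an HTML entity that XML does not declare) it returns false, which is the brittleness the question is about.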

I am after a solution that is intended for parsing HTML. I have hacked "solutions" before, but they are brittle. I am after a robust way of parsing/manipulating the document. I'd ideally like something that makes the task as easy as it would be from JavaScript/jQuery.

Does anyone know of a good .NET library or utility for parsing/manipulating HTML?

Answer 1

Accepted answer by LaptopHeaven

~~Even though it's not LINQ based,~~ I suggest researching the HTML Agility Pack from CodePlex.

Note: Html Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface).

From the HTML Agility Pack page:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
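
As a rough illustration of what reading an HTML table with the library might look like, here is a small sketch using its LINQ-style `Descendants`/`Elements` methods. It assumes the `HtmlAgilityPack` NuGet package is referenced; `TableScraper` and `ReadRows` are names invented for this example, and a page with nested tables would need a narrower query.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the library.
public static class TableScraper
{
    // Parses an HTML string and returns each table row as a list of cell texts.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("tr")
            .Select(tr => tr.Elements("td")
                // DeEntitize turns entities such as &amp; back into plain characters.
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList())
            .Where(cells => cells.Count > 0) // skip rows with no <td>, e.g. header rows
            .ToList();
    }
}
```

The same rows could also be selected with XPath via `doc.DocumentNode.SelectNodes("//table//tr")`; note that `SelectNodes` returns null rather than an empty collection when nothing matches.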

Answer 2

Answered by Dave Swersky

HTML is rarely well-formed enough that you could reliably use LINQ to XML. It's conceivable that you might find an HTML "cleaner" that could fix the formatting well enough to be read, but there's no telling how robust it would be.

I assume this is a "screen scraper" that reads from an HTML table over which you have no control. Don't stress over robustness in this case; screen-scraping is inherently brittle. If your requirements are set in stone, design the scraper to be easily updatable if/when the HTML you are scraping changes.
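
One possible way to keep a scraper "easily updatable" is to hold the selectors as data rather than scattering them through the code, so a layout change only means editing a couple of strings. The sketch below assumes the HtmlAgilityPack NuGet package; `TableSelectors`, `ConfigurableScraper`, and the XPath values are all hypothetical placeholders, not taken from any real page.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical settings type; the default XPath values are placeholders.
public sealed class TableSelectors
{
    public string RowXPath { get; set; } = "//table[@id='prices']//tr";
    public string CellXPath { get; set; } = "./td";
}

public static class ConfigurableScraper
{
    // Extracts rows using only the XPath expressions supplied in the selectors object.
    public static List<string[]> Scrape(string html, TableSelectors selectors)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes(selectors.RowXPath);
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes(selectors.CellXPath);
            if (cells == null) continue; // e.g. a header row that only has <th>
            result.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return result;
    }
}
```

The selectors could just as well be loaded from a configuration file; an empty result is then a cheap signal that the page layout has changed and the expressions need updating.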

Answer 3

Answered by AndyM

I had to do this in a recent project and I used LINQ to XML. If you know it's always going to be clean XHTML then you can probably recursively copy the DOM pretty easily, but I used the DevComponents HTMLDocument class library (http://www.devcomponents.com/htmldoc/) to convert HTML to XML then pulled that into an XElement. This reduces the challenge to getting your HTML into an XElement hierarchy. The one caveat is it chokes on script elements, so I deleted those by brute force.

    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    /// <summary>
    /// Extracts an HtmlDocument DOM to an XElement DOM that can be queried using LINQ to XML.
    /// </summary>
    /// <param name="htmlDocument">HtmlDocument containing DOM of page to extract.</param>
    /// <returns>HTML content as <see cref="XElement" /> for consumption by LINQ to XML.</returns>
    public XElement ExtractXml(HtmlDocument htmlDocument) {
        // HtmlDocument here is the DevComponents class, which converts its DOM to an XmlDocument.
        XmlDocument xmlDoc = htmlDocument.ToXMLDocument();

        // Find and remove all script tags from the XML DOM, or XElement.Parse will choke on them.
        // Collect the nodes first: GetElementsByTagName returns a live list that
        // should not be modified while it is being enumerated.
        IList<XmlNode> nodes = new List<XmlNode>();
        foreach (XmlNode node in xmlDoc.GetElementsByTagName("script"))
            nodes.Add(node);
        foreach (XmlNode node in nodes)
            node.ParentNode.RemoveChild(node);

        // Re-parse the cleaned markup as an XElement tree.
        return XElement.Parse(xmlDoc.OuterXml);
    }
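
Once the page is in an `XElement`, pulling out the table is an ordinary LINQ to XML query. The sketch below is one way it might look; `XElementTableQuery` and `ReadRows` are invented names, and matching on `LocalName` is a defensive assumption in case the converted document places its elements in the XHTML namespace.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for querying the XElement returned by a method like ExtractXml.
public static class XElementTableQuery
{
    // Returns every table row as an array of trimmed cell texts.
    public static List<string[]> ReadRows(XElement root)
    {
        return root.Descendants()
            // Compare LocalName so the query works with or without an XML namespace.
            .Where(e => e.Name.LocalName == "tr")
            .Select(tr => tr.Elements()
                .Where(e => e.Name.LocalName == "td")
                .Select(td => td.Value.Trim())
                .ToArray())
            .Where(cells => cells.Length > 0) // skip rows without <td> cells
            .ToList();
    }
}
```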

Answer 4

Answered by Frank Schwieterman

I've posted some code providing "LINQ to HTML" functionality here:

Looking for C# HTML parser

Answer 5

Answered by keith

There's a LINQ to HTML library here:

http://www.superstarcoders.com/linq-to-html.aspx
