C#: Is the Html Agility Pack still the best .NET HTML parser?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/1065031/
Is the Html Agility Pack still the best .NET HTML parser?
Asked by Ian Ringrose
Html Agility Pack was given as the answer to a StackOverflow question some time ago. Is it still the best option? What other options should be considered? Is there something more lightweight?
Accepted answer by Matthew Zielonka.co.uk
There is a spreadsheet with the comparisons.
In summary:
CsQuery Performance vs. Html Agility Pack and Fizzler
I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:
- The sizzle test document (about 11 k)
- The wikipedia entry for "cheese" (about 170 k)
- The single-page HTML 5 spec (about 6 megabytes)
The overall results are:
- HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.
- CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.
- CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).
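To make the last point concrete: an nth-child "formula" has the form an+b and matches every 1-based sibling position that equals a*n + b for some integer n >= 0, whereas a "simple value" is just a fixed position. The sketch below illustrates that rule in plain C#. It is not code from Fizzler or CsQuery, and the names NthChildDemo, Matches, and MatchingPositions are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the an+b rule; not part of any library.
public static class NthChildDemo
{
    // True when the 1-based position `index` equals a*n + b for some integer n >= 0.
    public static bool Matches(int a, int b, int index)
    {
        if (a == 0) return index == b;          // simple value, e.g. nth-child(3)
        int diff = index - b;
        return diff % a == 0 && diff / a >= 0;  // n = diff / a must be a non-negative integer
    }

    // Collects the matching positions among `siblingCount` siblings.
    public static List<int> MatchingPositions(int a, int b, int siblingCount)
    {
        var result = new List<int>();
        for (int i = 1; i <= siblingCount; i++)
            if (Matches(a, b, i)) result.Add(i);
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", MatchingPositions(0, 3, 6)));  // nth-child(3)    prints 3
        Console.WriteLine(string.Join(",", MatchingPositions(2, 1, 6)));  // nth-child(2n+1) prints 1,3,5
        Console.WriteLine(string.Join(",", MatchingPositions(-1, 3, 6))); // nth-child(-n+3) prints 1,2,3
    }
}
```

An engine that only supports simple values handles the first case but not the other two.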
Answer by J.W.
Answer by gimel
If you are prepared to look outside the .NET world, the Python SO community recommends Beautiful Soup, for example html-parser-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
Answer by csharptest.net
Html Agility Pack was given as the answer to a StackOverflow question some time ago
The Html Agility Pack is still an outstanding solution for parsing HTML.
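For readers who have not used it, here is a minimal sketch of typical Html Agility Pack usage: load a string of imperfect HTML and pull out link targets with XPath. It assumes the HtmlAgilityPack NuGet package is referenced; the class name LinkExtractor and the sample markup are invented for illustration and are not from the original answer.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, for illustration only.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of real-world, non-well-formed markup

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        return links;
    }

    public static void Main()
    {
        // Sample input with mixed quoting and missing closing tags.
        string html = "<html><body><a href='/one'>One</a> <a href=\"/two\">Two</a><a>No href</a></body>";
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);
    }
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException when a page contains no matching nodes.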
is it still the best option?
Best? Well, that all depends on the task at hand, but generally I think so. There are occasions when it falls short of ideal, but generally it will do a great job.
Is there something more lightweight?
You could try this: http://csharptest.net/browse/src/Library/Html/
It's nothing more than a handful of source files that pick apart HTML/XML via Regex. It supports a lightweight DOM and XPath but not much else. (help contents)
[Example]
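public void TestParse() {
    // Deliberately not well-formed XML: unquoted attribute value and an unclosed <html>.
    string notxml = "<html id=a ><body foo='bar' bar=\"foo\" />";
    // Build the lightweight DOM and take its root element.
    var html = new HtmlLightDocument(notxml).Root;
    Assert.AreEqual("html", html.TagName);
    // The root carries exactly one attribute, id="a".
    Assert.AreEqual(1, html.Attributes.Count);
    Assert.AreEqual("a", html.Attributes["id"]);
    // The self-closed <body> is the only child of <html>.
    Assert.AreEqual(1, html.Children.Count);
}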
Alternatively, you can use the parser directly instead of building a DOM tree. Just implement the IXmlLightReader interface and call the static XmlLightParser.Parse method.
PS: It was written to settle an in-house debate: that Regex can parse HTML! Since then we have found many uses for it, because it is lightweight enough to embed anywhere. There are still ways to confuse the DOM hierarchy builder, but I haven't found any HTML the parser won't handle.
Answer by Jamie Treworgy
When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser. This is the same code base used by Gecko-based browsers (e.g. Firefox). The repo looks a bit dusty, but don't be fooled: the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).
I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.
The repo I linked to above is the original port; it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.
Answer by Ewerton
"Best" is a very relative term. For your question, I imagine you are searching for a reliable tool, so I think reliability should be taken into consideration. I would look at the support behind the tool and the strength of the company that provides it. It's a horrible feeling when you try to contact support for a tool you use and the answer is that the company no longer exists. As HAP is maintained by the developer community, I would rather trust it.
Answer by Simon
There is also AngleSharp
AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code. Also current features such as querySelector or querySelectorAll work for tree traversal.
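As a rough illustration of the querySelectorAll-style traversal mentioned above, the sketch below parses markup that omits optional closing tags and selects elements with a CSS selector. It assumes the AngleSharp NuGet package in a version that provides HtmlParser.ParseDocument (older releases named this method differently); the class name ItemReader and the sample markup are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser; // assumption: a recent AngleSharp NuGet package is installed

// Hypothetical helper class, for illustration only.
public static class ItemReader
{
    // Returns the text of every element matched by the CSS selector.
    public static List<string> SelectText(string html, string cssSelector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html); // builds an HTML5 DOM

        return document.QuerySelectorAll(cssSelector)
                       .Select(element => element.TextContent)
                       .ToList();
    }

    public static void Main()
    {
        // The </li> end tags are optional in HTML5 and omitted here on purpose.
        string html = "<ul><li class=item>One<li class=item>Two<li>Three</ul>";
        foreach (string text in SelectText(html, "li.item"))
            Console.WriteLine(text);
    }
}
```

Because the parser follows the HTML5 tree-construction rules, each li is closed implicitly when the next one starts, so the selector sees three separate list items, two of which carry the class.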