C# - Best Approach to Parsing Webpage?

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original question: http://stackoverflow.com/questions/300252/

Date: 2020-08-03 22:15:54  Source: igfitidea

c# html xml html-content-extraction

Asked by MattSayar

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

Accepted answer by NotMe

Regular expressions are one way to do it, but they can be problematic.

Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, then there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.

Answer by Joel Coehoorn

You might have more luck using xml if you know the document is at least well-formed, or can fix it to be. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
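As a rough illustration of the well-formed case described above, the sketch below parses an XHTML string with LINQ to XML and collects the href attribute of every `a` element. The class and method names (`WellFormedHrefs`, `ExtractHrefs`) are made up for this example, and it assumes the input really is well-formed XML; typical real-world HTML will make `XDocument.Parse` throw.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of any library.
public static class WellFormedHrefs
{
    public static List<string> ExtractHrefs(string xhtml)
    {
        // Throws System.Xml.XmlException when the input is not well-formed.
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Descendants()
                  // Compare local names so XHTML-namespaced <a> elements match too.
                  .Where(e => e.Name.LocalName == "a")
                  .Select(e => (string)e.Attribute("href"))
                  // Anchors without an href attribute yield null; drop them.
                  .Where(href => href != null)
                  .ToList();
    }
}
```

Calling `WellFormedHrefs.ExtractHrefs(pageString)` inside a try/catch for `XmlException` is one way to find out quickly whether a given page is clean enough for this route.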

On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:

  • Do you know if the "href" text will always be lower case?
  • Do you know if it will always use double quotes, single quotes, or nothing around the url?
  • Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
  • Is it possible to work with a document where the content describes html features (i.e., href= could also appear in the document text and not belong to an anchor tag)?
  • What else can you tell us about the document?
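To make those ground rules concrete, here is a minimal sketch of a regex-based extractor that takes the pessimistic answer to the first three questions: any letter case, any quoting style, and values such as '#' or javascript: that are not real links. The names `HrefSketch` and `ExtractHrefs` and the pattern itself are assumptions made for this example, not something from the answer. It does nothing about the fourth question: it will also pick up href= in plain text or on non-anchor tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HrefSketch
{
    // Matches href="x", href='x', or unquoted href=x, in any letter case.
    // Illustrative pattern only; it is not a full HTML attribute parser.
    private static readonly Regex HrefPattern = new Regex(
        @"\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractHrefs(string html)
    {
        var results = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            string url = m.Groups["url"].Value.Trim();

            // Skip empty values, in-page fragments, and javascript: pseudo-links.
            if (url.Length == 0 ||
                url.StartsWith("#", StringComparison.Ordinal) ||
                url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }
            results.Add(url);
        }
        return results;
    }
}
```

Each question answered more favourably lets you delete part of this: if the quoting style is known, for instance, two of the three alternatives in the pattern can go.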

Answer by JasonTrue

Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php

There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.

I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

Answer by Tim Jarvis

I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

From here on RegExLib should get you started.
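For anyone who wants to try that pattern from C#, a small sketch follows. The pattern is copied unchanged into a verbatim string; the class and method names (`RegExLibPatternDemo`, `FindHrefAttributes`) are invented for this example. Note that, as written, the pattern is case-sensitive, does not accept https:// URLs or hyphens in host names, and does not require the opening and closing quote characters to match, so treat it as a starting point.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper for illustration only.
public static class RegExLibPatternDemo
{
    // The pattern from the answer, copied as-is into a verbatim string
    // (each " is doubled, which is how quotes are escaped in @"..." literals).
    private const string Pattern =
        @"href=[\""\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\""\']";

    // Returns each full match, including the leading href= and the quotes.
    public static List<string> FindHrefAttributes(string html)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

To get just the URL instead of the whole attribute, one option is to wrap everything between the two quote classes in a capturing group and read that group from each `Match`.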

Answer by Duncan

For dealing with HTML of all shapes and sizes I prefer to use the Html Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPaths against the nodes you want and get them returned in a collection.

Answer by Jeff Donnici

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

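HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html is assumed to be the string holding your page
HtmlNodeCollection nodes = yourDoc.DocumentNode.SelectNodes("your_xpath"); // e.g. "//a[@href]"; null when nothing matches
int someCount = nodes == null ? 0 : nodes.Count;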

Answer by Dimitre Novatchev

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:

Here are three good tools:

  1. TagSoup, an open-source program, is a Java- and SAX-based tool, developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
    SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
    Download the zip file including the standalone executable and the full source code: SgmlReader.zip

  3. An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for every one of us.

From the description:

"d:htmlparse(string)
 d:htmlparse(string,namespace,html-mode)

  The one argument form is equivalent to
  d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())

  Parses the string as HTML and/or XML using some inbuilt heuristics to
  control implied opening and closing of elements.

  It doesn't have full knowledge of HTML DTD but does have full list of
  empty elements and full list of entity definitions. HTML entities, and
  decimal and hex character references are all accepted. Note html-entities
  are recognised even if html-mode=false().

  Element names are lowercased (if html-mode is true()) and placed into the
  namespace specified by the namespace parameter (which may be "" to denote
  no-namespace) unless the input has explicit namespace declarations, in
  which case these will be honoured.

  Attribute names are lowercased if html-mode=true()
"

Read a more detailed description here.

Hope this helped.

Cheers,

Dimitre Novatchev.

Answer by Frank Schwieterman

I've linked some code here that will let you use "LINQ to HTML"...

Looking for C# HTML parser
