C#: How do I get all the links on an HTML page?

Question

Asked by maxp

I'm working on a little hobby project. I have already written the code to take a URL, download the header, and return the MIME type / content type.

However, the step before this is the one I'm stuck on: I need to retrieve all the URLs on the page that appear inside a tag and in quotes, i.e.

...
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
...

Would find the favicon link.

Is there anything helpful in the .NET library, or is this going to have to be a case for regex?

Answer 1

Accepted answer by womp

I'd look at using the Html Agility Pack.

Here's an example straight from their examples page on how to find all the links in a page:

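 // Requires the HtmlAgilityPack package: using HtmlAgilityPack;
 HtmlWeb hw = new HtmlWeb();
 HtmlDocument doc = hw.Load(url); // url: address of the page to scan
 // SelectNodes returns null when nothing matches, so guard before iterating.
 var links = doc.DocumentNode.SelectNodes("//a[@href]");
 if (links != null)
 {
     foreach (HtmlNode link in links)
     {
         Console.WriteLine(link.GetAttributeValue("href", string.Empty));
     }
 }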

Answer 2

Answered by Andrew Hare

There isn't anything built into the BCL, but fortunately you can use the HTML Agility Pack to accomplish this task quite simply.

As for your specific problem, please see "Easily extracting links from a snippet of html with HtmlAgilityPack":

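// Returns the href value of every <a> element that has one.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}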

Answer 3

Answered by SLaks

You need to use the HTML Agility Pack.

For example:

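// Requires: using System; using System.Linq; using HtmlAgilityPack;
var doc = new HtmlWeb().Load(url);
// All <link> elements (stylesheets, favicons, and so on).
var linkTags = doc.DocumentNode.Descendants("link");
// The non-empty href values of all <a> elements.
var linkedPages = doc.DocumentNode.Descendants("a")
                                  .Select(a => a.GetAttributeValue("href", null))
                                  .Where(u => !String.IsNullOrEmpty(u));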

Answer 4

Answered by GRUNGER

How about Regex?

<(a|link).*?href=(\"|')(.+?)(\"|').*?>

with the flags IgnoreCase and Singleline

See the regex.matches demo on systemtextregularexpressions.com.
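
As a rough sketch of how that pattern might be applied in C#, the snippet below runs it over the `<link>` tag from the question and reads the URL from capture group 3. The names `LinkExtractor`, `ExtractHrefs`, and `Program` are made up for illustration and do not come from the answer; only `Regex`, `RegexOptions`, and `Match` are part of the .NET base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class LinkExtractor
{
    // The pattern from the answer. Group 2 is the opening quote,
    // group 3 is the href value, group 4 is the closing quote.
    private static readonly Regex HrefPattern = new Regex(
        "<(a|link).*?href=(\"|')(.+?)(\"|').*?>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href value of every match found in the given HTML text.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            hrefs.Add(match.Groups[3].Value);
        }
        return hrefs;
    }
}

static class Program
{
    static void Main()
    {
        // The tag from the question.
        string html =
            "<link rel='shortcut icon' href=\"/static/favicon.ico\" type=\"image/x-icon\" />";

        foreach (string href in LinkExtractor.ExtractHrefs(html))
        {
            Console.WriteLine(href); // prints /static/favicon.ico
        }
    }
}
```

Note that this pattern is fairly loose: `<(a|link)` also matches any tag whose name merely starts with "a" (such as `<abbr>`), and with Singleline the lazy `.*?` can run past the end of a tag that has no href and pick up an href from a later tag. For anything beyond a quick scan, the HTML Agility Pack approach from the other answers is likely to be more reliable.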
