C#: Get all links on an HTML page?
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/2248411/
Get all links on html page?
Asked by maxp
I'm working on a little hobby project. I have already written the code to get a URL, download the headers, and return the MIME type / content type.
However, the step before this is the one I'm stuck on: I need to retrieve all the URLs on the page that appear inside a tag, in quotes, i.e.
...
<link rel='shortcut icon' href="/static/favicon.ico" type="image/x-icon" />
...
This should find the favicon link.
Is there anything helpful in the .NET library, or is this going to have to be a case for regex?
Accepted answer by womp
I'd look at using the Html Agility Pack.
Here's an example straight from their examples page on how to find all the links in a page:
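HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Read the target of each anchor.
    string href = link.GetAttributeValue("href", string.Empty);
}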
Answer by Andrew Hare
There isn't anything built into the BCL, but fortunately you can use the HTML Agility Pack to accomplish this task quite simply.
As for your specific problem, please see Easily extracting links from a snippet of html with HtmlAgilityPack:
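private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();
    // SelectNodes may return null when the snippet contains no matching anchors.
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }
    return hrefTags;
}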
Answer by SLaks
You need to use the HTML Agility Pack.
For example:
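// Needs the HtmlAgilityPack package plus using System.Linq;
var doc = new HtmlWeb().Load(url);
// All <link> elements (stylesheets, favicons, ...).
var linkTags = doc.DocumentNode.Descendants("link");
// href values of all <a> elements, skipping missing or empty ones.
var linkedPages = doc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", null))
    .Where(u => !String.IsNullOrEmpty(u));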
Answer by GRUNGER
How about Regex?
<(a|link).*?href=(\"|')(.+?)(\"|').*?>
with the flags RegexOptions.IgnoreCase and RegexOptions.Singleline
See demo on systemtextregularexpressions.com regex.matches
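For what it's worth, here is a minimal sketch of how that pattern might be applied from C# using only System.Text.RegularExpressions. The class name LinkRegexSketch and the helper ExtractHrefs are made up for illustration and are not part of the original answer; the pattern itself is copied as posted, with the third capture group holding the quoted href value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex from the answer (names are illustrative only).
public static class LinkRegexSketch
{
    // Pattern copied from the answer; group 3 captures the quoted href value.
    private static readonly Regex HrefPattern = new Regex(
        @"<(a|link).*?href=(""|')(.+?)(""|').*?>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every href value the pattern finds, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            hrefs.Add(match.Groups[3].Value);
        }
        return hrefs;
    }
}
```

Calling LinkRegexSketch.ExtractHrefs on the favicon snippet from the question should give back /static/favicon.ico. Keep in mind the pattern is fairly loose: it also accepts any tag whose name merely starts with "a", and with Singleline the lazy ".*?" can run past the end of one tag to an href in a later one, so a real parser is likely the safer choice for arbitrary pages.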