Parse HTML links using C#
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/122856/
Asked by Shaun Bowe
Is there a built-in DLL that will give me a list of links from a string? I want to send in a string with valid HTML and have it parse all the links. I seem to remember there being something built into either .NET or an unmanaged library.
I found a couple of open source projects that looked promising, but I thought there was a built-in module. If not, I may have to use one of those. I just didn't want an external dependency at this point if it wasn't necessary.
Accepted answer by Forgotten Semicolon
SubSonic.Sugar.Web.ScrapeLinks seems to do part of what you want; however, it grabs the HTML from a URL rather than from a string. You can check out their implementation here.
Answer by Armin Ronacher
Google gives me this module: http://www.majestic12.co.uk/projects/html_parser.php
Seems to be an HTML parser for .NET.
Answer by Harper Shelby
A simple regular expression:
@"<a.*?>"
passed in to Regex.Matches should do what you need. That regex may need a tiny bit of tweaking, but it's pretty close, I think.
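As a minimal sketch of the suggestion above, the pattern can be passed to Regex.Matches like this. The class and method names (AnchorTagFinder, FindAnchorTags) are hypothetical, and the `\b` plus the IgnoreCase and Singleline options are an assumed version of the "tweaking" the answer mentions, not something the answer itself specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET or of the original answer.
static class AnchorTagFinder
{
    // Returns the raw opening <a ...> tags found in the input, in document order.
    public static List<string> FindAnchorTags(string html)
    {
        var tags = new List<string>();
        // \b stops the pattern from also matching tags such as <abbr> or <article>.
        // Singleline lets '.' cross line breaks, so a tag split over two lines still matches.
        foreach (Match m in Regex.Matches(html, @"<a\b.*?>", RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            tags.Add(m.Value);
        }
        return tags;
    }
}
```

This only returns the opening tags as text; pulling the URL out of each tag is a separate step.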
Answer by Brian Lyttle
I don't think there is a built-in library, but the Html Agility Pack is popular for what you want to do.
The way to do this with the raw .NET Framework and no external dependencies would be to use a regular expression to find all the 'a' tags in the string. You would probably need to take care of a lot of edge cases, e.g. href = "http://url" vs. href=http://url.
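To illustrate the edge cases mentioned above, here is a rough sketch of a pattern that accepts double-quoted, single-quoted, and unquoted href values, with optional spaces around the equals sign. The names HrefExtractor and ExtractHrefs are made up for this example, and the pattern is an assumption about what "handling the edge cases" might look like; it will still be fooled by unusual markup, such as a `>` inside a quoted attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class HrefExtractor
{
    // Three alternatives for the attribute value: "double quoted", 'single quoted', or unquoted.
    // .NET allows the same group name ("url") to appear in several alternatives.
    // (?<=\s) requires whitespace right before href, so data-href is not picked up.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?(?<=\s)href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns the href values of all anchor tags, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Each new edge case tends to add another alternative to the pattern, which is a large part of why a real HTML parser is usually suggested once the input is not under your control.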
Answer by Jacob Proffitt
I'm not aware of anything built in, and from your question it's a little bit ambiguous what exactly you're looking for. Do you want the entire anchor tag, or just the URL from the href attribute?
If you have well-formed XHTML, you might be able to get away with using an XmlReader and an XPath query to find all the anchor tags (<a>) and then read the href attribute for the address. Since well-formed input is unlikely, you're probably better off using RegEx to pull out what you want.
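For the well-formed XHTML case, a sketch along these lines might work. It uses XmlDocument with an XPath query rather than the XmlReader the answer mentions, purely to keep the example short. The names XhtmlLinkReader and ReadHrefs are hypothetical, and the sketch assumes the input has no DOCTYPE declaration and no HTML-only entities such as `&nbsp;`, either of which can make XML parsing fail.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only.
static class XhtmlLinkReader
{
    // Returns the href values of all <a> elements, in document order.
    public static List<string> ReadHrefs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

        var hrefs = new List<string>();
        // local-name() sidesteps the XHTML default namespace, which a plain //a would not match.
        foreach (XmlNode attr in doc.SelectNodes("//*[local-name()='a']/@href"))
        {
            hrefs.Add(attr.Value);
        }
        return hrefs;
    }
}
```

Because LoadXml throws on anything that is not well-formed, callers would likely want to catch XmlException and fall back to another approach for ordinary HTML.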
Using RegEx, you could do something like:
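using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<Uri> findUris(string message)
{
    // Verbatim string: backslashes reach the regex engine as written, and "" is a literal quote.
    string anchorPattern = @"<a[\s]+[^>]*?href[\s]?=[\s""']+(?<href>.*?)[""']+.*?>(?<fileName>[^<]+|.*?)?</a>";
    MatchCollection matches = Regex.Matches(message, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
    if (matches.Count > 0)
    {
        List<Uri> uris = new List<Uri>();
        foreach (Match m in matches)
        {
            // The named group in the pattern is "href"; there is no "url" group.
            string url = m.Groups["href"].Value;
            Uri testUri = null;
            // Keep only values that parse as a relative or absolute URI.
            if (Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out testUri))
            {
                uris.Add(testUri);
            }
        }
        return uris;
    }
    // No anchor tags matched.
    return null;
}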
Note that I'd want to check the href to make sure that the address actually makes sense as a valid Uri. You can eliminate that if you aren't actually going to be pursuing the link anywhere.