How do you parse an HTML string for image tags to get at the SRC information in C#?
Source: http://stackoverflow.com/questions/138839/
Note: this is a question from Stack Overflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me): Stack Overflow.
Asked by Roberto Bonini
Currently I use .NET's WebBrowser.Document.Images to do this. It requires the WebBrowser control to load the document, which is messy and takes up resources.
According to this question, XPath is better than a regex at this.
Anyone know how to do this in C#?
Accepted answer by mathieu
If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.
Otherwise you can try this function, which returns all image links found in htmlSource:
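// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    // Capture the src value whether it is double-quoted, single-quoted, or unquoted.
    string regexImgSrc = @"<img\b[^>]*?\ssrc\s*=\s*[""']?([^'""\s>]+)";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        Uri link;
        // new Uri(href) throws on relative paths, so accept relative or absolute values.
        if (Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out link))
        {
            links.Add(link);
        }
    }
    return links;
}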
And you can use it like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
// Dispose the response so the underlying connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
        }
    }
}
Answer by Khoth
If it's valid xhtml, you could do this:
XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/@src");
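As a rough sketch of how those three lines might be used end to end, the helper below reads each src value out of the selected attribute nodes. The class name XhtmlImages and the method name GetImageSources are hypothetical, made up for illustration. It assumes the input is well-formed XML with no default namespace; a document that declares xmlns="http://www.w3.org/1999/xhtml" would need an XmlNamespaceManager and a prefixed XPath instead.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class XhtmlImages
{
    // Returns the src attribute values of all <img> elements.
    // Assumes well-formed XML with no default namespace;
    // LoadXml throws XmlException on malformed input.
    public static List<string> GetImageSources(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        List<string> sources = new List<string>();
        // "//img/@src" selects the attribute nodes themselves,
        // so Value is the text of the src attribute.
        foreach (XmlNode attr in doc.SelectNodes("//img/@src"))
        {
            sources.Add(attr.Value);
        }
        return sources;
    }
}
```

Calling XhtmlImages.GetImageSources on a well-formed fragment such as <div><img src="a.png" /></div> should give a list with the single entry a.png.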
Answer by rslite
If all you need is images I would just use a regular expression. Something like this should do the trick:
Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
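For illustration, here is one way that pattern might be applied to collect every match. The wrapper class ImgRegex and its method GetImageSources are hypothetical names. Note that the pattern as written only recognises double-quoted src values, and since "." does not match a newline by default, it will miss tags whose src sits on a different line from "<img".

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the answer.
public static class ImgRegex
{
    // Lazily skip from "<img" to src="..." and capture the text between the double quotes.
    private static readonly Regex Rg =
        new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);

    public static List<string> GetImageSources(string html)
    {
        List<string> sources = new List<string>();
        foreach (Match m in Rg.Matches(html))
        {
            // Group 1 is the parenthesised part of the pattern, i.e. the src value.
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```

Because the regex knows nothing about tag boundaries, it is best kept for input whose markup you control; for arbitrary pages a real parser is the safer choice.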
Answer by Paul Mrozowski
The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there - how much of it is really well formed? I needed to do something similar: parse out all links in a document and (in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).
Here's a snippet for iterating over links in a document:
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
// Select every <a> element that has an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches, so run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
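The snippet above walks links, but the same approach can be pointed at the original question. The following is a minimal sketch, assuming the Html Agility Pack package is referenced by the project; the class name HapImages and the method name GetImageSources are hypothetical. It parses an HTML string rather than a file and collects the src value of each img element.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper, not part of the original answer.
public static class HapImages
{
    public static List<string> GetImageSources(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parses a string; tolerant of malformed markup

        List<string> sources = new List<string>();
        // Select only <img> elements that actually carry a src attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
        {
            foreach (HtmlNode img in imgNodes)
            {
                sources.Add(img.GetAttributeValue("src", string.Empty));
            }
        }
        return sources;
    }
}
```

Unlike the XmlDocument route, this does not require the input to be valid XHTML, and unlike the regex route, it handles single-quoted and unquoted attribute values through the parser itself.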