C#: How do I parse an HTML string for image tags to get the SRC information?

Question

Asked by Roberto Bonini

Currently I use the .NET WebBrowser.Document.Images() to do this. It requires the WebBrowser to load the document, which is messy and takes up resources.

According to this question, XPath is better than a regex at this.

Anyone know how to do this in C#?

Answer 1

Accepted answer by mathieu

If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.

Otherwise you can try this function, which returns all image links from htmlSource:

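using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Returns the src value of every <img> tag found in htmlSource.
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string src = m.Groups[1].Value;
        // new Uri(src) throws on relative paths such as "images/a.png",
        // so accept both relative and absolute URIs.
        if (Uri.TryCreate(src, UriKind.RelativeOrAbsolute, out Uri uri))
        {
            links.Add(uri);
        }
    }
    return links;
}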

And you can use it like this:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
// Dispose the response and the reader once the page has been read.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
        }
    }
}

Answer 2

Answered by Khoth

If it's valid xhtml, you could do this:

XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/@src");
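
Each node the query returns is an attribute node, so its Value holds the src string. A minimal sketch of reading those values, assuming the input really is well-formed XML; the XhtmlImgDemo class, the GetImageSources helper, and the sample markup are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlImgDemo
{
    // Hypothetical helper: returns the src attribute of every <img> element.
    // LoadXml throws XmlException if the markup is not well-formed XML.
    public static List<string> GetImageSources(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        List<string> sources = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//img/@src"))
        {
            sources.Add(attr.Value);
        }
        return sources;
    }

    public static void Main()
    {
        // Sample input invented for this example.
        string xhtml = "<div><img src=\"a.png\" /><p><img src=\"b.jpg\" alt=\"b\" /></p></div>";
        foreach (string src in GetImageSources(xhtml))
        {
            Console.WriteLine(src);
        }
    }
}
```

One caveat: if the document declares the XHTML default namespace (xmlns="http://www.w3.org/1999/xhtml"), the unprefixed //img matches nothing, and the namespace has to be registered through an XmlNamespaceManager passed to SelectNodes.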

Answer 3

Answered by rslite

If all you need is images I would just use a regular expression. Something like this should do the trick:

Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
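
To pull the values out, read capture group 1 of each match. A rough sketch using the same pattern; the ImgRegexDemo class, the GetImageSources helper, and the sample HTML are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgRegexDemo
{
    // Same pattern as in the answer: group 1 captures a double-quoted src value.
    private static readonly Regex ImgSrc =
        new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);

    // Hypothetical helper: collects group 1 from every match.
    public static List<string> GetImageSources(string html)
    {
        List<string> sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }

    public static void Main()
    {
        // Sample input invented for this example.
        string html = "<p><IMG alt=\"x\" SRC=\"one.gif\"><img src=\"two.png\"></p>";
        foreach (string src in GetImageSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

This pattern only handles double-quoted values, and because . does not match newlines without RegexOptions.Singleline, it misses tags whose src sits on a different line from <img. An <img> with no src at all can also cause the lazy .*? to run on into a later tag's src, so treat it as a quick approach for markup you control.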

Answer 4

Answered by Paul Mrozowski

The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there; how much of it is really well formed? I needed to do something similar: parse out all the links in a document and (in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).

Here's a snippet for iterating over links in a document:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
// Select every <a> element that carries an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null when nothing matches, so run only if there are links.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
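
The same approach can be pointed at the image tags from the question. A sketch under the assumption that the HtmlAgilityPack NuGet package is referenced; the HapImgDemo class, the GetImageSources helper, and the sample HTML are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class HapImgDemo
{
    // Hypothetical helper: returns the src value of every <img> that has one.
    public static List<string> GetImageSources(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parses a string; Load reads from a file path

        List<string> sources = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
        {
            foreach (HtmlNode img in imgNodes)
            {
                sources.Add(img.GetAttributeValue("src", string.Empty));
            }
        }
        return sources;
    }

    public static void Main()
    {
        // Deliberately sloppy sample markup: an unclosed <p> and no self-closing tags.
        string html = "<div><img src=\"a.png\"><p>unclosed<img src=\"b.jpg\"></div>";
        foreach (string src in GetImageSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Unlike the XmlDocument route, this does not require the input to be well-formed, which is the main reason to reach for the library.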

Original Blog Post
