How do you parse an HTML string for image tags to get at the SRC information?

Time: 2020-03-06 14:46:42  Source: igfitidea

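Currently I use .NET's WebBrowser.Document.Images() to do this. It requires the WebBrowser control to load the document, which is cumbersome and eats resources.

According to this question, XPath is better than a regular expression for this.

Does anyone know how to do this in C#?

Solutions

If it is valid XHTML, you can do the following: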

// Requires: using System.Xml;
XmlDocument doc = new XmlDocument();
doc.LoadXml(html); // throws XmlException if html is not well-formed XML
XmlNodeList results = doc.SelectNodes("//img/@src");
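To read the values out of the node list, something like the following might do. It is only a sketch: the class and method names (XhtmlImages, GetSources) are made up for illustration, and it assumes the input has no DOCTYPE or HTML-only entities such as &nbsp;. It matches on local-name() because real XHTML usually declares a default namespace, and a plain "//img" finds nothing in that case.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlImages
{
    // Returns the src value of every img element, or an empty list
    // when the input is not well-formed XML.
    public static List<string> GetSources(string xhtml)
    {
        var sources = new List<string>();
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(xhtml);
        }
        catch (XmlException)
        {
            // Not well-formed: the caller should fall back to another parser.
            return sources;
        }

        // local-name() ignores the namespace the element lives in.
        XmlNodeList nodes = doc.SelectNodes("//*[local-name()='img']/@src");
        foreach (XmlNode node in nodes)
        {
            sources.Add(node.Value);
        }
        return sources;
    }
}
```

Returning an empty list on a parse failure is a design choice for the sketch; rethrowing may suit you better if you need to tell "no images" apart from "not XHTML".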

If all you need is the images, I would just use a regular expression. Something like this should do the trick:

Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
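To actually pull the values out with that pattern, something along these lines should work. Treat it as a sketch: ImgRegex and GetSources are names invented for the example, and the pattern is the one above, so it only sees double-quoted src attributes that sit on the same line as the <img.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgRegex
{
    // Same pattern as above; group 1 is the text between the quotes.
    private static readonly Regex Rg =
        new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);

    // Returns the captured src value of every match, in document order.
    public static List<string> GetSources(string html)
    {
        var sources = new List<string>();
        foreach (Match m in Rg.Matches(html))
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```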

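If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that is not always the case.

Otherwise, you can try this function, which returns all image links from the HTML source: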

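// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    // Group 1 captures the src value; the value must be followed by a quote or a space.
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        // new Uri(href) throws UriFormatException on relative paths, so accept both kinds.
        Uri link;
        if (Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out link))
        {
            links.Add(link);
        }
    }
    return links;
}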

You can use it like this:

// Requires: using System.Collections.Generic; using System.IO; using System.Net;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
// Dispose the response so the underlying connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
        }
    }
}
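One thing to watch when you fetch a page like this: src values are frequently relative (images/logo.png, /b.png), and a relative value has to be combined with the page address before you can download anything. A rough sketch of that step, assuming you already have the raw src strings; LinkResolver and ToAbsolute are hypothetical names, not part of the code above:

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Resolves each src value against the address of the page it came from.
    // Values that are already absolute are returned unchanged; values that
    // cannot be parsed at all are skipped.
    public static List<Uri> ToAbsolute(Uri baseUri, IEnumerable<string> srcValues)
    {
        var result = new List<Uri>();
        foreach (string src in srcValues)
        {
            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

The base address would normally be the URL you requested (or response.ResponseUri after redirects).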

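The big issue with any HTML parsing is the "well formed" part. You have seen the junk HTML out there; how much of it is really well formed? I needed to do something similar: parse out all the links in a document (in my case) and update them with rewritten links. I found the Html Agility Pack on CodePlex. It rocks (and handles malformed HTML).

Here is a snippet for iterating over the links in a document: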

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
// Select every <a> element that carries an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null (not an empty collection) when nothing matches.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
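The same approach carries over to the original question about img tags. Here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; HapImages and GetSources are names picked for the example. It loads the HTML from a string with LoadHtml rather than from a file, so no WebBrowser control or disk access is involved.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HapImages
{
    // Returns the src value of every img element that has one.
    public static List<string> GetSources(string html)
    {
        var sources = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return sources; // no matching elements
        }

        foreach (HtmlNode node in nodes)
        {
            sources.Add(node.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```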

Original blog post
