C# regular expression for finding the 'href' value of an <a> link

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/15926142/

Time: 2020-08-10 18:29:55  Source: igfitidea

regular expression for finding 'href' value of a <a> link

c# regex

Asked by MrRolling

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't fetch the href from that.

My strings are:

  1. <a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  4. <a href="www.example.com/page.php/404" ....></a>

Links 1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).



Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.

I need to fetch the href of each link and filter it; the URLs I want must contain ? and =, like page.php?id=5

Thanks!

Accepted answer by plalx

I'd recommend using an HTML parser over a regex, but here's a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

You can view a full explanation of this regex here.
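As a rough C# sketch of how such a pattern might be applied to the question's inputs: it assumes the closing quote is matched with a \1 backreference, so group 2 holds the attribute value. The names HrefExtractor, ExtractHrefs and HasQueryString are made up for illustration and are not part of the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HrefExtractor
{
    // Group 1 captures the opening quote, \1 requires the same quote to close,
    // and group 2 is the href value between them.
    private static readonly Regex LinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1",
        RegexOptions.IgnoreCase);

    // Returns every href value found in the given HTML fragment.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkRx.Matches(html))
        {
            result.Add(m.Groups[2].Value);
        }
        return result;
    }

    // True when the value has a "?" followed later by "=", e.g. page.php?id=5.
    public static bool HasQueryString(string href)
    {
        int q = href.IndexOf('?');
        return q >= 0 && href.IndexOf('=', q + 1) > q;
    }
}
```

For the four sample links in the question, ExtractHrefs would return all four values, and HasQueryString would keep the first three and reject page.php/404.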

Snippet playground:

const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');

document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});
<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  
  <button>Match</button>
 </label>

Answer by Freelancer

Try this regex:

@"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))"
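A small sketch of how this pattern could be used from C#, written as a verbatim string so the backslashes reach the regex engine unchanged. Both alternatives capture into the group named 1, which .NET exposes as Groups[1]. HrefAttributeScanner and Scan are illustrative names only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class HrefAttributeScanner
{
    // First branch: a double-quoted value. Second branch: an unquoted run of
    // non-whitespace characters. Either way the value lands in group 1.
    private static readonly Regex HrefRx = new Regex(
        @"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))",
        RegexOptions.IgnoreCase);

    // Collects the captured href values from a list or fragment of text.
    public static List<string> Scan(string input)
    {
        var values = new List<string>();
        foreach (Match m in HrefRx.Matches(input))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

Since the pattern does not require an <a> tag, it also fits the question's plain list of href="abcdef" entries. A single-quoted value would fall into the \S+ branch and keep its quote characters.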

You will find more help in these discussions:

Regular expression to extract URL from an HTML link

and

Regex to get the link in href. [asp.net]

Hope it's helpful.

Answer by KF2

Try this:

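 using System;
 using System.Collections.Generic;
 using System.Text.RegularExpressions;
 using System.Windows.Forms;

 public partial class Form1 : Form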
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            // "html" is the input string shown under "Input:" below.
            var res = Find(html);
        }

        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }

        public struct LinkItem
        {
            public string Href;
            public string Text;

            public override string ToString()
            {
                return Href + "\n\t" + Text;
            }
        }

    }  

Input:

  string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> "; 

Result:

[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}

C# Scraping HTML Links

Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.

Edited

Another easy way: you can use a WebBrowser control to get the href from each <a> tag, like this (see my example):

 public Form1()
        {
            InitializeComponent();
            webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            List<string> href = new List<string>();
            foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
            {
                href.Add(el.GetAttribute("href"));
            }
        }

Answer by Anirudha

Using regex to parse HTML is not recommended.

Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.

Use an HTML parser like HtmlAgilityPack.

You can use this code to retrieve all the hrefs in anchor tags using HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

var hrefList = doc.DocumentNode.SelectNodes("//a")
                  .Select(p => p.GetAttributeValue("href", "not found"))
                  .ToList();

hrefList contains all the hrefs.

Answer by MrRolling

Thanks everyone (especially @plalx)

I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern, while a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
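The query-string variant can be tried directly in C#. The sketch below assumes double-quoted href values only, as that pattern does, and uses the invented names QueryLinkFilter and FindQueryLinks.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class QueryLinkFilter
{
    // Group 1 only matches values with at least one character on each side
    // of a literal "?", so links without a query string are skipped.
    private static readonly Regex QueryHrefRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=""([^""]+\?[^""]+)""");

    // Returns the href values that contain a query string.
    public static List<string> FindQueryLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in QueryHrefRx.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

Here the filtering happens inside the pattern itself, so no second pass over the extracted values is needed.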




My final regex string:

First, use one of these:

st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+[\w\d:#@%/;$()~_?\+-=\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&amp;\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\~\!\@\#\$\%\^\&amp;\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)";
st = @"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&amp;\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\+&amp;%$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";

My choice is:

@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Second, use this:

st = @"(.*)\?(.*)=(.*)";
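One detail matters for this second step: in a regex, a bare ? is a quantifier, so a literal question mark has to be written as \?. A minimal sketch of the "must contain ? and =" filter, with the made-up names QueryFilter and LooksLikeQueryUrl:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class QueryFilter
{
    // "\?" is a literal question mark; ".*=" then requires an equals sign
    // somewhere after it on the same line.
    private static readonly Regex HasQueryRx = new Regex(@"\?.*=");

    // True for values such as page.php?id=5, false for page.php/404.
    public static bool LooksLikeQueryUrl(string url)
    {
        return HasQueryRx.IsMatch(url);
    }
}
```

By contrast, the unescaped form (.*)?(.*)=(.*) only makes the first group optional, so it matches any text containing =, such as a=b.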



Problem solved. Thanks, everyone :)

Answer by Joee

 HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
 public IHTMLAnchorElement imageElementHref;
 imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;

Simply try this code.

Answer by Base33

I came up with this one, which supports anchor and image tags and both single and double quotes.

<[a|img]+\s+(?:[^>]*?\s+)?[src|href]+=[\"']([^\"']*)['\"]

So

<a href="/something.ext">click here</a>

Will match:

 Match 1: /something.ext

And

<a href='/something.ext'>click here</a>

Will match:

 Match 1: /something.ext

The same goes for img src attributes.
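Inside square brackets, a|img is a character class (any one of a, |, i, m, g) rather than an alternation, so [a|img]+ also accepts made-up tag names such as <mama. A stricter variant could spell out the alternatives with non-capturing groups, as in this sketch; TagUrlExtractor and Extract are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class TagUrlExtractor
{
    // (?:a|img) and (?:src|href) are true alternations; group 1 is the
    // attribute value between single or double quotes.
    private static readonly Regex TagUrlRx = new Regex(
        @"<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Returns the href or src values of <a> and <img> tags in the fragment.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in TagUrlRx.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Like the original, this version does not check that the closing quote is the same character as the opening one.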