C#: regular expression for finding the 'href' value of an <a> link
Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/15926142/
Asked by MrRolling
I need a regex pattern for finding web page links in HTML.
I first use @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't fetch the href from that.
My strings are:
1. <a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
4. <a href="www.example.com/page.php/404" ....></a>
Numbers 1, 2 and 3 are valid and I need them, but number 4 is not valid for me (the ? and = are essential).
Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.
I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5
Thanks!
Accepted answer by plalx
I'd recommend using an HTML parser over a regex, but here's a regex that will create a capturing group over the value of the href attribute of each link. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex here.
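Since the question is about C#, here is a minimal sketch of how the pattern above might be used with System.Text.RegularExpressions. The class and method names (HrefExtractor, ExtractHrefs) are made up for illustration and are not part of the original answer. Group 1 holds the opening quote, the \1 backreference requires the same quote to close the value, and group 2 holds the href itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // In a verbatim string, "" stands for one double quote.
    // Group 1 = opening quote, \1 = the same quote again, group 2 = href value.
    private static readonly Regex LinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href value of every <a> tag found in the given HTML fragment.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in LinkRx.Matches(html))
        {
            hrefs.Add(m.Groups[2].Value);
        }
        return hrefs;
    }
}
```

Calling HrefExtractor.ExtractHrefs on a fragment that mixes double-quoted and single-quoted href attributes should return both values, in document order.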
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
console.log(textToMatchInput.value.match(linkRx));
});
<label>
Text to match:
<input type="text" name="textToMatch" value='<a href="google.com"'>
<button>Match</button>
</label>
Answer by Freelancer
Try this regex (written as a verbatim string so that \s and \S are not treated as string escapes):
@"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))"
You will get more help from the discussions at:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope it's helpful.
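As a rough sketch of how the pattern from this answer could be applied, the snippet below collects group 1 from every match. LooseHrefScanner and ScanHrefs are hypothetical names. Both alternatives write into the same group number, which .NET allows, so the value is read from Groups[1] whether or not it was quoted. Note that the unquoted alternative \S+ simply runs up to the next whitespace character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LooseHrefScanner
{
    // Either a double-quoted value or a run of non-whitespace characters.
    // Both alternatives capture into group 1.
    private static readonly Regex HrefRx = new Regex(
        @"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))",
        RegexOptions.IgnoreCase);

    // Returns every href value found, quoted or unquoted.
    public static List<string> ScanHrefs(string html)
    {
        var values = new List<string>();
        foreach (Match m in HrefRx.Matches(html))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

This version tolerates whitespace around the equals sign, but it does not check that the attribute belongs to an <a> tag.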
Answer by KF2
Try this:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        // html is the input string shown under "Input" below.
        var res = Find(html);
    }

    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1. Find all <a>...</a> matches in the input.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2. Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3. Get the href attribute (double-quoted values only).
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4. Remove inner tags from the link text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }

    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href + "\n\t" + Text;
        }
    }
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.
Edited
Another easy way: you can use a WebBrowser control to get the href from each a tag, like this (see my example):
public Form1()
{
    InitializeComponent();
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}

private void Form1_Load(object sender, EventArgs e)
{
    webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    List<string> href = new List<string>();
    foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
    {
        href.Add(el.GetAttribute("href"));
    }
}
Answer by Anirudha
Using regex to parse HTML is not recommended.
Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.
Use an HTML parser like HtmlAgilityPack.
You can use this code to retrieve all the hrefs in anchor tags using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var hrefList = doc.DocumentNode.SelectNodes("//a")
.Select(p => p.GetAttributeValue("href", "not found"))
.ToList();
hrefList contains all the hrefs.
Answer by MrRolling
Thanks everyone (especially @plalx)
I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
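To make the query-string variant concrete, here is a small sketch that runs it over HTML like the four sample links from the question. QueryLinkFinder and FindQueryLinks are illustrative names only. The pattern requires a literal ? with at least one character on each side inside the double-quoted href, so a link such as page.php/404 is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryLinkFinder
{
    // Captures the href value only when it contains a '?' with text around it.
    private static readonly Regex QueryLinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=""([^""]+\?[^""]+)""",
        RegexOptions.IgnoreCase);

    // Returns the href values that carry a query string.
    public static List<string> FindQueryLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in QueryLinkRx.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

This only checks for the question mark. If the equals sign is also mandatory, the captured values can be filtered once more afterwards.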
My final regex string:
First, use one of these:
st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+[\w\d:#@%/;$()~_?\+-=\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)";
st = @"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\+&%$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
My choice is:
@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second, use this (the ? is escaped so that it matches a literal question mark instead of acting as a quantifier):
st = @"(.*)\?(.*)=(.*)";
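For the second step, here is a sketch of filtering a list of already extracted URLs so that only those with a literal ? followed later by = remain. QueryUrlFilter and KeepQueryUrls are hypothetical names. In a regex an unescaped ? is a quantifier, so the question mark is written as \? here, and the pattern is shortened to \?.*= because the surrounding groups are not needed for a yes/no test.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QueryUrlFilter
{
    // A literal '?' followed, somewhere later in the string, by '='.
    private static readonly Regex QueryRx = new Regex(@"\?.*=");

    // Keeps only the URLs that look like page.php?id=5.
    public static List<string> KeepQueryUrls(IEnumerable<string> urls)
    {
        return urls.Where(u => QueryRx.IsMatch(u)).ToList();
    }
}
```

A URL such as www.example.com/page.php/404 is dropped, and so is one where = appears only before the ?.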
Problem solved. Thanks, everyone :)
Answer by Joee
HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
Simply try this code.
Answer by Base33
I came up with this one, which supports anchor and image tags as well as single and double quotes.
<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=[\"']([^\"']*)['\"]
So
<a href="/something.ext">click here</a>
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
The same goes for img src attributes.
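A possible C# rendering of this idea is sketched below, with TagUrlExtractor and ExtractUrls as made-up names. Square brackets in a regex define a character class rather than a choice between words, so this sketch writes the tag and attribute names as the non-capturing alternations (?:a|img) and (?:src|href).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagUrlExtractor
{
    // Matches <a ...> or <img ...> and captures the quoted href/src value in group 1.
    private static readonly Regex UrlRx = new Regex(
        @"<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Returns the href or src value of each anchor and image tag.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRx.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Because the value is captured with [^"']*, a URL that itself contains a quote character would be cut short. For anything beyond simple fragments, an HTML parser is the more robust choice.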