How to get all the URLs in a web site using JavaScript?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/3824208/
Asked by netha
Does anyone know a way to get all the URLs in a website using JavaScript?
I only need the links that start with the same domain name; other links can be ignored.
Answered by bobince
Well, this will get all the same-host links on the page:
var urls = [];
for (var i = document.links.length; i --> 0;) {
    if (document.links[i].hostname === location.hostname) {
        urls.push(document.links[i].href);
    }
}
If by "site" you mean you want to recursively get the links inside linked pages, that's a bit trickier. You'd have to download each link into a new document (for example in an <iframe>), and then, in its onload handler, check the iframe's own document for more links to add to the list to fetch. You'd need to keep a lookup of the URLs you've already spidered to avoid fetching the same document twice. It probably wouldn't be very fast.
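The recursive approach described above can be sketched roughly as follows. This is only an illustration under assumptions: `crawlSameHost`, `getLinks`, `maxPages`, and the `fakeSite` data are hypothetical names that do not appear in the answer. `getLinks(url)` stands in for whatever actually loads a page and reads its links (an <iframe> with an onload handler, for example), and it is replaced here by a small in-memory map so the sketch can run anywhere.

```javascript
// Breadth-first crawl of same-host pages, with an injected page loader.
// getLinks(url) is a hypothetical async function that resolves to the array
// of href strings found on that page.
async function crawlSameHost(startUrl, getLinks, maxPages = 100) {
  const start = new URL(startUrl);
  start.hash = "";
  const visited = new Set([start.href]); // lookup of URLs already queued
  const queue = [start.href];
  let fetched = 0;

  while (queue.length > 0 && fetched < maxPages) {
    const pageUrl = queue.shift();
    fetched++;
    let hrefs;
    try {
      hrefs = await getLinks(pageUrl);
    } catch (e) {
      continue; // skip pages that fail to load
    }
    for (const href of hrefs) {
      let url;
      try {
        url = new URL(href, pageUrl); // resolve relative links
      } catch (e) {
        continue; // skip strings that are not valid URLs
      }
      url.hash = ""; // "#fragment" variants point at the same document
      if (url.hostname === start.hostname && !visited.has(url.href)) {
        visited.add(url.href);
        queue.push(url.href);
      }
    }
  }
  // Every same-host URL discovered, including any left unfetched
  // because maxPages was reached.
  return [...visited];
}

// Usage with a made-up site, so no network access is needed.
const fakeSite = {
  "https://example.com/": ["/a", "/b", "https://elsewhere.test/"],
  "https://example.com/a": ["/", "/b#top"],
  "https://example.com/b": ["/c"],
  "https://example.com/c": [],
};
crawlSameHost("https://example.com/", async (url) => fakeSite[url] || [])
  .then((urls) => console.log(urls));
```

The `visited` set is the lookup the answer mentions: a URL is added to it the moment it is queued, so no document is fetched twice even when several pages link to it.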
Answered by SColvin
Or in ES6:
[...document.links].map(l => l.href)
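Note that the one-liner returns every link on the page, not only the same-host ones the question asks for. A minimal sketch of adding that filter is shown here as a pure function, so it can also be tried outside a browser. `sameHostUrls` and its parameters are hypothetical names introduced for illustration; in a page you might pass `[...document.links].map(l => l.href)` and `location.href`.

```javascript
// Hypothetical helper: keep only the URLs whose hostname matches baseUrl's,
// dropping duplicates. hrefs may contain absolute or relative URL strings.
function sameHostUrls(hrefs, baseUrl) {
  const base = new URL(baseUrl);
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative links against baseUrl
    } catch (e) {
      continue; // skip strings that cannot be parsed as URLs
    }
    if (url.hostname === base.hostname) {
      seen.add(url.href);
    }
  }
  return [...seen];
}

const sample = [
  "https://example.com/about",
  "/contact",
  "https://other.example.org/",
  "https://example.com/about",
];
console.log(sameHostUrls(sample, "https://example.com/index.html"));
// [ 'https://example.com/about', 'https://example.com/contact' ]
```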
Answered by Craig Gjerdingen
JavaScript to extract (and display) the domains, URLs, and links from a page. The "for(var i = document.links.length; i --> 0;)" method gives you a good collection to work with. Here is an example that pulls the links from specific parts of the HTML page.
You could alter it to select and filter whatever you want, and then use the list however you want. I wanted to show a working example.
// Run this after the page has loaded, so that the <div id="domains"> element exists.
// Splits a URL into parts; group 5 is displayed as "Domain" and group 3 as "Website".
var re = /^((http[s]?|ftp|mailto):(?:\/\/)?)?\/?(([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,4})?(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/i;

// Build the <ul> that will hold one <li> per link.
var printList = document.getElementById("domains");
var domainsList = document.createElement("ul");
domainsList.setAttribute("id", "domainsList");
domainsList.setAttribute("class", "list-group");
printList.appendChild(domainsList);

// Snapshot the live collection of <a> elements into an array.
var listArray = Array.from(document.getElementsByTagName("a"));

listArray.forEach(function (link) {
    var matchGroup = link.href.match(re);
    if (!matchGroup) {
        return; // skip hrefs the pattern cannot parse
    }

    var listItem = document.createElement("li");
    listItem.setAttribute("class", "list-group-item domain");

    listItem.appendChild(document.createTextNode("Domain: " + matchGroup[5]));
    listItem.appendChild(document.createElement("br"));
    listItem.appendChild(document.createTextNode("Website: " + matchGroup[3]));
    listItem.appendChild(document.createElement("br"));
    listItem.appendChild(document.createTextNode("Full Link: " + matchGroup[0]));

    domainsList.appendChild(listItem);
});
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Pull Domains from a page</title>
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
</head>
<body>
<div class="card-deck">
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="https://www.youtube.com/watch?v=f9B_1Ac5jnc">Link 1</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://www.apple.com">Link 2</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://www.cnn.com.au">Link 3</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://downloads.news.com.au">Link 4</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://ftp.android.co.nz">Link 5</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://global.news.ca">Link 6</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="https://www.apple.com">Link 7</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="https://mira.mx/">Link 8</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://www.qs.com/">Link 9</a></div></div>
<div class="card mb-3" style="min-width: 10rem;"><div class="card-body"><a href="http://pbs.org">Link 10</a></div></div>
</div>
<div id="domains"></div>
</body>
</html>
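As an alternative to the hand-written regular expression, the built-in URL parser can split a link into its parts. This is a different technique from the one in the answer, offered only as a sketch: `describeLink` is a hypothetical helper name, and it does not try to separate the registrable domain from its subdomains, which in general needs a list of public suffixes.

```javascript
// Hypothetical helper: describe an absolute link using the built-in URL parser.
// Throws a TypeError if href is not a valid absolute URL.
function describeLink(href) {
  const url = new URL(href);
  return {
    protocol: url.protocol, // scheme including the trailing colon
    hostname: url.hostname, // host name without the port
    path: url.pathname + url.search,
    full: url.href, // normalized full URL
  };
}

const info = describeLink("https://www.youtube.com/watch?v=f9B_1Ac5jnc");
console.log(info.hostname); // "www.youtube.com"
console.log(info.path);     // "/watch?v=f9B_1Ac5jnc"
```

In a page, `link.href` from each <a> element could be passed to the helper in place of the regular expression match.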
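Answered by Muhammad Adeel Zahid
Using jQuery, you can find all the links on the page that match specific criteria:
// "domain.com" is a placeholder. The ^= operator matches the start of the href attribute as written, so include the scheme.
$("a[href^='http://domain.com']").each(function () {
    alert($(this).attr("href"));
});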

