bash - Crawl a website using wget and limit the total number of crawled links

Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4973152/

Date: 2020-09-17 23:26:45  Source: igfitidea

Crawl website using wget and limit total number of crawled links

Tags: bash, scripting, web-crawler, wget

Asked by GobiasKoffi

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Accepted answer by Wolph

You can't. wget doesn't support this, so if you want something like this, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.

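A minimal sketch of that manual approach, assuming a Unix shell with wget, grep, sed, and head available; http://www.example.com and the pages/ directory are placeholders, and the crude href extraction will miss or mangle links on real-world markup:

#!/usr/bin/env bash
# Sketch only, not from the original answer: fetch the start page,
# extract href targets, and download at most the first 100 absolute links.
set -u

start_url="http://www.example.com/"

wget -q -O index.html "$start_url"

# Crude link extraction; a real crawler should use an HTML parser instead.
grep -oE 'href="[^"]+"' index.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | grep -E '^https?://' \
  | head -n 100 \
  | while read -r url; do
      wget -q -w 5 -P pages/ "$url"
    done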

You could also take a look at HTTrack for website crawling; it has quite a few extra options for this: http://www.httrack.com/

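For reference, a basic HTTrack invocation looks roughly like the following; the URL and output directory are placeholders, only the recursion depth is limited here, and whether HTTrack can cap the total number of links is best checked in its own manual (httrack --help), so no such flag is shown:

# Sketch only: mirror the site two levels deep into /tmp/mirror.
# -O sets the output path, -r2 the mirror depth, -v enables verbose output.
httrack "http://www.example.com/" -O /tmp/mirror -r2 -v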

Answer by Olivier Delouya

  1. Create a FIFO file (mknod /tmp/httpipe p)
  2. Do a fork
    • in the child, run wget --spider -r -l 1 http://myurl --output-file /tmp/httpipe
    • in the parent, read /tmp/httpipe line by line
    • parse the output with =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1
    • count the lines; after 100 lines just close the file, which breaks the pipe (see the shell sketch after this list)
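
A shell sketch of the FIFO approach, assuming GNU wget and bash; the URL is a placeholder, the URL extraction uses grep instead of the Perl regex above, and wget's log format varies between versions, so treat this as an outline rather than a finished tool:

#!/usr/bin/env bash
# Sketch only: read wget's spider log from a FIFO and stop after 100 URLs.
set -u

fifo=/tmp/httpipe
mkfifo "$fifo"                      # same effect as: mknod /tmp/httpipe p

# Child: run the spider and send its log into the pipe (-o = --output-file).
wget --spider -r -l 1 -o "$fifo" "http://www.example.com/" &
wget_pid=$!

# Parent: read the log line by line, print each URL found,
# and stop once 100 have been seen.
count=0
while IFS= read -r line; do
  url=$(printf '%s\n' "$line" | grep -oE 'https?://[^ ]+' | head -n 1)
  [ -n "$url" ] || continue
  printf '%s\n' "$url"
  count=$((count + 1))
  [ "$count" -ge 100 ] && break
done < "$fifo"

# Closing the read end breaks the pipe; make sure wget is gone, then clean up.
kill "$wget_pid" 2>/dev/null
rm -f "$fifo"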