bash - Crawl a website using wget and limit the total number of crawled links

Disclaimer: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4973152/

Date: 2020-09-17 23:26:45  Source: igfitidea

Crawl website using wget and limit total number of crawled links

Tags: bash, scripting, web-crawler, wget

Asked by GobiasKoffi

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Accepted answer by Wolph

You can't. wget doesn't support this, so if you want something like this, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.

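A minimal sketch of that manual approach, assuming a Unix shell with wget, grep, sed, and head available; http://www.example.com and the pages/ directory are placeholders, and the crude href extraction will miss or mangle links on real-world markup:

#!/usr/bin/env bash
# Sketch only, not from the original answer: fetch the start page,
# extract href targets, and download at most the first 100 absolute links.
set -u

start_url="http://www.example.com/"

wget -q -O index.html "$start_url"

# Crude link extraction; a real crawler should use an HTML parser instead.
grep -oE 'href="[^"]+"' index.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | grep -E '^https?://' \
  | head -n 100 \
  | while read -r url; do
      wget -q -w 5 -P pages/ "$url"
    done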

You could also take a look at HTTrack for website crawling; it has quite a few extra options for this: http://www.httrack.com/

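For reference, a basic HTTrack invocation looks roughly like the following; the URL and output directory are placeholders, only the recursion depth is limited here, and whether HTTrack can cap the total number of links is best checked in its own manual (httrack --help), so no such flag is shown:

# Sketch only: mirror the site two levels deep into /tmp/mirror.
# -O sets the output path, -r2 the mirror depth, -v enables verbose output.
httrack "http://www.example.com/" -O /tmp/mirror -r2 -v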

Answer by Olivier Delouya

  1. Create a FIFO file (mknod /tmp/httpipe p)
  2. Do a fork
    • in the child, run wget --spider -r -l 1 http://myurl --output-file /tmp/httpipe
    • in the parent, read /tmp/httpipe line by line
    • parse the output with =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1
    • count the lines; after 100 lines just close the file, which breaks the pipe (see the shell sketch after this list)
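
A shell sketch of the FIFO approach, assuming GNU wget and bash; the URL is a placeholder, the URL extraction uses grep instead of the Perl regex above, and wget's log format varies between versions, so treat this as an outline rather than a finished tool:

#!/usr/bin/env bash
# Sketch only: read wget's spider log from a FIFO and stop after 100 URLs.
set -u

fifo=/tmp/httpipe
mkfifo "$fifo"                      # same effect as: mknod /tmp/httpipe p

# Child: run the spider and send its log into the pipe (-o = --output-file).
wget --spider -r -l 1 -o "$fifo" "http://www.example.com/" &
wget_pid=$!

# Parent: read the log line by line, print each URL found,
# and stop once 100 have been seen.
count=0
while IFS= read -r line; do
  url=$(printf '%s\n' "$line" | grep -oE 'https?://[^ ]+' | head -n 1)
  [ -n "$url" ] || continue
  printf '%s\n' "$url"
  count=$((count + 1))
  [ "$count" -ge 100 ] && break
done < "$fifo"

# Closing the read end breaks the pipe; make sure wget is gone, then clean up.
kill "$wget_pid" 2>/dev/null
rm -f "$fifo"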