Parallel download on Linux using the curl command-line utility
Disclaimer: This page is drawn from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/8634109/
Parallel download using Curl command line utility
Asked by Ravi Gupta
I want to download some pages from a website, and I did it successfully using curl, but I was wondering whether curl can download multiple pages at a time, just like most download managers do; that would speed things up a little. Is it possible to do this with the curl command-line utility?
The current command I am using is:
curl 'http://www...../?page=[1-10]' 2>&1 > 1.html
Here I am downloading pages 1 to 10 and storing them in a file named 1.html.
Also, is it possible for curl to write the output of each URL to a separate file, say URL.html, where URL is the actual URL of the page being processed?
Accepted answer by nimrodm
Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.
curl can use the filename part of the URL to generate the local file. Just use the -O option (see man curl for details).
You could use something like the following:
urls="http://example.com/?page1.html http://example.com?page2.html" # add more URLs here
for url in $urls; do
# run the curl job in the background so we can start another job
# and disable the progress bar (-s)
echo "fetching $url"
curl $url -O -s &
done
wait #wait for all background jobs to terminate
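A small variation, in case the URL list is long: the loop above launches every download at once, so a cap on concurrency may help. This is just a sketch, assuming bash 4.3+ (for wait -n) and reusing the $urls variable from the loop above:
max_jobs=3
for url in $urls; do
    # once the cap is reached, block until any background job finishes
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
    curl "$url" -O -s &
done
wait # wait for the remaining downloads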
Answer by zengr

Answer by AXE Labs
Curl can also accelerate the download of a single file by splitting it into parts:
$ man curl |grep -A2 '\--range'
-r/--range <range>
          (HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e. a partial document) from a HTTP/1.1, FTP or SFTP server or a local FILE.
Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl
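To illustrate the --range idea itself (a minimal sketch, not the splitcurl script; the URL is a placeholder, and it assumes the server reports Content-Length and accepts range requests):
url="http://example.com/big.iso"
size=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
half=$((size / 2))
# fetch both halves in parallel, then stitch them together in order
curl -s -r "0-$((half - 1))" -o part1 "$url" &
curl -s -r "$half-" -o part2 "$url" &
wait
cat part1 part2 > big.iso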
Answer by Jonas Berlin
For launching parallel commands, why not use the venerable make command-line utility? It supports parallel execution, dependency tracking, and whatnot.
How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:
# which page numbers to fetch
numbers := $(shell seq 1 10)

# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))

# the rule which tells how to generate a %.html dependency
# $@ is the target filename e.g. 1.html
%.html:
	curl -C - 'http://www...../?page='$(patsubst %.html,%,$@) -o $@.tmp
	mv $@.tmp $@
NOTE: The last two lines must start with a TAB character (instead of 8 spaces) or make will not accept the file.
Now you just run:
make -k -j 5
The curl command I used will store the output in 1.html.tmp, and only if the curl command succeeds will it be renamed to 1.html (by the mv command on the next line). Thus if some download fails, you can just re-run the same make command and it will resume/retry downloading the files that failed the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".
(The -k switch tells make to keep downloading the rest of the files even if one single download fails.)
Answer by ndronen
My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.
The one-liner I would use is, simply:
$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'
This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.
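An equivalent sketch (my own variation, not from the answer) that skips the bash -c wrapper by letting xargs substitute the page number directly; -I{} makes {} the placeholder and implies one argument per command, and example.com is a stand-in host:
seq 1 10 | xargs -P2 -I{} curl -O -s "http://example.com/?page{}.html"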
You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork-bomb your machine. It's possible to do something similar in a for loop in a shell, but you end up doing process management, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:
parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}
Answer by Alex
Running a limited number of processes is easy if your system has commands like pidof or pgrep which, given a process name, return the pids (the count of the pids tells how many are running).
Something like this:
#!/bin/sh
max=4
running_curl() {
    set -- $(pidof curl)
    echo $#
}
while [ $# -gt 0 ]; do
    while [ $(running_curl) -ge $max ] ; do
        sleep 1
    done
    curl "$1" --create-dirs -o "${1##*://}" &
    shift
done
and call it like this:
script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)
The curl line of the script is untested.
Answer by Slava Ignatyev
I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces, http://example.com/page{1,2,3}.html, and run them in parallel with xargs. The following would start downloading in 3 processes:
seq 1 50 | fmt -w40 | tr ' ' ',' \
| awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
| xargs -P3 -n1 curl -o
so 4 downloadable lines of URLs are generated and sent to xargs:
curl -o http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
curl -o http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
curl -o http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
curl -o http://example.com/page{43,44,45,46,47,48,49,50}.html
Answer by Andrew Pantyukhin
As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which should be much faster and more resource-efficient compared to xargs and background spawning, in most cases:
curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/
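For the page range in the original question, a hedged equivalent using the same built-in support (curl 7.66.0+): --parallel-max caps the number of concurrent transfers, #1 in the output name expands to the current value of the [1-10] range, and the host is a placeholder:
curl --parallel --parallel-max 5 'http://www.example.com/?page=[1-10]' -o 'page_#1.html'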