Linux: parallel downloads with the curl command-line utility

Disclaimer: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this site) and link the original: http://stackoverflow.com/questions/8634109/

Parallel download using Curl command line utility

Tags: linux, shell, unix, curl

Asked by Ravi Gupta

I want to download some pages from a website, and I did it successfully using curl, but I was wondering whether curl can somehow download multiple pages at a time, just like most download managers do; it would speed things up a little. Is it possible to do this with the curl command-line utility?

The current command I am using is

我正在使用的当前命令是

curl 'http://www...../?page=[1-10]' 2>&1 > 1.html

Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.

Also, is it possible for curl to write the output of each URL to a separate file, say URL.html, where URL is the actual URL of the page being processed?

Accepted answer by nimrodm

Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.

curl can use the filename part of the URL to generate the local file. Just use the -O option (see man curl for details).
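
For example (example.com is a placeholder here; for query-string URLs like the question's ?page=N, an explicit -o with curl's #1 glob placeholder may be the more predictable form):

curl -O 'http://example.com/page1.html'                    # saved as page1.html
curl 'http://example.com/?page=[1-10]' -o 'page#1.html'    # one file per page number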

You could use something like the following:

urls="http://example.com/?page1.html http://example.com/?page2.html" # add more URLs here

for url in $urls; do
   # run the curl job in the background so we can start another job
   # and disable the progress bar (-s)
   echo "fetching $url"
   curl "$url" -O -s &
done
wait # wait for all background jobs to terminate

Answered by zengr

I am not sure about curl, but you can do that using wget.

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains website.org \
     --no-parent \
         www.website.org/tutorials/html/
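
Note that the command above mirrors a site recursively rather than fetching a fixed page range, and classic wget downloads sequentially. A rough sketch for the question's pages 1-10 (reusing the question's elided URL) would simply background one wget per page:

for i in $(seq 1 10); do
    wget -q "http://www...../?page=$i" -O "$i.html" &
done
wait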

Answered by AXE Labs

Curl can also accelerate a download of a file by splitting it into parts:

$ man curl |grep -A2 '\--range'
       -r/--range <range>
              (HTTP/FTP/SFTP/FILE)  Retrieve a byte range (i.e a partial docu-
              ment) from a HTTP/1.1, FTP or  SFTP  server  or  a  local  FILE.

Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl

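A hand-rolled sketch of the same idea (the URL and the two-way split are purely illustrative; the server must support byte ranges, and the real size is taken from the Content-Length header):

url='http://example.com/big.iso'   # hypothetical file
size=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
half=$((size / 2))
curl -r "0-$((half - 1))" -o part1 "$url" &   # first half
curl -r "$half-"          -o part2 "$url" &   # second half
wait
cat part1 part2 > big.iso                     # reassemble the parts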

Answered by Jonas Berlin

For launching parallel commands, why not use the venerable make command-line utility? It supports parallel execution, dependency tracking, and whatnot.

How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:

# which page numbers to fetch
numbers := $(shell seq 1 10)

# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))

# the rule which tells how to generate a %.html dependency
# $@ is the target filename e.g. 1.html
%.html:
        curl -C - 'http://www...../?page='$(patsubst %.html,%,$@) -o $@.tmp
        mv $@.tmp $@

NOTE: The last two lines should start with a TAB character (instead of 8 spaces) or make will not accept the file.
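
As a small aside (not from the original answer): GNU make's automatic $* variable already holds the pattern stem, so the patsubst call can be dropped; an equivalent recipe, with the same TAB requirement, might look like:

%.html:
        curl -C - 'http://www...../?page='$* -o $@.tmp
        mv $@.tmp $@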

Now you just run:

make -k -j 5

The curl command I used will store the output in 1.html.tmp, and only if the curl command succeeds will it be renamed to 1.html (by the mv command on the next line). Thus if some download fails, you can just re-run the same make command and it will resume/retry downloading the files that failed the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".

(The -k switch tells make to keep downloading the rest of the files even if one single download fails.)

Answered by ndronen

My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.

The one-liner I would use is, simply:

$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'

This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.
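
A harmless way to watch this in action (not part of the original answer) is to substitute sleep for curl; with -P2, two jobs overlap at any given moment, and each input number reaches the inline script as $0:

seq 1 6 | xargs -n1 -P2 bash -c 'echo "start $0"; sleep 1; echo "done $0"'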

You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork-bomb your machine. It's possible to do something similar in a for loop in a shell, but you end up doing process management, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.

Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD, with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:

parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}

Answered by Alex

Running a limited number of processes is easy if your system has commands like pidof or pgrep which, given a process name, return the pids (the count of the pids tells you how many are running).

Something like this:

#!/bin/sh
max=4
running_curl() {
    set -- $(pidof curl)
    echo $#
}
while [ $# -gt 0 ]; do
    while [ $(running_curl) -ge $max ] ; do
        sleep 1
    done
    curl "$1" --create-dirs -o "${1##*://}" &
    shift
done

to call like this:

script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)

The curl line of the script is untested.

Answered by Slava Ignatyev

I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces, http://example.com/page{1,2,3}.html, and run them in parallel with xargs. The following would start downloading in 3 processes:

seq 1 50 | fmt -w40 | tr ' ' ',' \
| awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
| xargs -P3 -n1 curl -o

so 4 downloadable lines of URLs are generated and sent to xargs:

curl -o http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
curl -o http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
curl -o http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
curl -o http://example.com/page{43,44,45,46,47,48,49,50}.html

Answered by Andrew Pantyukhin

As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which should be much faster and more resource-efficient compared to xargs and background spawning, in most cases:

curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'

This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/

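Applied to the question's own URL pattern, a sketch (curl 7.66.0 or newer assumed) could look like this, where --parallel-max caps the number of simultaneous transfers (the default is 50):

curl --parallel --parallel-max 5 'http://www...../?page=[1-10]' -o 'page#1.html'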