bash: Speed up rsync with simultaneous/concurrent file transfers?

Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24058544/

Date: 2020-09-10 00:52:14  Source: igfitidea

Speed up rsync with Simultaneous/Concurrent File Transfers?

Tags: bash, shell, ubuntu-12.04, rsync, simultaneous

Asked by BT643

We need to transfer 15TB of data from one server to another as fast as we can. We're currently using rsync but we're only getting speeds of around 150Mb/s, when our network is capable of 900+Mb/s (tested with iperf). I've done tests of the disks, network, etc. and figured it's just that rsync is only transferring one file at a time which is causing the slowdown.


I found a script to run a different rsync for each folder in a directory tree (allowing you to limit to x number), but I can't get it working; it still just runs one rsync at a time.


I found the script here (copied below).


Our directory tree is like this:


/main
   - /files
      - /1
         - 343
            - 123.wav
            - 76.wav
         - 772
            - 122.wav
         - 55
            - 555.wav
            - 324.wav
            - 1209.wav
         - 43
            - 999.wav
            - 111.wav
            - 222.wav
      - /2
         - 346
            - 9993.wav
         - 4242
            - 827.wav
      - /3
         - 2545
            - 76.wav
            - 199.wav
            - 183.wav
         - 23
            - 33.wav
            - 876.wav
         - 4256
            - 998.wav
            - 1665.wav
            - 332.wav
            - 112.wav
            - 5584.wav

So what I'd like to happen is to create an rsync for each of the directories in /main/files, up to a maximum of, say, 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.


I tried with it like this, but it just runs 1 rsync at a time for the /main/files/2 folder:


#!/bin/bash

# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"

# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5

# Find all folders in the source directory within the maxdepth level
# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while IFS= read -r dir
do
    # Make sure to ignore the parent folder
    if [ "$(echo "${dir}" | awk -F'/' '{print NF}')" -gt ${depth} ]
    then
        # Strip leading dot slash
        subfolder=$(echo "${dir}" | sed 's@^\./@@g')
        if [ ! -d "${target}/${subfolder}" ]
        then
            # Create destination folder and set ownership and permissions to match source
            mkdir -p "${target}/${subfolder}"
            chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
            chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
        fi
        # Make sure the number of rsync threads running is below the threshold
        while [ "$(ps -ef | grep -c '[r]sync')" -gt ${maxthreads} ]
        do
            echo "Sleeping ${sleeptime} seconds"
            sleep ${sleeptime}
        done
        # Run rsync in the background for the current subfolder and move on to the next one
        nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
    fi
done

# Find all files above the maxdepth level and rsync them as well
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"
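A side note on that script's thread-count check: `ps -ef | grep -c [r]sync` counts every rsync on the machine, including ones started by other users or sessions, so the cap can be starved by unrelated transfers. Tracking the script's own background processes is more robust; a minimal sketch (the PID tracking here is illustrative, not part of the original script):

```shell
# Track background PIDs explicitly instead of grepping ps, which also
# counts rsync processes this script did not start.
pids=""
sleep 2 & pids="$pids $!"   # stand-ins for long-running rsyncs
sleep 2 & pids="$pids $!"

# Count how many of our own jobs are still alive.
running=0
for p in $pids; do
    kill -0 "$p" 2>/dev/null && running=$((running + 1))
done
echo "running: $running"
wait   # reap both jobs before the script exits
```

The same `running` count could replace the `ps | grep` test in the while loop above.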

Answered by Manuel Riel

Updated answer (Jan 2020)


xargs is now the recommended tool to achieve parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks the command would be:


ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

This will list all folders in /srv/mail and pipe them to xargs, which reads them one by one and runs 4 rsync processes at a time. The % character is replaced by the input argument in each command invocation.

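The mechanics are easy to dry-run with echo standing in for rsync (the folder names below are made up):

```shell
# Three input lines become three command invocations; -P4 lets up to
# four of them run concurrently, and % is replaced by each input line.
printf 'alice\nbob\ncarol\n' \
  | xargs -P4 -I% echo "rsync -Pa % myserver.com:/srv/mail/"
```

Because echo only prints the commands it would run, this is a handy way to check the substitution before launching real transfers.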

Original answer using parallel:


ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}

Answered by Stuart Caie

rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination. That speed is the maximum speed rsync can transfer data. Compare it with the speed of scp (for example). rsync is even slower at raw transfer when the destination file exists, because both sides have to have a two-way chat about what parts of the file are changed, but it pays for itself by identifying data that doesn't need to be transferred.


A simpler way to run rsync in parallel would be to use parallel. The command below would run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network, but the speed of your CPUs and disks, and running things in parallel just makes them all slower, not faster.


run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    rsync -av "$1" "/main/filesTest/${1#/main/files/}"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*

Answered by nickgryg

You can use xargs, which supports running many processes at a time. For your case it will be:


ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/

Answered by Bryan P

There are a number of alternative tools and approaches for doing this listed around the web. For example:


  • The NCSA Blog has a description of using xargs and find to parallelize rsync without having to install any new software, for most *nix systems.

  • And parsync provides a feature-rich Perl wrapper for parallel rsync.


Answered by max

I've developed a Python package called parallel_sync:


https://pythonhosted.org/parallel_sync/pages/examples.html


Here is a code sample showing how to use it:


from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)

Parallelism is 10 by default; you can increase it:


from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)

However, note that ssh typically has MaxSessions set to 10 by default, so to increase parallelism beyond 10 you'll have to modify your ssh settings.

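For reference, that setting lives in sshd_config on the receiving server; the value below is illustrative:

```
# /etc/ssh/sshd_config on the receiving server -- illustrative value
MaxSessions 20
```

Reload sshd after changing it for the new limit to take effect.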

Answered by sba

The simplest I've found is using background jobs in the shell:


for d in /main/files/*; do
    rsync -a "$d" remote:/main/files/ &
done

Beware that this doesn't limit the number of jobs! If you're network-bound this is not really a problem, but if you're waiting for spinning rust it will be thrashing the disk.


You could add


while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done

inside the loop for a primitive form of job control.

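Putting the two pieces together, a capped version might look like the sketch below; `run_capped` is an illustrative name, and `sleep`/`echo` stand in for the real rsync call so it can be dry-run locally.

```shell
# Launch one background job per argument, but never more than $max at
# once. Each job just sleeps and echoes, standing in for rsync.
run_capped() {
    local max=3 d
    for d in "$@"; do
        # primitive job control: block while $max jobs are running
        while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do sleep 1; done
        # real use: rsync -a "/main/files/$d" remote:/main/files/ &
        { sleep 1; echo "done $d"; } &
    done
    wait   # don't return until every transfer has finished
}

run_capped 1 2 3 4 5 6
```

Unlike the plain loop above, at most three transfers are in flight at any moment, which keeps the disk from thrashing.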