Speed up rsync with Simultaneous/Concurrent File Transfers?
Original question: http://stackoverflow.com/questions/24058544/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Asked by BT643
We need to transfer 15TB of data from one server to another as fast as we can. We're currently using rsync but we're only getting speeds of around 150Mb/s, when our network is capable of 900+Mb/s (tested with iperf). I've done tests of the disks, network, etc. and figured it's just that rsync is only transferring one file at a time, which is causing the slowdown.
I found a script to run a different rsync for each folder in a directory tree (allowing you to limit to x number), but I can't get it working; it still just runs one rsync at a time.
I found the script here (copied below).
Our directory tree is like this:
/main
  - /files
    - /1
      - 343
        - 123.wav
        - 76.wav
      - 772
        - 122.wav
      - 55
        - 555.wav
        - 324.wav
        - 1209.wav
      - 43
        - 999.wav
        - 111.wav
        - 222.wav
    - /2
      - 346
        - 9993.wav
      - 4242
        - 827.wav
    - /3
      - 2545
        - 76.wav
        - 199.wav
        - 183.wav
      - 23
        - 33.wav
        - 876.wav
      - 4256
        - 998.wav
        - 1665.wav
        - 332.wav
        - 112.wav
        - 5584.wav
So what I'd like to happen is to create an rsync for each of the directories in /main/files, up to a maximum of, say, 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.
I tried with it like this, but it just runs 1 rsync at a time for the /main/files/2 folder:
#!/bin/bash
# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"
# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5
# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while read dir
do
# Make sure to ignore the parent folder
if [ `echo "${dir}" | awk -F'/' '{print NF}'` -gt ${depth} ]
then
# Strip leading dot slash
subfolder=$(echo "${dir}" | sed 's@^\./@@g')
if [ ! -d "${target}/${subfolder}" ]
then
# Create destination folder and set ownership and permissions to match source
mkdir -p "${target}/${subfolder}"
chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
fi
# Make sure the number of rsync threads running is below the threshold
while [ `ps -ef | grep -c [r]sync` -gt ${maxthreads} ]
do
echo "Sleeping ${sleeptime} seconds"
sleep ${sleeptime}
done
# Run rsync in background for the current subfolder and move on to the next one
nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
fi
done
# Find all files above the maxdepth level and rsync them as well
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"
Answered by Manuel Riel
Updated answer (Jan 2020)
xargs is now the recommended tool to achieve parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks, the command would be:
ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/
This will list all folders in /srv/mail, pipe them to xargs, which will read them one by one and run 4 rsync processes at a time. The % char replaces the input argument for each command call.
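To see what the -P fan-out does without touching any real data, here is a tiny stand-alone sketch of the same pattern (the inputs one/two/three are made up for the example; sort makes the output deterministic, since the completion order of the parallel processes is not):

```shell
# xargs runs one command per input line (-I% implies line-at-a-time) and
# keeps up to 4 processes running at once (-P4); % is replaced by each input.
out=$(printf '%s\n' one two three | xargs -P4 -I% echo "got %" | sort)
echo "$out"
```

Swapping echo for an rsync invocation gives exactly the command shown above.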
Original answer using parallel:
ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}
Answered by Stuart Caie
rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination. That speed is the maximum speed rsync can transfer data. Compare it with the speed of scp (for example). rsync is even slower at raw transfer when the destination file exists, because both sides have to have a two-way chat about what parts of the file are changed, but it pays for itself by identifying data that doesn't need to be transferred.
A simpler way to run rsync in parallel would be to use parallel. The command below would run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network, but the speed of your CPUs and disks, and running things in parallel just makes them all slower, not faster.
run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    rsync -av "$1" "/main/filesTest/${1#/main/files/}"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*
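The destination path in run_rsync is built with bash prefix removal. A quick stand-alone check of that idiom, using a made-up path:

```shell
# ${var#pattern} strips the shortest matching prefix, so the subfolder
# name survives and is re-rooted under the destination directory.
path="/main/files/blah"
dest="/main/filesTest/${path#/main/files/}"
echo "$dest"   # -> /main/filesTest/blah
```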
Answered by nickgryg
You can use xargs, which supports running many processes at a time. For your case it will be:
ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/
Answered by Bryan P
There are a number of alternative tools and approaches for doing this listed around the web. For example:
Answered by max
I've developed a Python package called parallel_sync:
https://pythonhosted.org/parallel_sync/pages/examples.html
Here is some sample code showing how to use it:
from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)
Parallelism by default is 10; you can increase it:
from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)
However, note that ssh typically has MaxSessions set to 10 by default, so to increase parallelism beyond 10 you'll have to modify your ssh settings.
Answered by sba
The simplest I've found is using background jobs in the shell:
for d in /main/files/*; do
    rsync -a "$d" remote:/main/files/ &
done
Beware that it doesn't limit the number of jobs! If you're network-bound this is not really a problem, but if you're waiting for spinning rust it will be thrashing the disk.
You could add
while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done
inside the loop for a primitive form of job control.
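Putting the two snippets together, here is a self-contained sketch of the capped-background-jobs pattern. It uses temp directories and cp as a stand-in for the question's real paths and the remote rsync, so it can run anywhere:

```shell
# Cap concurrent background copies at 2 (a stand-in for the remote rsyncs).
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/1" "$src/2" "$src/3"
echo hello > "$src/1/a.wav"
for d in "$src"/*; do
    # primitive job control: pause while 2 or more jobs are still running
    while [ "$(jobs -p | wc -l)" -ge 2 ]; do sleep 0.2; done
    cp -r "$d" "$dst/" &
done
wait   # let the remaining background jobs finish before using $dst
ls "$dst"
```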