Bash: parallelize md5sum checksum on many files
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16772186/
Asked by user1968963
Let's say I have a 64-core server, and I need to compute the md5sum of all files in /mnt/data and store the results in a text file:
find /mnt/data -type f -exec md5sum {} \; > md5.txt
The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that at any given time 64 parallel md5 processes are running (but not more than 64).
Also, I need the output from all the processes to be stored in one file.
NOTE: I am not looking for a way to compute the md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, as long as there are files coming from find.
Answered by Steve
Use GNU parallel. You can find some more examples of how to use it here.
find /mnt/data -type f | parallel -j 64 md5sum > md5.txt
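If filenames may contain spaces or other unusual characters, a null-delimited variant should be safer (a sketch, not part of the original answer):
find /mnt/data -type f -print0 | parallel -0 -j 64 md5sum > md5.txt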
Answered by Tony
You can use xargs as well; it might be more widely available than GNU parallel on some distros.
-P controls the number of processes spawned.
find /mnt/data -type f | xargs -L1 -P24 md5sum > /tmp/result.txt
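Note that with -L1, filenames containing whitespace get split into multiple arguments. A null-delimited variant (a sketch, assuming GNU findutils) avoids that and matches the 64 cores from the question:
find /mnt/data -type f -print0 | xargs -0 -n1 -P64 md5sum > /tmp/result.txt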
Answered by jm666
If you want to experiment, try installing md5deep. (http://md5deep.sourceforge.net)
Here is the manual, where you can read:
-jnn Controls multi-threading. By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading
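For the original task, an invocation might look like this (a sketch based on the manual excerpt above; -r enables recursive traversal and -j sets the number of hashing threads):
md5deep -r -j64 /mnt/data > md5.txt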
If this does not help, you have an I/O bottleneck.
Answered by TrueY
UPDATED
If you do not want to use additional packages, you can try something like this:
#!/usr/bin/bash
max=5
cpid=()
# Enable job control to receive SIGCHLD
set -m
remove() {
    for i in ${!cpid[*]}; do
        [ ! -d /proc/$i ] && echo UNSET $i && unset cpid[$i] && break
    done
}
trap remove SIGCHLD
for x in $(find ./ -type f -name '*.sh'); do
    some_long_process $x &
    cpid[$!]="$x"
    while [ ${#cpid[*]} -ge $max ]; do
        echo DO SOMETHING && sleep 1
    done
done
wait
It first enables the script to receive SIGCHLD when a subprocess exits. On SIGCHLD it finds the first no-longer-existing process and removes it from the cpid array.
In the for loop it starts up to max some_long_process processes asynchronously. When max is reached, it polls all pids added to the cpid array and waits until cpid's length is less than max before starting more processes asynchronously.
When the list is exhausted, it waits for all remaining children to finish.
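For the original md5sum task, the same package-free idea can be written more simply on bash 4.3+ with wait -n (a sketch of my own adaptation, not part of the original answer):
#!/bin/bash
max=64
find /mnt/data -type f -print0 | {
    running=0
    while IFS= read -r -d '' f; do
        md5sum "$f" &                # hash one file in the background
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait -n                  # block until any one background job exits
            running=$((running - 1))
        fi
    done
    wait                             # wait for the remaining jobs
} > md5.txt
The final wait must stay inside the brace group, since the loop runs in the subshell on the right side of the pipe and the background jobs belong to it.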
ADDED

