Bash: parallelize md5sum checksum on many files

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16772186/



bash

Asked by user1968963

Let's say I have a 64-core server, and I need to compute the md5sum of all files in /mnt/data and store the results in a text file:

find /mnt/data -type f -exec md5sum {} \; > md5.txt

The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that at any given time 64 parallel md5 processes are running (but not more than 64).

Also, I would need the output from all the processes to be stored in one file.

NOTE: I am not looking for a way to compute the md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, for as long as find keeps producing files.

Answered by Steve

Use GNU parallel. You can find more examples of how to use it here.

find /mnt/data -type f | parallel -j 64 md5sum > md5.txt
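
If the filenames may contain spaces or newlines, a null-delimited variant is safer (a sketch; find's -print0 pairs with parallel's -0/--null option):

find /mnt/data -type f -print0 | parallel -0 -j 64 md5sum > md5.txt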

Answered by Tony

You can use xargs as well; it may be more readily available than parallel on some distros.

-P controls the number of processes spawned; -L1 makes xargs pass one line of input per command.

find /mnt/data -type f | xargs -L1 -P24  md5sum > /tmp/result.txt
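
To match the 64-core case in the question and stay safe with unusual filenames, a null-delimited variant might look like this (a sketch):

find /mnt/data -type f -print0 | xargs -0 -n 1 -P 64 md5sum > /tmp/result.txt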

Answered by jm666

If you want to experiment, try installing md5deep. (http://md5deep.sourceforge.net)

Here is the manual, where you can read:

-jnn Controls multi-threading. By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading.
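
For example, a recursive, multi-threaded run over the question's directory might look like this (a sketch; -r enables recursive operation, and -j sets the thread count as described above):

md5deep -r -j64 /mnt/data > md5.txt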

If this does not help, you have an I/O bottleneck.

Answered by TrueY

UPDATED

If you do not want to use additional packages, you can try something like this:

#!/usr/bin/bash

max=5
cpid=()

# Enable job control so the shell receives SIGCHLD when a child exits
set -m

# On SIGCHLD, drop the first PID whose process no longer exists
remove() {
  for i in "${!cpid[@]}"; do
    [ ! -d "/proc/$i" ] && echo "UNSET $i" && unset "cpid[$i]" && break
  done
}
trap remove SIGCHLD

# NOTE: this for-loop assumes filenames without whitespace; for arbitrary
# names, read null-delimited output of "find -print0" in a while loop instead
for x in $(find ./ -type f -name '*.sh'); do
  some_long_process "$x" &   # placeholder for the real job, e.g. md5sum
  cpid[$!]="$x"
  # Throttle: wait while $max children are still running
  while [ ${#cpid[*]} -ge $max ]; do
    echo "DO SOMETHING" && sleep 1
  done
done
wait

It first enables job control so that the script receives SIGCHLD when a subprocess exits. On SIGCHLD, the handler finds the first PID whose process no longer exists and removes it from the cpid array.

In the for loop it starts up to max some_long_process processes asynchronously. Once max is reached, it polls the PIDs stored in the cpid array and waits until the array's length drops below max before starting more processes asynchronously.

When the list is exhausted, it waits for all children to finish.
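
Adapted to the question, the placeholder could simply wrap md5sum (a sketch; the function body and md5.txt are illustrative, and each md5sum output line is short enough that concurrent appends are unlikely to interleave):

some_long_process() {
  md5sum "$1" >> md5.txt
}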

ADDED

Finally, I have found a proper make-based solution here.