Bash: parallelize md5sum checksum on many files
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16772186/
Asked by user1968963
Let's say I have a 64-core server, and I need to compute the md5sum of all files in /mnt/data and store the results in a text file:
find /mnt/data -type f -exec md5sum {} \; > md5.txt
The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that at any given time 64 parallel md5 processes are running (but not more than 64).
Also, I need the output from all the processes to be stored in one file.
NOTE: I am not looking for a way to compute the md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, as long as there are files coming from find.
Answered by Steve
Use GNU parallel. You can find some more examples of how to use it here.
find /mnt/data -type f | parallel -j 64 md5sum > md5.txt
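If filenames may contain spaces or other unusual characters, a null-delimited variant should be safer (a sketch, not part of the original answer):
find /mnt/data -type f -print0 | parallel -0 -j 64 md5sum > md5.txt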
Answered by Tony
You can use xargs as well; it might be more widely available than GNU parallel on some distros.
-P controls the number of processes spawned.
find /mnt/data -type f | xargs -L1 -P24 md5sum > /tmp/result.txt
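Note that with -L1, filenames containing whitespace get split into multiple arguments. A null-delimited variant (a sketch, assuming GNU findutils) avoids that and matches the 64 cores from the question:
find /mnt/data -type f -print0 | xargs -0 -n1 -P64 md5sum > /tmp/result.txt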
Answered by jm666
If you want to experiment, try installing md5deep. (http://md5deep.sourceforge.net)
Here is the manual, where you can read:
-jnn Controls multi-threading. By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading
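For the original task, an invocation might look like this (a sketch based on the manual excerpt above; -r enables recursive traversal and -j sets the number of hashing threads):
md5deep -r -j64 /mnt/data > md5.txt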
If this does not help, you have an I/O bottleneck.
Answered by TrueY
UPDATED
If you do not want to use additional packages, you can try something like this:
#!/usr/bin/bash
max=5
cpid=()
# Enable job control to receive SIGCHLD
set -m
remove() {
    for i in ${!cpid[*]}; do
        [ ! -d /proc/$i ] && echo UNSET $i && unset cpid[$i] && break
    done
}
trap remove SIGCHLD
for x in $(find ./ -type f -name '*.sh'); do
    some_long_process $x &
    cpid[$!]="$x"
    while [ ${#cpid[*]} -ge $max ]; do
        echo DO SOMETHING && sleep 1
    done
done
wait
It first enables the script to receive SIGCHLD when a subprocess exits. On SIGCHLD it finds the first no-longer-existing process and removes it from the cpid array.
In the for loop it starts up to max some_long_process processes asynchronously. When max is reached, it polls all pids added to the cpid array and waits until cpid's length is less than max before starting more processes asynchronously.
When the list is exhausted, it waits for all remaining children to finish.
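For the original md5sum task, the same package-free idea can be written more simply on bash 4.3+ with wait -n (a sketch of my own adaptation, not part of the original answer):
#!/bin/bash
max=64
find /mnt/data -type f -print0 | {
    running=0
    while IFS= read -r -d '' f; do
        md5sum "$f" &                # hash one file in the background
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait -n                  # block until any one background job exits
            running=$((running - 1))
        fi
    done
    wait                             # wait for the remaining jobs
} > md5.txt
The final wait must stay inside the brace group, since the loop runs in the subshell on the right side of the pipe and the background jobs belong to it.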
ADDED

