Note: this page is a translation/mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/2697213/
More efficient way to find & tar millions of files
Asked by Stu Thompson
I've got a job running on my server at the command line prompt for two days now:
find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} \;
It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...
find data/ -name filepattern-*2009* -print > filesOfInterest.txt
...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?
A secondary question is "why is my current approach so slow?"
Accepted answer by frankc
If you already did the second command that created the file list, just use the -T option to tell tar to read the file names from that saved file list. Running 1 tar command vs. N tar commands will be a lot better.
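A minimal end-to-end sketch of that approach, using hypothetical file names in a scratch directory (GNU tar assumed for -T; -cf is used here to create a fresh archive rather than the question's -uf update):

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp"

# Fake a few data files matching the question's pattern.
mkdir -p data
touch data/filepattern-a2009 data/filepattern-b2009 data/other-2010

# Step 1: build the file list once (the cheap, ~2-hour step in the question).
find data/ -name 'filepattern-*2009*' -print > filesOfInterest.txt

# Step 2: a single tar invocation reads every name from the list.
tar -cf 2009.tar -T filesOfInterest.txt

tar -tf 2009.tar   # lists the two matching files
```

The point is that tar is started once for the whole list, instead of once per file as with -exec ... \;.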
Answered by Matthew Mott
One option is to use cpio to generate a tar-format archive:
$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar
cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.
Answered by bashfu
Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):
tar --version # tar (GNU tar) 1.14
# FreeBSD find (on Mac OS X)
find -x data -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -
# for GNU find use -xdev instead of -x
gfind data -xdev -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -
# added: set permissions via tar
find -x data -name "filepattern-*2009*" -print0 | \
tar --null --no-recursion --owner=... --group=... --mode=... -uf 2009.tar --files-from -
Answered by Michal Čihař
There is xargs for this:
find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar
Guessing why it is slow is hard, as there is not much information. What is the structure of the directory, what filesystem do you use, and how was it configured at creation time? Having millions of files in a single directory is quite a hard situation for most filesystems.
Answered by bashfu
To correctly handle file names with weird (but legal) characters (such as newlines, ...) you should write your file list to filesOfInterest.txt using find's -print0:
find -x data -name "filepattern-*2009*" -print0 > filesOfInterest.txt
tar --null --no-recursion -uf 2009.tar --files-from filesOfInterest.txt
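To see why -print0 matters, here is a sketch with a hypothetical scratch setup in which one file name contains an embedded newline; the null-delimited pipeline carries it through intact (plain `find data` is used instead of the BSD `-x` flag, and -cf instead of -uf, for portability):

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p data

# A legal but nasty file name with an embedded newline.
touch "data/filepattern-bad
name2009"
touch data/filepattern-ok2009

# NUL-terminated names survive where newline-terminated ones would be split.
find data -name 'filepattern-*2009*' -print0 > filesOfInterest.txt
tar --null --no-recursion -cf 2009.tar --files-from filesOfInterest.txt

# Extracting proves both members, including the newline-named one, made it in.
mkdir out
tar -C out -xf 2009.tar
```

With a plain -print list, the newline in the name would be read as a record separator and tar would look for two nonexistent files instead.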
Answered by Michael Aaron Safyan
The way you currently have things, you are invoking the tar command every single time it finds a file, which is, not surprisingly, slow. Instead of paying the two hours of find time plus, once, the time it takes to open the tar archive, check whether the files are out of date, and add them to the archive, you are actually multiplying those times together. You might have better success invoking the tar command once, after you have batched together all the names, possibly using xargs to achieve the invocation. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as without quotes the stars will be expanded by the shell.
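The per-file cost described above can be observed directly: find's -exec ... \; runs its command once per file, while -exec ... + (like xargs) batches many names into a single invocation. A small counting sketch, using hypothetical scratch files:

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p data
touch data/f1 data/f2 data/f3

# Log one line per command invocation in each variant.
find data -type f -exec sh -c 'echo run >> per_file.log' \;
find data -type f -exec sh -c 'echo run >> batched.log' sh {} +

wc -l < per_file.log   # 3 — one invocation per file
wc -l < batched.log    # 1 — one invocation for the whole batch
```

With millions of files, the \; form pays process startup plus (for tar -uf) archive open/scan cost millions of times, which is exactly the multiplication the answer describes.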
Answered by ruffrey
There is a utility for this called tarsplitter.
tarsplitter -m archive -i folder/*.json -o archive.tar -p 8
will use 8 threads to archive the files matching "folder/*.json" into an output archive of "archive.tar"
Answered by Oleg Kuznetsov
Simplest (also removes the files after archive creation):
find *.1 -exec tar czf '{}.tgz' '{}' --remove-files \;