How do you handle the "Too many files" problem when working in Bash?

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/186099/

bash, unix, shell

Asked by Vinko Vrsalovic

I often have to work with directories containing hundreds of thousands of files, doing text matching, replacement, and so on. If I go the standard route of, say,

grep foo *

I get the "too many files" error message, so I end up doing

for i in *; do grep foo "$i"; done

or

find ../path/ | xargs -I{} grep foo "{}"

But these are less than optimal (they create a new grep process for each file; xargs' -I option implies one invocation per input line).

This looks more like a limitation on the size of the argument list a program can receive, because the * in the for loop works fine. But, in any case, what's the proper way to handle this?

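The limit at play here is indeed ARG_MAX, the kernel's cap on the total bytes of arguments (plus environment) that execve() will accept, not a cap on the number of files as such. The * in the for loop works because the glob is expanded inside the shell and never passes through execve(). As a quick check (the exact value varies by system):

getconf ARG_MAX    # maximum bytes for argv + environ, e.g. 2097152 on many Linux systems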

PS: Don't tell me to use grep -r instead; I know about that. I'm thinking about tools that do not have a recursive option.

Answered by Charles Duffy

In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):

find ../path -exec grep foo '{}' +

The use of + rather than ; as the last argument triggers this behavior.

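For contrast, a minimal sketch of both forms (../path is a placeholder; -type f just keeps directories out of grep's argument list):

find ../path -type f -exec grep foo '{}' ';'   # ';' runs one grep per file
find ../path -type f -exec grep foo '{}' +     # '+' batches as many files per grep as fit in ARG_MAX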

Answered by JesperE

If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find, together with the -0 flag to xargs:

find . -print0 | xargs -0 grep -H foo
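
A quick sketch of what goes wrong without it (the file name is hypothetical):

touch 'foo bar.txt'                                   # a file name containing a space
find . -name '*.txt' | xargs grep -H foo              # breaks: xargs sees './foo' and 'bar.txt' as two arguments
find . -name '*.txt' -print0 | xargs -0 grep -H foo   # works: names are NUL-delimited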

Answered by camh

xargs does not start a new process for each file; it bunches the arguments together. Have a look at the -n option to xargs: it controls the number of arguments passed to each execution of the sub-command.

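A tiny sketch of -n in action, with echo standing in for the real command:

printf '%s\n' a b c d e | xargs -n 2 echo
# runs: echo a b / echo c d / echo e  (three invocations, not five)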

Answered by paxdiablo

I can't see that

for i in *; do
    grep foo "$i"
done

would work, since I thought "too many files" was a shell limitation, hence it would fail for the for loop as well.

Having said that, I always let xargs do the grunt work of splitting the argument list into manageable chunks, thus:

find ../path/ | xargs grep foo

It won't start one process per file, but one per group of files.

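You can watch the batching happen; a sketch assuming GNU xargs and seq:

seq 1 100000 | xargs echo | wc -l    # prints a handful of lines, one per batch, not 100000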

Answered by paxdiablo

Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: globs are expensive, ls on a million-file directory takes forever (20+ minutes on one of my servers), and ls * on a million-file directory takes forever and fails with an "argument list too long" error.

find /some -type f -exec some-command {} \;

seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your work across multiple threads. Here is a Python primer for scripting CLI tasks: http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR

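Staying in the shell, GNU xargs can also parallelize the grunt work itself; a minimal sketch (the -P and -n values are arbitrary, and grep stands in for "some command"):

find /some -type f -print0 | xargs -0 -P 4 -n 1000 grep -H foo
# -P 4: up to four grep processes run at once; -n 1000: at most 1000 file names per invocation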