在 Bash 中工作时如何处理“文件太多”的问题?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/186099/
Warning: these contents are provided under the CC BY-SA 4.0 license. You are free to use or share them, but you must attribute them to the original authors (not me):
StackOverFlow
How do you handle the "Too many files" problem when working in Bash?
提问 by Vinko Vrsalovic
I often have to work with directories containing hundreds of thousands of files, doing text matching, replacement and so on. If I go the standard route of, say
我经常需要处理包含数十万个文件的目录,做文本匹配、替换等操作。如果我采用标准做法,比如
grep foo *
I get the "too many files" error message, so I end up doing
我就会收到“文件太多”的错误消息,所以最后我只能这样写
for i in *; do grep foo $i; done
or
或者
find ../path/ | xargs -I{} grep foo "{}"
But these are less than optimal (they create a new grep process for each file).
但是这些都不是最佳的(为每个文件创建一个新的 grep 进程)。
This looks like more of a limitation in the size of the arguments programs can receive, because the * in the for loop works alright. But, in any case, what's the proper way to handle this?
这看起来更像是程序可以接收的参数大小的限制,因为 for 循环中的 * 工作正常。但是,无论如何,处理这个问题的正确方法是什么?
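As a side note, the limit being hit here is the kernel's maximum size for a command's argument list plus environment. A rough way to inspect it, assuming getconf is available and xargs is the GNU version:

getconf ARG_MAX                    # maximum combined size of argv + environment, in bytes
xargs --show-limits < /dev/null    # GNU xargs also reports the limits it will actually use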
PS: Don't tell me to do grep -r instead, I know about that, I'm thinking about tools that do not have a recursive option.
PS:不要跟我说改用 grep -r,我知道这个选项;我想问的是那些没有递归选项的工具。
回答 by Charles Duffy
In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):
在较新版本的 findutils 中,find 可以完成 xargs 的工作(包括把参数攒成一批批传入的行为,因此只会启动实际需要数量的 grep 进程):
find ../path -exec grep foo '{}' +
The use of + rather than ; as the last argument triggers this behavior.
使用 + 而不是 ; 作为最后一个参数会触发这种行为。
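A variation on the same command, in case it helps: restricting the search to regular files and asking grep to always print the matching file name (note that -type f is standard, while grep's -H is a GNU/BSD extension, so check your platform):

find ../path -type f -exec grep -H foo '{}' +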
回答 by JesperE
If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find together with the -0 flag to xargs:
如果文件名可能包含空格,你应该记得给 find 加上 -print0 选项,并配合 xargs 的 -0 选项使用:
find . -print0 | xargs -0 grep -H foo
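To illustrate why this matters, here is a small hypothetical scenario (the file name below is made up): without -print0/-0, xargs splits input on whitespace and hands grep two broken arguments.

touch 'foo bar.txt'
find . | xargs grep -l foo              # grep receives "./foo" and "bar.txt" as two separate names
find . -print0 | xargs -0 grep -l foo   # the NUL-delimited name arrives intact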
回答 by camh
xargs does not start a new process for each file. It bunches together the arguments. Have a look at the -n option to xargs - it controls the number of arguments passed to each execution of the sub-command.
xargs 不会为每个文件启动一个新进程,而是会把参数打包成批。请看 xargs 的 -n 选项,它控制每次执行子命令时传入的参数个数。
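For example, to cap the batch size explicitly (the 500 below is an arbitrary number, not a recommendation):

find ../path -type f | xargs -n 500 grep foo   # at most 500 file names per grep invocation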
回答 by paxdiablo
I can't see that
我看不出为什么
for i in *; do
grep foo $i
done
would work since I thought the "too many files" was a shell limitation, hence it would fail for the for loop as well.
能正常工作,因为我原以为“文件太多”是 shell 的限制,所以 for 循环应该同样会失败。
Having said that, I always let xargs do the grunt-work of splitting the argument list into manageable bits thus:
话虽如此,我总是让 xargs 来干把参数列表拆成可处理的小批量这种粗活,像这样:
find ../path/ | xargs grep foo
It won't start a process per file but per group of files.
它不会为每个文件启动一个进程,而是为每一组文件启动一个进程。
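If you want to see the batching for yourself, GNU xargs' -t flag echoes each constructed command line to stderr before running it (a quick sanity check, not something you need in normal use):

find ../path/ -type f | xargs -t grep foo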
回答 by paxdiablo
Well, I had the same problems, but it seems that everything I came up with is already mentioned. Mostly, I had two problems: doing globs is expensive, so ls on a million-file directory takes forever (20+ minutes on one of my servers), and ls * on a million-file directory takes forever and then fails with an "argument list too long" error.
好吧,我也遇到过同样的问题,但似乎我能想到的办法前面都已经提到了。我主要遇到两个问题:展开通配符(glob)开销很大,在有一百万个文件的目录里执行 ls 要花很长时间(在我的一台服务器上要 20 多分钟),而执行 ls * 不仅同样慢,还会以“参数列表太长”的错误告终。
find /some -type f -exec some command {} \;
seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your stuff across multiple threads. Here is a Python primer for scripting CLI stuff: http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
似乎能同时解决这两个问题。另外,如果你需要对这些文件做更复杂的操作,可以考虑写个脚本,用多个线程来处理。这里有一篇用 Python 编写命令行工具的入门文章: http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
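If the goal is simply parallelism rather than Python specifically, GNU xargs can also fan the work out across several processes; a sketch, assuming GNU xargs with the -P option (both numbers below are arbitrary examples):

# run up to four grep processes at a time, at most 1000 file names per batch
find /some -type f -print0 | xargs -0 -P 4 -n 1000 grep -H foo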

