How do you handle the "Too many files" problem when working in Bash?

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/186099/

bash, unix, shell

Asked by Vinko Vrsalovic

I often have to work with directories containing hundreds of thousands of files, doing text matching, replacement, and so on. If I go the standard route of, say,

grep foo *

I get the "too many files" error message, so I end up doing

for i in *; do grep foo "$i"; done

or

find ../path/ | xargs -I{} grep foo "{}"

But these are less than optimal (they create a new grep process for each file; xargs' -I option implies one invocation per input line).

This looks more like a limitation on the size of the argument list a program can receive, because the * in the for loop works fine. But, in any case, what's the proper way to handle this?

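The limit at play here is indeed ARG_MAX, the kernel's cap on the total bytes of arguments (plus environment) that execve() will accept, not a cap on the number of files as such. The * in the for loop works because the glob is expanded inside the shell and never passes through execve(). As a quick check (the exact value varies by system):

getconf ARG_MAX    # maximum bytes for argv + environ, e.g. 2097152 on many Linux systems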

PS: Don't tell me to use grep -r instead; I know about that. I'm thinking about tools that do not have a recursive option.

Answered by Charles Duffy

In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):

find ../path -exec grep foo '{}' +

The use of + rather than ; as the last argument triggers this behavior.

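For contrast, a minimal sketch of both forms (../path is a placeholder; -type f just keeps directories out of grep's argument list):

find ../path -type f -exec grep foo '{}' ';'   # ';' runs one grep per file
find ../path -type f -exec grep foo '{}' +     # '+' batches as many files per grep as fit in ARG_MAX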

Answered by JesperE

If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find, together with the -0 flag to xargs:

find . -print0 | xargs -0 grep -H foo
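
A quick sketch of what goes wrong without it (the file name is hypothetical):

touch 'foo bar.txt'                                   # a file name containing a space
find . -name '*.txt' | xargs grep -H foo              # breaks: xargs sees './foo' and 'bar.txt' as two arguments
find . -name '*.txt' -print0 | xargs -0 grep -H foo   # works: names are NUL-delimited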

Answered by camh

xargs does not start a new process for each file; it bunches the arguments together. Have a look at the -n option to xargs: it controls the number of arguments passed to each execution of the sub-command.

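A tiny sketch of -n in action, with echo standing in for the real command:

printf '%s\n' a b c d e | xargs -n 2 echo
# runs: echo a b / echo c d / echo e  (three invocations, not five)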

Answered by paxdiablo

I can't see that

for i in *; do
    grep foo "$i"
done

would work, since I thought "too many files" was a shell limitation, hence it would fail for the for loop as well.

Having said that, I always let xargs do the grunt work of splitting the argument list into manageable chunks, thus:

find ../path/ | xargs grep foo

It won't start one process per file, but one per group of files.

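You can watch the batching happen; a sketch assuming GNU xargs and seq:

seq 1 100000 | xargs echo | wc -l    # prints a handful of lines, one per batch, not 100000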

Answered by paxdiablo

Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: globs are expensive, ls on a million-file directory takes forever (20+ minutes on one of my servers), and ls * on a million-file directory takes forever and fails with an "argument list too long" error.

find /some -type f -exec some-command {} \;

seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your work across multiple threads. Here is a Python primer for scripting CLI tasks: http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR

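Staying in the shell, GNU xargs can also parallelize the grunt work itself; a minimal sketch (the -P and -n values are arbitrary, and grep stands in for "some command"):

find /some -type f -print0 | xargs -0 -P 4 -n 1000 grep -H foo
# -P 4: up to four grep processes run at once; -n 1000: at most 1000 file names per invocation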