bash - How to grep a large number of files?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/23572878/

How to grep a large number of files?

Tags: bash, grep

Asked by upendra

I am trying to grep 40k files in the current directory and I am getting this error.

for i in $(cat A01/genes.txt); do grep $i *.kaks; done > A01/A01.result.txt
-bash: /usr/bin/grep: Argument list too long

How does one normally grep thousands of files?

Thanks Upendra

Answered by David W.

This makes David sad...

Everyone so far is wrong (except for anubhava).

Shell scripting is not like any other programming language because much of the interpretation of lines comes from the power of the shell interpolating them before the command is actually executed.

Let's take something simple:

$ set -x
$ ls
+ ls
bar.txt foo.txt fubar.log
$ echo The text files are *.txt
echo The text files are *.txt
> echo The text files are bar.txt foo.txt
The text files are bar.txt foo.txt
$ set +x
$

The set -x allows you to see how the shell actually interpolates the glob and then passes that back to the command as input. The > points to the line that is actually being executed by the command.

You can see that the echo command isn't interpreting the *. Instead, the shell grabs the * and replaces it with the names of the matching files. Then, and only then, does the echo command actually execute.

When you have 40K-plus files and you do grep *, you're expanding that * into the names of those 40,000-plus files before grep even has a chance to execute, and that's where the error message /usr/bin/grep: Argument list too long comes from.

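The limit being hit is the kernel's cap on the total size of the argument list (plus environment) that can be handed to a new process. As a rough, hedged illustration (the value printed is a typical Linux figure; yours will differ), you can look the limit up and reproduce the error without needing 40,000 real files, since $(seq 1 400000) expands to well over two megabytes of arguments:

$ getconf ARG_MAX
2097152
$ grep foo $(seq 1 400000)
-bash: /usr/bin/grep: Argument list too long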

Fortunately, Unix has a way around this dilemma:

$ find . -name "*.kaks" -type f -maxdepth 1 | xargs grep -f A01/genes.txt

The find . -name "*.kaks" -type f -maxdepth 1 will find all of your *.kaks files, and the -maxdepth 1 will only include files in the current directory. The -type f makes sure you only pick up files and not directories.

The find command pipes the names of the files into xargs, and xargs appends those names to the grep -f A01/genes.txt command. However, xargs has a trick up its sleeve. It knows how long the command-line buffer is, and will execute the grep when that buffer is full, then pass another batch of files to grep. This way, grep gets executed maybe three or ten times (depending upon the size of the command-line buffer), and all of our files are used.

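You can watch this batching on a small scale. The following is only an illustrative sketch with echo standing in for grep; -n 2 forces artificially tiny batches of two arguments so the splitting is visible, whereas left to itself xargs packs in as many names as the buffer allows:

$ printf '%s\n' one two three four five | xargs -n 2 echo
one two
three four
five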

Unfortunately, xargs uses whitespace as a separator for the file names. If your files contain spaces or tabs, you'll have trouble with xargs. Fortunately, there's another fix:

$ find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt

The -print0 will cause find to print out the names of the files separated not by newlines but by the NUL character. The -0 parameter for xargs tells xargs that the file separator isn't whitespace but the NUL character. This fixes the issue.

You could also do this:

$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;

This will execute grep for each and every file found, instead of what xargs does, which is to run grep only as many times as needed to cover all the files it can stuff onto the command line. The advantage of this is that it avoids shell interference entirely. However, it may or may not be less efficient.

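A middle ground, not mentioned in the answer above but standard in POSIX find, is -exec ... {} +, which batches file names onto the command line much like xargs does while still avoiding both the shell's glob expansion and any whitespace trouble:

$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} +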

What would be interesting is to experiment and see which one is more efficient. You can use time to see:

$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;

This will execute the command and then tell you how long it took. Try it with the -exec and with xargs and see which is faster. Let us know what you find.

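For the comparison, the xargs variant can be timed the same way; in bash the time keyword measures the whole pipeline. Redirecting to /dev/null (an optional tweak, not part of the original answer) keeps the matches themselves from flooding the terminal while you compare runtimes:

$ time find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt > /dev/null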

Answered by anubhava

You can combine find with grep like this:

find . -maxdepth 1 -name '*.kaks' -exec grep -H -f A01/genes.txt '{}' \; > A01/A01.result.txt

Answered by zmo

You can use the recursive feature of grep:

for i in $(cat A01/genes.txt); do 
    grep -r $i .
done > A01/A01.result.txt

Though if you want to select only .kaks files:

for i in $(cat A01/genes.txt); do 
    find . -iregex '.*\.kaks$' -exec grep "$i" {} \;
done > A01/A01.result.txt

Answered by Mark Setchell

Put another for loop inside your outer one:

for f in *.kaks; do
   grep -H  $i "$f"
done

By the way, are you interested in finding EVERY occurrence in each file, or merely whether the search string exists there one or more times? If it is "good enough" to know the string occurs one or more times, you can specify "-m 1" to grep and it will not bother reading/searching the rest of the file after finding the first match, which could potentially save lots of time.

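Putting that inner loop together with the outer loop from the question, and adding -m 1 (supported by GNU and BSD grep, so treat that flag as an assumption about your grep) to stop reading each file after its first match, might look like this sketch:

for i in $(cat A01/genes.txt); do
    for f in *.kaks; do
        grep -m 1 -H "$i" "$f"
    done
done > A01/A01.result.txt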

Answered by Scientist

The following solution has worked for me:

Problem:

 grep -r "example\.com" *
 -bash: /bin/grep: Argument list too long

Solution:

grep -r "example\.com" .

["In newer versions of grep you can omit the ".", as the current directory is implied."]

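If your grep is GNU grep, its --include option (an assumption here, not part of the quoted answer) lets the recursive form address the original question directly, limiting the search to *.kaks files and reading the patterns from genes.txt; note that unlike the find -maxdepth 1 variants above, this also descends into subdirectories:

grep -r --include='*.kaks' -f A01/genes.txt . > A01/A01.result.txt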

Source: Reinlick, J. https://www.saotn.org/bash-grep-through-large-number-files-argument-list-too-long/
