Is it more efficient to grep twice or use a regular expression once?

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/6040429/

Date: 2020-09-18 00:01:27  Source: igfitidea

Tags: bash, unix, grep

Asked by dtbarne

I'm trying to parse a couple of 2 GB+ files and want to grep on a couple of levels.

Say I want to fetch lines that contain "foo" and that also contain "bar".

I could do grep foo file.log | grep bar, but my concern is that running grep twice will be expensive.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?

Accepted answer by pepoluan

grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

  • Both sed and awk perform their own file reading, so there is no pipe overhead
  • The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', so only lines containing 'foo' are tested against the /bar/ regex

However, I haven't tested them. YMMV :)
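
As a quick sanity check (not a benchmark), the sketch below runs the sed and awk one-liners from this answer against a small, made-up sample file and compares their output with the two-grep pipeline. The sample lines and variable names are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; these lines are not from the original post.
tmp=$(mktemp)
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > "$tmp"

# Two greps chained with a pipe, as in the question.
pipe_out=$(grep foo "$tmp" | grep bar)

# sed: delete every line that lacks foo, then every remaining line that lacks bar.
sed_out=$(sed '/foo/!d;/bar/!d' "$tmp")

# awk: print lines where both patterns match.
awk_out=$(awk '/foo/ && /bar/' "$tmp")

printf '%s\n' "$awk_out"
rm -f "$tmp"
```

All three variables should hold the same two lines, the ones containing both words. Agreement on a toy file says nothing about speed on a 2 GB file; that still has to be measured.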

Answer by Gordon Davisson

In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

This is for several reasons. First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
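
A minimal sketch, using made-up sample lines, to confirm that the single alternation selects the same lines as the two-grep pipeline for foo and bar. It also shows one caveat worth knowing: when the two search strings can overlap inside a line (the hypothetical aba and bab below), the a.*b|b.*a form requires the matches not to overlap, so it can miss a line that the pipeline would keep.

```shell
#!/usr/bin/env bash
# Hypothetical sample data for illustration only.
tmp=$(mktemp)
printf '%s\n' 'foo bar' 'bar foo' 'foo' 'bar' 'abab' > "$tmp"

# Single grep with alternation versus two chained greps.
single=$(grep -E '(foo.*bar|bar.*foo)' "$tmp")
piped=$(grep foo "$tmp" | grep bar)

# Overlapping strings: 'abab' contains both 'aba' and 'bab', but they overlap.
# '|| true' keeps the script going when grep finds nothing and exits non-zero.
overlap_single=$(grep -c -E '(aba.*bab|bab.*aba)' "$tmp" || true)
overlap_piped=$(grep aba "$tmp" | grep -c bab || true)

echo "single regex on overlap: $overlap_single, pipeline on overlap: $overlap_piped"
rm -f "$tmp"
```

For foo and bar the two approaches agree, since those strings cannot overlap.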

Answer by Rafe Kettler

These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in whatever the first grep output. Whereas cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log. The output should be very different. Use whatever behavior you need.
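
To make the difference concrete, here is a small sketch that counts matching lines for both forms on a made-up four-line file (the sample data is an assumption, not from the original post):

```shell
#!/usr/bin/env bash
# Hypothetical sample data for illustration only.
tmp=$(mktemp)
printf '%s\n' 'foo' 'bar' 'foo bar' 'baz' > "$tmp"

# AND: lines that contain foo and also contain bar.
and_count=$(grep foo "$tmp" | grep -c bar)

# OR: lines that contain foo or bar (or both).
or_count=$(grep -c -E '(foo|bar)' "$tmp")

echo "AND: $and_count, OR: $or_count"   # prints "AND: 1, OR: 3"
rm -f "$tmp"
```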

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.

Answer by David W.

If you're doing this:

cat file.log | grep foo | grep bar

You're only printing lines that contain both foo and bar in any order. If this is your intention:

grep -e "foo.*bar" -e "bar.*foo" file.log

This will be more efficient, since grep only has to parse the input once.

Notice that I don't need the cat, which is more efficient in itself. You rarely need cat unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe or, if you have a command like tr that doesn't accept a file name, redirect the input like this:

tr 'a-z' 'A-Z' < "$fileName"
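
A minimal runnable sketch of that redirection, assuming a throwaway temporary file in place of a real log. The character sets are single-quoted so the shell passes them to tr literally, and the file name is double-quoted so paths containing spaces still work.

```shell
#!/usr/bin/env bash
# Hypothetical temporary file standing in for a real input file.
fileName=$(mktemp)
printf 'hello world\n' > "$fileName"

# tr reads only standard input, so the file is supplied by redirection, not cat.
upper=$(tr 'a-z' 'A-Z' < "$fileName")

echo "$upper"   # prints "HELLO WORLD"
rm -f "$fileName"
```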

But, enough about useless cats. I have two at home.

You can pass multiple regular expressions to a single grep, which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:

fgrep "foo" file.log | fgrep "bar"

Unlike grep, fgrep doesn't parse regular expressions, which means it can process lines much, much faster. Try this:

time fgrep "foo" file.log | fgrep "bar"

and

time grep -e "foo.*bar" -e "bar.*foo" file.log

And see which is faster.
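
If you don't have a large log handy, the comparison can be sketched on synthetic data. The script below is only an illustration under stated assumptions: the file contents, the 200,000-line size, and the variable names are made up, and it uses grep -F, the POSIX spelling of fgrep (recent GNU grep releases print an obsolescence warning for fgrep). It assumes bash, where time is a keyword that covers the whole pipeline.

```shell
#!/usr/bin/env bash
# Build a hypothetical 200,000-line test file: one third of the lines contain
# both words, one third contain only foo, and one third contain neither.
log=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 200000; i++) {
    if (i % 3 == 0)      print "line " i " foo and bar"
    else if (i % 3 == 1) print "line " i " foo only"
    else                 print "line " i " nothing"
  }
}' > "$log"

echo "fixed-string pipeline:"
time grep -F "foo" "$log" | grep -F "bar" > /dev/null

echo "single grep with two regexes:"
time grep -e "foo.*bar" -e "bar.*foo" "$log" > /dev/null

# Confirm both approaches select the same number of lines.
n_pipe=$(grep -F "foo" "$log" | grep -c -F "bar")
n_regex=$(grep -c -e "foo.*bar" -e "bar.*foo" "$log")
echo "pipeline: $n_pipe lines, single regex: $n_regex lines"
rm -f "$log"
```

Timings vary with the grep implementation, locale, and disk cache, so run each command several times on your real data before drawing conclusions.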
