Fastest possible grep

Warning: this page is a copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/9066609/

bash, unix, grep

Asked by pistacchio

I'd like to know if there are any tips to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of the -i option. This makes the search much faster.

Also, I've found out that the -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.

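For instance, a rough sketch of that preprocessing step (the file and directory names here are purely illustrative):

    # one-time preprocessing: lowercase every file so -i is no longer needed
    mkdir -p lower
    for f in corpus/*.txt; do
        tr '[:upper:]' '[:lower:]' < "$f" > "lower/$(basename "$f")"
    done

    # then search with a fixed string (-F) instead of a case-insensitive regex
    grep -F 'connection timeout' lower/*.txt
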
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion, or maybe make the search parallel in some way?

Answered by Chewie

Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs, GNU parallel can often speed this up.

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core and give 1000 arguments to grep.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile

Answered by daveb

If you're searching very large files, then setting your locale can really help.

GNU grep goes a lot faster in the C locale than with UTF-8.

export LC_ALL=C

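You can also set the locale per invocation instead of exporting it; a minimal sketch (the pattern and file name are illustrative):

    # bytewise matching in the C locale, for this command only
    LC_ALL=C grep -F 'needle' bigfile.txt
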
Answered by rado

Ripgrep claims to now be the fastest.

https://github.com/BurntSushi/ripgrep

It also includes parallelism by default:

 -j, --threads ARG
              The number of threads to use.  Defaults to the number of logical CPUs (capped at 6).  [default: 0]

From the README:

It is built on top of Rust's regex engine. Rust's regex engine uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.

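A minimal usage sketch (the pattern and directory are illustrative); rg searches recursively by default, and -j overrides the thread count:

    # recursive search of the given directory
    rg 'foo.*bar' src/

    # pin the number of worker threads explicitly
    rg -j 4 'foo.*bar' src/
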
Answered by Sandro Pasquali

Apparently using --mmap can help on some systems:

http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

Answered by the wanderer

Not strictly a code improvement, but something I found helpful after running grep on 2+ million files.

I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.

Answered by Alex V

If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times – once for each small file.

  1. If you have one very large file:

    parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>

  2. Many small compressed files (sorted by inode):

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>

I usually compress my files with lz4 for maximum throughput.

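For example, a rough sketch of that lz4 round-trip (assuming the lz4 CLI is installed; file names are illustrative):

    # compress once; lz4 trades ratio for very fast decompression
    lz4 -q access.log access.log.lz4

    # decompress to stdout on the fly and grep the stream
    lz4 -dc access.log.lz4 | grep -F 'needle'
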
  3. If you want just the filename with the match:

    ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"

Answered by Jinxmcg

I personally use ag (the silver searcher) instead of grep and it's way faster; you can also combine it with parallel and pipe blocks (a sketch follows the link below).

https://github.com/ggreer/the_silver_searcher

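A rough sketch of that combination, assuming ag accepts input on stdin (the pattern and file name are illustrative):

    # split a big file into 2 MB chunks and run one ag per chunk
    parallel --pipe --block 2M ag 'foo' < bigfile.txt
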
Update: I now use https://github.com/BurntSushi/ripgrep, which is faster than ag depending on your use case.

Answered by Chris

Building on the response by Sandro, I looked at the reference he provided here and played around with BSD grep vs. GNU grep. My quick benchmark results showed: GNU grep is way, way faster.

So my recommendation for the original question "fastest possible grep": make sure you are using GNU grep rather than BSD grep (which is the default on macOS, for example).

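For instance, on macOS you can install GNU grep via Homebrew, where it is exposed as ggrep; a quick comparison sketch (the pattern and file are illustrative):

    brew install grep                 # installs GNU grep as "ggrep"
    time grep -c 'needle' big.log     # BSD grep (the system default)
    time ggrep -c 'needle' big.log    # GNU grep
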
Answered by user6504312

One thing I've found to be faster when using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:

Say you have a file of IDs you want to search for, my_ids.txt, and a big file to search in, bigfile.txt.

Use split to split the file into parts:

# Use split to split the file into x number of files; consider your big file's
# size and try to stay under 26 split files so the filenames from split
# stay simple (xa[a-z]). In my example I have 10 million rows in bigfile.
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]

# Now use the split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep "$id" >> matches.txt ; done
# Here you can tune your parallel greps with -P; in my case I am being greedy.
# Also be aware that there's no point in allocating more greps than split files.

In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. I'm sure there's some sort of bell curve on efficiency here, and obviously going over the available cores won't do you any good, but for my requirements as stated above this was a much better solution than the answers above. It has the added benefit over the parallel-based approaches of mostly using native (Linux) tools.

Answered by ccpizza

A slight deviation from the original topic: the indexed-search command-line utilities from Google's codesearch project are way faster than grep: https://github.com/google/codesearch

Once you compile it (the golang package is needed), you can index a folder with:

# index current folder
cindex .

The index will be created under ~/.csearchindex

Now you can search:

# search folders previously indexed with cindex
csearch eggs

I'm still piping the results through grep to get colorized matches.

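For example, something along these lines, re-grepping csearch's output just for the highlighting (the pattern is illustrative):

    csearch eggs | grep --color=always eggs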