Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me) on StackOverflow. Original question: http://stackoverflow.com/questions/13913014/

Grepping a huge file (80GB) any way to speed it up?

Tags: bash, grep

Asked by zzapper

 grep -i -A 5 -B 5 'db_pd.Clients'  eightygigsfile.sql

This has been running for an hour on a fairly powerful Linux server which is otherwise not overloaded. Any alternative to grep? Anything about my syntax that can be improved (would egrep or fgrep be better)?

The file is actually in a directory which is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?

the grep is grabbing up to 93% CPU

Answered by dogbane

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk.

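For example, a rough sketch of the RAM-disk approach, assuming the machine has well over 80 GB of free RAM and a tmpfs mounted at /dev/shm (both are assumptions; adjust the path for your system):

cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql
rm /dev/shm/eightygigsfile.sql    # free the memory again when done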

Answered by Steve

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'

It's not entirely clear from your question, but other options for grep include (a combined sketch follows this list):

  • Dropping the -i flag.
  • Using the -F flag for a fixed string.
  • Disabling NLS with LANG=C.
  • Setting a max number of matches with the -m flag.
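
As a hedged sketch, the options above combined into a single plain grep call (dropping -i on the assumption that the exact case of db_pd.Clients is known; the -m limit of 100 is an arbitrary example value):

LANG=C grep -F -m 100 -C 5 'db_pd.Clients' eightygigsfile.sql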

Answered by BeniBela

Some trivial improvements:

  • Remove the -i option if you can; case-insensitive matching is quite slow.

  • Replace the . with \.

    A single dot is the regex symbol that matches any character, which is also slow (see the example after this list).

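For example (a sketch of the second point; with -F/fgrep the pattern is taken literally, so no escaping is needed):

grep -A 5 -B 5 'db_pd\.Clients' eightygigsfile.sql    # \. matches a literal dot only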

Answered by Eugen Rieck

Two lines of attack:

  • Are you sure you need the -i, or is there a way to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start several instances at different offsets (see the sketch after this list).
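
One way to sketch the "several greps at different offsets" idea without computing the offsets by hand is GNU parallel's --pipepart mode, which splits the file on disk into chunks and runs one grep per chunk (the 100M block size and the switch to -F are assumptions on top of this answer; note that -A/-B/-C context spanning a chunk boundary may be incomplete):

parallel --pipepart -a eightygigsfile.sql --block 100M -k grep -F -C 5 'db_pd.Clients'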

Answered by user584583

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.

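A short sketch of the multi-string case (strings.txt and the second pattern db_pd.Orders are hypothetical examples, not from the question):

printf '%s\n' 'db_pd.Clients' 'db_pd.Orders' > strings.txt    # one fixed string per line
LC_ALL=C grep -F -f strings.txt -C 5 eightygigsfile.sql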