Linux: grep a large list against a large file

Note: this content comes from a popular StackOverflow Q&A and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/19380925/


grep a large list against a large file

Tags: linux, shell, unix, awk, grep

Asked by leifg

I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3,000,000 lines).

I want all the csv lines that contain an id from the id file.

My naive approach was:


cat the_ids.txt | while read line
do
  cat huge.csv | grep $line >> output_file
done

But this takes forever!


Are there more efficient approaches to this problem?


Accepted answer by devnull

Try


grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
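
Combining both flags, the accepted approach would look like this (a sketch reusing the file names from the question):

grep -F -f the_ids.txt huge.csv > output_file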

Answered by fedorqui 'SO stop harming'

Use grep -f for this:

grep -f the_ids.txt huge.csv > output_file

From man grep:


-f FILE, --file=FILE

Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)


If you provide some sample input maybe we can even improve the grep condition a little more.

Test


$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye

$ grep -f ids huge.csv 
hello this is 11 but
and here 23

Answered by codeforester

grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind (a combined sketch follows the list):

  • use the -x option if there is a need to match the entire line in the second file
  • use -F if the first file has strings, not patterns
  • use -w to prevent partial matches while not using the -x option
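
For example, the flags combine directly (a sketch; filter.txt holds the patterns, data.txt is the file being searched):

grep -Fxf filter.txt data.txt    # fixed strings, whole-line matches
grep -Fwf filter.txt data.txt    # fixed strings, word-boundary matches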

This post has a great discussion on this topic (grep -f on large files):

And this post talks about grep -vf:



In summary, the best way to handle grep -f on large files is:

Matching entire line:

匹配整行:

awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):


awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
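
A quick sanity check of the field-2 variant, using hypothetical files (bare ids in filter.txt, the id in column 2 of data.txt):

$ cat filter.txt
11
23
$ cat data.txt
a,11,foo
b,99,bar
c,23,baz
$ awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt
a,11,foo
c,23,baz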

and for grep -vf:


Matching entire line:


awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):


awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
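
For reference, when filter.txt is small enough for grep, the plain inverse of the whole-line match would be (standard GNU grep flags):

grep -vFxf filter.txt data.txt > not_matching.txt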

Answered by Dr. Alex RE

You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:

ugrep -F -f the_ids.txt huge.csv

This works with GNU grep too, but I expect ugrep to run several times faster.

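
Assuming ugrep's GNU grep-compatible option set, -w can also be added to avoid partial matches on short ids, with output redirected as before:

ugrep -F -w -f the_ids.txt huge.csv > output_file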