Linux: grep a large list against a large file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA, cite the original address, and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/19380925/
grep a large list against a large file
Asked by leifg
I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3,000,000 lines).
I want all the csv lines that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
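  # each iteration re-reads the entire 3,000,000-line huge.csv, once per id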
  cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Accepted answer by devnull
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
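Combining both flags, a minimal sketch using the filenames from the question:

grep -F -f the_ids.txt huge.csv > output_file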
Answer by fedorqui 'SO stop harming'
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
   -f FILE, --file=FILE
          Obtain patterns from FILE, one per line.  The empty file contains
          zero patterns, and therefore matches nothing.  (-f is specified by
          POSIX.)
If you provide some sample input, maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv 
hello this is 11 but
and here 23
Answer by codeforester
grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind (a combined sketch follows the list below):
- use the -x option if there is a need to match the entire line in the second file
- use -F if the first file has fixed strings, not patterns
- use -w to prevent partial matches while not using the -x option
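For example, a minimal sketch combining all three flags for a fixed-string, whole-line filter (assuming each line of filter.txt is a complete literal line to match):

grep -xFf filter.txt data.txt > matching.txt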
This post has a great discussion on this topic (grep -f on large files):
And this post talks about grep -vf:
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
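A note on why the awk idiom scales: FNR==NR is true only while awk reads the first file, so those lines (or fields) become keys of an in-memory hash; each line of the second file is then tested with a constant-time lookup. One pass over each file replaces the repeated scanning that makes a huge grep -f pattern list slow.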
