Linux: grep a large list against a large file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, citing the original URL and attributing it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/19380925/
grep a large list against a large file
Asked by leifg
I am currently trying to grep a large list of ids (~5000) against an even larger CSV file (3,000,000 lines).
I want all the CSV lines that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
  # one full scan of the 3,000,000-line file per id: ~5000 passes in total
  cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Accepted answer by devnull
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
       Interpret PATTERN as a list of fixed strings, separated by
       newlines, any of which is to be matched. (-F is specified by
       POSIX.)
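Combining both suggestions for the files in the question (a straightforward combination of the answer's advice, not a separate command from the original post):

grep -F -f the_ids.txt huge.csv > output_file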
Answered by fedorqui 'SO stop harming'
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
       Obtain patterns from FILE, one per line. The empty file contains
       zero patterns, and therefore matches nothing. (-f is specified by
       POSIX.)
If you provide some sample input, maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
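One caveat worth noting (my addition, not part of the original answer): grep -f matches substrings, so an id like 1 in the ids file would also match the line containing 11. If ids should only match as whole words, -w prevents that:

grep -wf ids huge.csv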
Answered by codeforester
grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind:
- use the -x option if there is a need to match the entire line in the second file
- use -F if the first file has strings, not patterns
- use -w to prevent partial matches while not using the -x option (these flags can be combined; see the sketch after this list)
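As a concrete illustration of combining these flags (my sketch, not a command from the original answer), fixed-string whole-line matching would be:

grep -Fxf filter.txt data.txt > matching.txt

and fixed-string whole-word matching would be grep -Fwf filter.txt data.txt.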
There is a post with a great discussion on this topic (grep -f on large files), and another that talks about grep -vf.
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
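To unpack the idiom (my annotation, not part of the original answer): FNR==NR is true only while awk reads the first file, so the filter lines are first loaded as keys of the hash array, and every line of the second file is then tested against those keys. An expanded, commented equivalent:

awk '
  FNR == NR {      # true only while reading the first file (filter.txt)
    hash[$0]       # record each filter line as an array key
    next           # skip the rule below for filter lines
  }
  $0 in hash       # for data.txt lines: print those stored in the hash
' filter.txt data.txt > matching.txt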
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

and for grep -vf:
Matching entire line:
匹配整行:
##代码##Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
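Mapped onto the question's files, the same approach would look like the sketch below. It assumes each id occupies a whole comma-separated field of huge.csv, here the first one; where the id actually sits is an assumption on my part, since the question does not say:

awk -F, 'FNR==NR {hash[$0]; next} $1 in hash' the_ids.txt huge.csv > output_file

Unlike the grep-based loop in the question, this makes a single pass over each file no matter how many ids there are.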