Linux: grep a large list against a large file

Note: this content comes from a popular StackOverflow Q&A and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/19380925/


grep a large list against a large file

Tags: linux, shell, unix, awk, grep

Asked by leifg

I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3,000,000 lines).

I want all the csv lines that contain an id from the id file.

My naive approach was:


cat the_ids.txt | while read line
do
  cat huge.csv | grep $line >> output_file
done

But this takes forever!


Are there more efficient approaches to this problem?


Accepted answer by devnull

Try


grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
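
Combining both flags, the accepted approach would look like this (a sketch reusing the file names from the question):

grep -F -f the_ids.txt huge.csv > output_file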

Answered by fedorqui 'SO stop harming'

Use grep -f for this:

grep -f the_ids.txt huge.csv > output_file

From man grep:


-f FILE, --file=FILE

Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)


If you provide some sample input maybe we can even improve the grep condition a little more.

Test


$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye

$ grep -f ids huge.csv 
hello this is 11 but
and here 23

Answered by codeforester

grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines and hence isn't the best choice for such a situation. Even while using grep -f, we need to keep a few things in mind (a combined sketch follows the list):

  • use the -x option if there is a need to match the entire line in the second file
  • use -F if the first file has strings, not patterns
  • use -w to prevent partial matches while not using the -x option
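
For example, the flags combine directly (a sketch; filter.txt holds the patterns, data.txt is the file being searched):

grep -Fxf filter.txt data.txt    # fixed strings, whole-line matches
grep -Fwf filter.txt data.txt    # fixed strings, word-boundary matches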

This post has a great discussion on this topic (grep -f on large files):

And this post talks about grep -vf:



In summary, the best way to handle grep -f on large files is:

Matching entire line:

匹配整行:

awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):


awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
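
A quick sanity check of the field-2 variant, using hypothetical files (bare ids in filter.txt, the id in column 2 of data.txt):

$ cat filter.txt
11
23
$ cat data.txt
a,11,foo
b,99,bar
c,23,baz
$ awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt
a,11,foo
c,23,baz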

and for grep -vf:


Matching entire line:


awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using ',' delimiter and field 2 in this example):


awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
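
For reference, when filter.txt is small enough for grep, the plain inverse of the whole-line match would be (standard GNU grep flags):

grep -vFxf filter.txt data.txt > not_matching.txt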

Answered by Dr. Alex RE

You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:

ugrep -F -f the_ids.txt huge.csv

This works with GNU grep too, but I expect ugrep to run several times faster.

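
Assuming ugrep's GNU grep-compatible option set, -w can also be added to avoid partial matches on short ids, with output redirected as before:

ugrep -F -w -f the_ids.txt huge.csv > output_file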