在 Bash 中从另一个较大文件中查找文件行的最快方法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/42239179/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Fastest way to find lines of a file from another larger file in Bash
提问by codeforester
I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.
我有两个文件：file1.txt 和 file2.txt。file1.txt 大约有 14K 行，file2.txt 大约有 20 亿行。file1.txt 每行只有一个字段 f1，而 file2.txt 每行有 3 个字段（f1 到 f3），以 | 分隔。
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).
我想找出 file2.txt 中所有满足以下条件的行：file1.txt 的 f1 与 file2.txt 的 f2 匹配（如果不想花额外时间拆分 file2.txt 的字段，也可以是匹配行中的任意位置）。
file1.txt (about 14K lines, not sorted):
file1.txt(约 14K 行,未排序):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
file2.txt(约 20 亿行,未排序):
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Output expected:
预期输出:
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried and it seems to be taking several hours to run:
这是我尝试过的,似乎需要几个小时才能运行:
fgrep -F -f file1.txt file2.txt > file.matched
I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.
我想知道是否有更好更快的方法来使用常见的 Unix 命令或使用小脚本来执行此操作。
采纳答案by codeforester
A small piece of Perl code solved the problem. This is the approach taken:
一小段 Perl 代码解决了这个问题。这是采取的方法:
- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line

- 将 file1.txt 的行存入一个哈希表
- 逐行读取 file2.txt，解析并提取第二个字段
- 检查提取的字段是否在哈希表中；如果在，就打印该行
Here is the code:
这是代码:
#!/usr/bin/perl -w
use strict;
if (scalar(@ARGV) != 2) {
printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
exit(2);
}
my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);
open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;
# store contents of small file in a hash
while (<$small_fp>) {
chomp;
$small_hash{$_} = undef;
}
close($small_fp);
# loop through big file and find matches
while (<$big_fp>) {
# no need for chomp
$field = (split(/\|/, $_))[1];
if (defined($field) && exists($small_hash{$field})) {
printf("%s", $_);
}
}
close($big_fp);
exit(0);
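For reference, a minimal way to invoke the script above (assuming it is saved as fgrep.pl, which is just a name chosen here) would be:

chmod +x fgrep.pl
./fgrep.pl file1.txt file2.txt > file.matched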
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:
我用 file1.txt 中的 14K 行和 file2.txt 中的 1.3M 行运行了上面的脚本。它在大约 13 秒内完成，产生了 126K 个匹配项。下面是相应的 time 输出：
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk code:
我运行了 @Inian 的 awk 代码：
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
它比 Perl 解决方案慢得多，因为它为 file2.txt 中的每一行循环了 14K 次——这真的很昂贵。它在处理了 file2.txt 的 592K 条记录并产生 40K 匹配行后中止。下面是它所花的时间：

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real 55m5.539s
user 54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:
使用 @Inian 的另一个 awk 解决方案（它消除了循环问题）：

time awk -F '|' 'FNR==NR{hash[$1]; next} $2 in hash' file1.txt FS='|' file2.txt > awk1.out

real 0m39.966s
user 0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next} $2 in hash' file1.txt FS='|' file2.txt > awk.out

real 0m41.057s
user 0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.
考虑到我们不必为此编写一个完整的程序，awk 在这里的表现非常令人印象深刻。

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:
我也运行了 @oliv 的 Python 代码。完成这项工作大约需要 15 个小时，看起来产生了正确的结果。构建一个巨大的正则表达式不如使用哈希查找高效。下面是 time 输出：

real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with fgrep: memory exhausted error, even with very small block sizes.
我尝试按照建议使用 parallel。但是，即使块大小非常小，它也会因 fgrep: memory exhausted 错误而失败。
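For context, the kind of parallel invocation being attempted here looks roughly like the sketch below; the 10M block size is only an example value, and @gregory's answer further down gives the full form:

parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt > file.matched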
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.
令我惊讶的是，fgrep 完全不适合这个任务。我在 22 小时后中止了它，它只产生了大约 10 万个匹配项。我希望 fgrep 有一个选项，可以强制把 -f file 的内容保存在哈希表中，就像 Perl 代码所做的那样。
I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
我没有尝试 join 方法 - 我不想承担对文件排序的额外开销。此外，鉴于 fgrep 性能不佳，我认为 join 不会比 Perl 代码做得更好。
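For completeness, a rough sketch of what that join route would involve is shown below; the sort steps are the overhead being avoided, and the join command itself is the one given in @GeorgeVasiliou's answer near the end of this page:

# sort the small file on its single field and the big file on field 2
sort file1.txt > file1.sorted.txt
sort -t'|' -k2,2 file2.txt > file2.sorted.txt

# join file1 field 1 against file2 field 2 and print file2's three fields
join --nocheck-order -1 1 -2 2 -t'|' -o 2.1 2.2 2.3 file1.sorted.txt file2.sorted.txt > file.matched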
Thanks everyone for your attention and responses.
感谢大家的关注和回复。
回答by zdim
A Perl solution. [See Notebelow.]
Perl 解决方案。[见下面的注释。]
Use a hash for the first file. As you read the big file line-by-line, extract the field by regex (captures the first pattern between ||) or split (gets the second word) and print if it exists. They likely differ in speed a bit (time them). The defined check isn't needed in the regex, while for split use // (defined-or) that short-circuits.
对第一个文件使用哈希表。逐行读取大文件时，用正则表达式提取字段（捕获 || 之间的第一个模式），或者用 split（取第二个字段），如果它 exists 就打印。两者的速度可能略有不同（可以分别计时）。正则表达式版本不需要 defined 检查，而 split 版本使用会短路的 //（defined-or）。
use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile\n" if @ARGV != 2;

my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;

open $fh, '<', $bigfile or die "Can't open $bigfile: $!";
while (<$fh>)
{
    exists $word{ (/\|([^|]+)/)[0] } && print;
    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;

#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;
use Data::Printer;
use Getopt::Long;
GetOptions ("num_lines=i" => \my $nlines )
or die("Error in command line arguments\n");
# Generated random words from site: https://www.randomlists.com/random-words
my $word_filename = 'words.txt'; # 75 random words
my $num_match_words = 5;
my $num_file2_lines = $nlines || 1_000_000;
my $file2_words_per_line = 3;
my $file2_match_field_no = 2;
my $file1_filename = 'file1.txt';
my $file2_filename = 'file2.txt';
my $file1_regex_fn = 'regexp1.txt';
say "generating $num_file2_lines lines..";
my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );
write_file1( $file1_filename, $words2 );
write_file2(
$file2_filename, $words1, $words2, $num_file2_lines,
$file2_words_per_line, $file2_match_field_no
);
write_BOC_regexp_file( $file1_regex_fn, $words2 );
sub write_BOC_regexp_file {
my ( $fn, $words ) = @_;
open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
print $fh '\|' . (join "|", @$words) . '\|';
close $fh;
}
sub write_file2 {
my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;
my $nwords1 = scalar @$words1;
my $nwords2 = scalar @$words2;
my @lines;
for (1..$nlines) {
my @words_line;
my $key;
for (1..$words_per_line) {
my $word;
if ( $_ != $field_no ) {
my $index = int (rand $nwords1);
$word = @{ $words1 }[$index];
}
else {
my $index = int (rand($nwords1 + $nwords2) );
if ( $index < $nwords2 ) {
$word = @{ $words2 }[$index];
}
else {
$word = @{ $words1 }[$index - $nwords2];
}
$key = $word;
}
push @words_line, $word;
}
push @lines, [$key, (join "|", @words_line)];
}
@lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;
open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
print $fh (join "\n", @lines);
close $fh;
}
sub write_file1 {
my ( $fn, $words ) = @_;
open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
print $fh (join "\n", sort @$words);
close $fh;
}
sub get_words {
my ( $fn, $N ) = @_;
open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my @words = map {chomp $_; $_} <$fh>;
close $fh;
my @words1 = @words[$N..$#words];
my @words2 = @words[0..($N - 1)];
return ( \@words1, \@words2 );
}
Avoiding the if branch and using short-circuit is faster, but only very little. On billions of lines these tweaks add up, but again not to too much. It may (or may not) be a tad bit faster to read the small file line by line, instead of in list context as above, but this should not be noticeable.
避免if
分支和使用短路会更快,但只是很少。在数十亿条线路上,这些调整加起来但又不会太多。逐行读取小文件可能(也可能不会)快一点,而不是像上面那样在列表上下文中读取,但这不应该引起注意。
Update  Writing to STDOUT saves two operations and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed to write to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get a touch better runtimes with it, while it also conveys the purpose better. Altogether I am leaving it in. Thanks to ikegami for comments.
更新　写入 STDOUT 省掉两个操作，我反复计时，它比写入文件略快。这种用法也与大多数 UNIX 工具一致，所以我改为写入 STDOUT。其次，exists 测试并不是必需的，去掉它可以省一次操作；但保留它时我得到的运行时间始终略好一些，而且它也更清楚地表达了意图。总之我保留了它。感谢 ikegami 的评论。
Note  The commented-out version is about 50% faster than the other, by my benchmark below. These are both given because they are different, one finding the first match and the other the second field. I am keeping it this way as a more generic choice, since the question is ambiguous on that.
注意　根据我下面的基准测试，被注释掉的版本比另一个版本快约 50%。两者都给出是因为它们并不相同：一个找第一个匹配，另一个取第二个字段。由于问题在这一点上并不明确，我保留这种写法作为更通用的选择。
Some comparisons (benchmark) [Updated for writing to STDOUT, see "Update" above]
一些比较（基准测试）[已更新为写入 STDOUT，请参阅上面的"更新"]
There is an extensive analysis in the answer by Håkon Hægland, timing one run of most solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, expected to be fast and used in the question and in many answers.
Håkon Hægland 的答案中有一个详尽的分析，对大多数解决方案各运行一次进行了计时。这里是另一种做法：对上面的两个解决方案、OP 自己的答案以及贴出的 fgrep 方案进行基准测试——后者被认为很快，并在问题和许多答案中使用。
I build test data in the following way. A handful of lines of the length roughly as shown are made with random words, for both files, so to match in the second field. Then I pad this "seed" for data samples with lines that won't match, so to mimic ratios between sizes and matches quoted by OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. Then these samples are written repeatedly to build full data files as OP's, shuffle-d each time using List::Util.
我按以下方式构建测试数据。为两个文件用随机单词生成几行（长度大致如上所示），使其在第二个字段上匹配。然后我用不会匹配的行来填充这个数据样本"种子"，以模拟 OP 给出的文件大小与匹配数之间的比例：小文件 14K 行对应大文件 1.3M 行，产生 126K 个匹配。然后重复写入这些样本以构建与 OP 规模相当的完整数据文件，每次都用 List::Util 的 shuffle 打乱。
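A rough shell-level equivalent of that padding-and-shuffling step, using coreutils instead of List::Util (the seed file name and repeat count here are assumptions, not from the original benchmark):

# repeat the seed sample until the target size is reached, then shuffle the result
for i in $(seq 1 1000); do cat seed_big.txt; done > file2.unshuffled.txt
shuf file2.unshuffled.txt > file2.txt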
All runs compared below produce 106_120 matches for the above file sizes (diff-ed for a check), so the matching frequency is close enough. They are benchmarked by calling complete programs using my $res = timethese(60 ...). The result of cmpthese($res) on v5.16 is
下面比较的所有运行对上述文件大小都产生 106_120 个匹配（已用 diff 核对），因此匹配频率足够接近。它们通过 my $res = timethese(60 ...) 调用完整程序进行基准测试。cmpthese($res) 在 v5.16 上的结果是

        Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --
$ tree solutions/
solutions/
├── BOC1
│?? ├── out.txt
│?? └── run.sh
├── BOC2
│?? ├── out.txt
│?? └── run.sh
├── codeforester
│?? ├── out.txt
│?? ├── run.pl
│?? └── run.sh
[...]
The fact that the optimized C-program fgrep
comes on top isn't surprising. The lag of "regex" behind "split" may be due to the overhead of starting the engine for little matches, manytimes. This may vary over Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ("cfor") since it was claimed to be fastest, and its 20%
lag behind the very similar "split" is likely due to scattered small inefficiencies (see a comment below this answer).†
优化的 C 程序fgrep
居于首位这一事实并不奇怪。“滞后正则表达式”背后的“分裂”可能是由于发动机起动的小比赛,开销很多倍。鉴于不断发展的正则表达式引擎优化,这可能因 Perl 版本而异。我有@codeforester(“的答案CFOR因为它自称是最快的,其‘)20%
背后的非常相似的’滞后分裂”可能是由于分散的小的低效率(见这个答案下面留言)。†
This isn't shatteringly different, while there are sure variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep
was indeed an order of magnitude faster.
这并没有太大的不同,虽然硬件和软件以及数据细节之间肯定存在差异。我在不同的 Perls 和机器上运行它,显着的区别是在某些情况下fgrep
确实快了一个数量级。
The OP's experience of very slow fgrep
is surprising. Given their quoted run times, order of magnitude slower than the above, I'd guess that there's an old system to "blame."
OP 非常缓慢的经历fgrep
令人惊讶。考虑到他们引用的运行时间比上面的慢了一个数量级,我猜想“归咎于”一个旧系统。
Even though this is completely I/O based there are concurrency benefits from putting it on multiple cores and I'd expect a good speedup, up to a factor of a few.
尽管这完全是基于 I/O 的,但是将它放在多个内核上还是有并发优势的,我希望有一个很好的加速,最高可达几倍。
†Alas, the comment got deleted (?). In short: unneeded use of a scalar (costs), of an if
branch, of defined
, of printf
instead of print
(slow!). These matter for efficiency on 2 billion lines.
†唉,评论被删除了(?)。简而言之:不需要使用标量(成本)、if
分支、of defined
、ofprintf
而不是print
(慢!)。这些对于 20 亿条线路的效率很重要。
回答by H?kon H?gland
I have tried to do a comparison between some of the methods presented here.
我试图对这里介绍的一些方法进行比较。
First I created a Perl script to generate the input files file1.txt
and file2.txt
. In order to compare some of the solutions, I made sure that the the words from file1.txt
only could appear in the second field in file2.txt
. Also to be able to use the join
solution presented by @GeorgeVasiliou, I sorted file1.txt
and file2.txt
. Currently I generated the input files based on only 75 random words ( taken from https://www.randomlists.com/random-words). Only 5 of these 75 words was used in file1.txt
the remaining 70 words was used to fill up the fields in file2.txt
. It might be necessary to increase the number of words substantially to get realistic results ( according to the OP, the original file1.txt
contained 14000 words). In the tests below I used a file2.txt
with 1000000 ( 1 million ) lines. The script also generates the file regexp1.txt
required by the grep solution of @BOC.
首先,我创建了一个 Perl 脚本来生成输入文件file1.txt
和file2.txt
. 为了比较一些解决方案,我确保单词 from file1.txt
only 可以出现在file2.txt
. 另外为了能够使用join
@GeorgeVasiliou 提供的解决方案,我对file1.txt
和file2.txt
. 目前我仅基于 75 个随机单词生成了输入文件(取自https://www.randomlists.com/random-words)。这 75 个单词中仅使用了 5 个,file1.txt
其余 70 个单词用于填写 中的字段file2.txt
。可能需要大量增加单词数才能获得真实的结果(根据 OP,原始file1.txt
包含 14000 个单词)。在下面的测试中,我使用了file2.txt
有 1000000(100 万)行。该脚本还会生成regexp1.txt
@BOC 的grep 解决方案所需的文件。
gen_input_files.pl:
gen_input_files.pl:
grep -E -f regexp1.txt file2.txt
Next, I created a sub folder solutions
with all the test cases:
接下来,我创建了一个solutions
包含所有测试用例的子文件夹:
LC_ALL=C grep -E -f regexp1.txt file2.txt
Here the files out.txt
is the output from the greps for each solution. The scripts run.sh
runs the solution for the given test case.
这里的文件out.txt
是每个解决方案的 grep 输出。脚本run.sh
运行给定测试用例的解决方案。
Notes on the different solutions
不同解决方案的注意事项
- BOC1: First solution presented by @BOC:
  grep -E -f regexp1.txt file2.txt
- BOC2: Second solution suggested by @BOC:
  LC_ALL=C grep -E -f regexp1.txt file2.txt
- codeforester: Accepted Perl solution by @codeforester (see source)
- codeforester_orig: Original solution presented by @codeforester:
  fgrep -F -f file1.txt file2.txt
- dawg: Python solution using dictionary and split line proposed by @dawg (see source)
- gregory1: solution using Gnu Parallel suggested by @gregory:
  parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt
  See the note below regarding how to choose $block_size.
- hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compilation of the c-extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt changes. Note: The time used to compile the c-extension at the initial run is not included in the run times presented below.
- ikegami: Solution using an assembled regexp and grep -P as given by @ikegami. Note: The assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:
  regexp=$(< "regexp_ikegami.txt")
  grep -P "$regexp" file2.txt
- inian1: First solution by @Inian using match():
  awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
- inian2: Second solution by @Inian using index():
  awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt
- inian3: Third solution by @Inian checking only the $2 field:
  awk 'FNR==NR{hash[$1]; next} $2 in hash' file1.txt FS='|' file2.txt
- inian4: 4th solution by @Inian (basically the same as codeforester_orig with LC_ALL):
  LC_ALL=C fgrep -F -f file1.txt file2.txt
- inian5: 5th solution by @Inian (same as inian1 but with LC_ALL):
  LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
- inian6: Same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.
- jjoao: Compiled flex-generated C code as proposed by @JJoao (see source). Note: Recompilation of the executable must be done each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
- oliv: Python script provided by @oliv (see source)
- Vasiliou: Using join as suggested by @GeorgeVasiliou:
  join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt
- Vasiliou2: Same as Vasiliou but with LC_ALL=C.
- zdim: Using the Perl script provided by @zdim (see source). Note: This uses the regexp search version (instead of the split line solution).
- zdim2: Same as zdim except that it uses the split function instead of regexp search for the field in file2.txt.
BOC1
:@BOC 提出的第一个解决方案awk 'FNR==NR{ hash[]; next } { for (i in hash) if (match(
,i)) {print; break} }' file1.txt FS='|' file2.txtawk 'FNR==NR{ hash[]; next } { for (i in hash) if (index(
,i)) {print; break} }' file1.txt FS='|' file2.txtawk 'FNR==NR{ hash[]; next } in hash' file1.txt FS='|' file2.txt
BOC2
:@BOC 建议的第二种解决方案:LC_ALL=C fgrep -f file1.txt file2.txt
codeforester
:@codeforester 接受的 Perl 解决方案(请参阅源代码)codeforester_orig
:@codeforested 提出的原始解决方案:LC_ALL=C awk 'FNR==NR{ hash[]; next } { for (i in hash) if (match(
,i)) {print; break} }' file1.txt FS='|' file2.txtjoin --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt
dawg
:使用@dawg 提出的字典和分割线的Python 解决方案(请参阅源代码)gregory1
:使用@gregory 建议的 Gnu Parallel 的解决方案^[^|]*\|word1\| ^[^|]*\|word2\| ^[^|]*\|word3\| [...]
有关如何选择,请参阅下面的说明
$block_size
。hakon1
:@H?konH?gland 提供的 Perl 解决方案(请参阅源代码)。此解决方案需要在第一次运行代码时编译 c 扩展。它不需要在file1.txt
或file2.txt
更改时重新编译。注意:在初始运行时用于编译 c-extension 的时间不包括在下面显示的运行时间中。ikegami
:使用组装的正则表达式并使用grep -P
@ikegami 给出的解决方案。注意:组装后的正则表达式被写入单独的文件regexp_ikegami.txt
,因此生成正则表达式的运行时间不包括在下面的比较中。这是使用的代码:^[^|]*\|word1\| ^[^|]*\|word2\| ^[^|]*\|word3\| [...]
inian1
:@Inian 使用的第一个解决方案match()
#! /usr/bin/env perl use feature qw(say); use strict; use warnings; use Cwd; use Getopt::Long; use Data::Printer; use FGB::Common; use List::Util qw(max shuffle); use Number::Bytes::Human qw(format_bytes); use Sys::Info; GetOptions ( "verbose" => \my $verbose, "check" => \my $check, "single-case=s" => \my $case, "expected=i" => \my $expected_no_lines, ) or die("Error in command line arguments\n"); my $test_dir = 'solutions'; my $output_file = 'out.txt'; my $wc_expected = $expected_no_lines; # expected number of output lines my $tests = get_test_names( $test_dir, $case ); my $file2_size = get_file2_size(); my $num_cpus = Sys::Info->new()->device( CPU => () )->count; chdir $test_dir; my $cmd = 'run.sh'; my @times; for my $case (@$tests) { my $savedir = getcwd(); chdir $case; say "Running '$case'.."; my $arg = get_cmd_args( $case, $file2_size, $num_cpus ); my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`; my ($user, $sys, $real ) = get_run_times( $output ); print_timings( $user, $sys, $real ) if $verbose; check_output_is_ok( $output_file, $wc_expected, $verbose, $check ); print "\n" if $verbose; push @times, $real; #push @times, $user + $sys; # this is wrong when using Gnu parallel chdir $savedir; } say "Done.\n"; print_summary( $tests, \@times );
inian2
:@Inian 使用的第二个解决方案index()
$ run_all.pl --verbose Running 'inian3'.. ..finished in 0.45 seconds ( user: 0.44, sys: 0.00 ) ..no of output lines: 66711 Running 'inian2'.. ..finished in 0.73 seconds ( user: 0.73, sys: 0.00 ) ..no of output lines: 66711 Running 'Vasiliou'.. ..finished in 0.09 seconds ( user: 0.08, sys: 0.00 ) ..no of output lines: 66711 Running 'codeforester_orig'.. ..finished in 0.05 seconds ( user: 0.05, sys: 0.00 ) ..no of output lines: 66711 Running 'codeforester'.. ..finished in 0.45 seconds ( user: 0.44, sys: 0.01 ) ..no of output lines: 66711 [...]
inian3
:@Inian 仅检查$2
字段的第三个解决方案:|Vasiliou My Benchmark |Results | Details -------------------------------|---------|---------------------- inian4 : 0.04s |0.22s | LC_ALL fgrep -f [loose] codeforester_orig : 0.05s | | fgrep -f [loose] Vasiliou2 : 0.06s |0.16s | [LC_ALL join [requires sorted files]] BOC1 : 0.06s | | grep -E [loose] BOC2 : 0.07s |15s | LC_ALL grep -E [loose] BOC2B : 0.07s | | LC_ALL grep -E [strict] inian4B : 0.08s | | LC_ALL grep -E -f [strict] Vasiliou : 0.08s |0.23s | [join [requires sorted files]] gregory1B : 0.08s | | [parallel + grep -E -f [strict]] ikegami : 0.1s | | grep -P gregory1 : 0.11s |0.5s | [parallel + fgrep -f [loose]] hakon1 : 0.14s | | [perl + c] BOC1B : 0.14s | | grep -E [strict] jjoao : 0.21s | | [compiled flex generated c code] inian6 : 0.26s |0.7s | [LC_ALL awk + split+dict] codeforester_origB : 0.28s | | grep -E -f [strict] dawg : 0.35s | | [python + split+dict] inian3 : 0.44s |1.1s | [awk + split+dict] zdim2 : 0.4s | | [perl + split+dict] codeforester : 0.45s | | [perl + split+dict] oliv : 0.5s | | [python + compiled regex + re.search()] zdim : 0.61s | | [perl + regexp+dict] inian2 : 0.73s |1.7s | [awk + index(
,i)] inian5 : 18.12s | | [LC_ALL awk + match(gregory1 : 0.86s | [parallel + fgrep -f [loose]] Vasiliou2 : 0.94s | [LC_ALL join [requires sorted files]] inian4B : 1.12s | LC_ALL grep -E -f [strict] BOC2B : 1.13s | LC_ALL grep -E [strict] BOC2 : 1.15s | LC_ALL grep -E [loose] BOC1 : 1.18s | grep -E [loose] ikegami : 1.33s | grep -P Vasiliou : 1.37s | [join [requires sorted files]] hakon1 : 1.44s | [perl + c] inian4 : 2.18s | LC_ALL fgrep -f [loose] codeforester_orig : 2.2s | fgrep -f [loose] inian6 : 2.82s | [LC_ALL awk + split+dict] jjoao : 3.09s | [compiled flex generated c code] dawg : 3.54s | [python + split+dict] zdim2 : 4.21s | [perl + split+dict] codeforester : 4.67s | [perl + split+dict] inian3 : 5.52s | [awk + split+dict] zdim : 6.55s | [perl + regexp+dict] gregory1B : 45.36s | [parallel + grep -E -f [strict]] oliv : 60.35s | [python + compiled regex + re.search()] BOC1B : 74.71s | grep -E [strict] codeforester_origB : 75.52s | grep -E -f [strict]
,i) [loose]] inian1 : 19.46s | | [awk + match(awk 'FNR==NR{hash[]; next}{for (i in hash) if (match(
,i)) {print; break}}' file1.txt FS='|' file2.txtawk 'FNR==NR{hash[]; next}{for (i in hash) if (index(
,i)) {print; break}}' file1.txt FS='|' file2.txtawk 'FNR==NR{hash[]; next}{for (i in hash) if (
~i) {print; break}}' file1.txt FS='|' file2.txtawk 'FNR==NR{hash[]; next} in hash' file1.txt FS='|' file2.txt
,i) [loose]] inian5B : 42.27s | | [LC_ALL awk + match($ locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL=
,i) [strict]] inian1B : 85.67s | | [awk + match($ LC_ALL=C locale LANG=en_US.UTF-8 LC_CTYPE="C" LC_NUMERIC="C" LC_TIME="C" LC_COLLATE="C" LC_MONETARY="C" LC_MESSAGES="C" LC_PAPER="C" LC_NAME="C" LC_ADDRESS="C" LC_TELEPHONE="C" LC_MEASUREMENT="C" LC_IDENTIFICATION="C" LC_ALL=C
,i) [strict]] Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM-Debian Testing 64bit- kernel 4.9.0.1 - no cpu freq scaling.LC_ALL=C fgrep -F -f file1.txt file2.txt
inian4
:@Inian 的第四个灵魂(codeforester_orig
与 with基本相同LC_ALL
):LC_ALL=C awk 'FNR==NR{hash[]; next}{for (i in hash) if (match(
,i)) {print; break}}' file1.txt FS='|' file2.txtparallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt
inian5
:@Inian 的第 5 个解决方案(与 相同,inian1
但带有LC_ALL
):#!/usr/bin/perl use strict; use warnings; use Regexp::Assemble qw( ); chomp( my @ids = <> ); my $ra = Regexp::Assemble->new(); $ra->add(quotemeta($_)) for @ids; print("^[^|]*\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\|");
inian6
: 与 相同,inian3
但与LC_ALL=C
. 感谢@GeorgeVasiliou 的建议。jjoao
:编译由@JJoao 提出的由 flex 生成的 C 代码(请参阅源代码)。注意:每次file1.txt
更改时都必须重新编译可执行文件。用于编译可执行文件的时间不包括在下面显示的运行时间中。oliv
:@oliv 提供的 Python 脚本(见源码)Vasiliou
:join
按照@GeorgeVasiliou 的建议使用:$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt date1|foo1|number1 date2|foo2|number2 date1|bar1|number1 date2|bar2|number2
Vasiliou2
: 与 相同,Vasiliou
但与LC_ALL=C
.zdim
:使用@zdim 提供的 Perl 脚本(请参阅源代码)。注意:这使用正则表达式搜索版本(而不是分割线解决方案)。zdim2
:zdim
除了它使用split
函数而不是正则表达式搜索file2.txt
.
Notes
笔记
- I experimented a little bit with Gnu parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal sized chunks, and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), whereas for the more realistic case presented below where file2.txt is 268M, a $block_size of 67M was used. (A small sketch of this calculation is given right after these notes.)
- The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching, meaning that the words from file1.txt did not have to match exactly in field #2 of file2.txt. A match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced. The first two methods, called BOC1B and BOC2B, used a modified regexp1.txt file. The lines in the original regexp1.txt were on the form \|foo1|foo2|...|fooN\| which would match the words at any field boundary. The modified file, regexp1b.txt, anchored the match to field #2 exclusively using the form ^[^|]*\|foo1|foo2|...|fooN\| instead.
- Then the rest of the modified methods codeforester_origB, inian1B, inian4B, inian5B, and gregory1B used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line on the form:

  ^[^|]*\|word1\|
  ^[^|]*\|word2\|
  ^[^|]*\|word3\|
  [...]

  and in addition, fgrep -f was replaced by grep -E -f for these methods.
我对 Gnu 并行进行了一些实验(参见
gregory1
上面的解决方案),以确定我的 CPU 的最佳块大小。我有 4 个内核,目前看来最佳选择是将文件 (file2.txt
) 分成 4 个相同大小的块,并在 4 个处理器中的每一个上运行一个作业。这里可能需要更多的测试。因此,对于第一个测试用例,其中file2.txt
20M,我设置$block_size
为 5M(参见gregory1
上面的解决方案),而对于下面呈现的更现实的情况,其中file2.txt
268M,使用$block_size
了 67M。该解决方案
BOC1
,BOC2
,codeforester_orig
,inian1
,inian4
,inian5
,和gregory1
所有使用松散匹配。这意味着来自的单词file1.txt
不必在 的字段 #2 中完全匹配file2.txt
。线上任何地方的匹配都被接受。由于这种行为使得将它们与其他方法进行比较变得更加困难,因此还引入了一些修改过的方法。前两个方法调用BOC1B
并BOC2B
使用了一个修改过的regexp1.txt
文件。原始regexp1.txt
表格\|foo1|foo2|...|fooN\|
中与任何字段边界处的单词匹配的行。修改后的文件 ,regexp1b.txt
专门使用表单将匹配锚定到字段 #2^[^|]*\|foo1|foo2|...|fooN\|
。然后的改性方法的其余部分
codeforester_origB
,inian1B
,inian4B
,inian5B
,和gregory1B
使用经修饰的file1.txt
。修改后的文件每行使用一个正则表达式,而不是每行一个字面意思:file1b.txt
file1.txt foo1 file2.txt date1|foo12|number5
此外,
fgrep -f
被替换grep -E -f
为这些方法。
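As a small illustration of the block-size choice described in the first note above (one chunk per core), it could be derived along these lines; the core count and file name come from the text, the rest is a sketch:

cores=4
size=$(stat -c%s file2.txt)            # size of file2.txt in bytes
block_size=$((size / cores))           # e.g. roughly 67M for a 268M file with 4 cores
parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt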
Running the tests
运行测试
Here is the script used for running all the tests. It uses the Bash time command to record the time spent for each script. Note that the time command returns three different times called real, user, and sys. First I used user + sys, but realized that this was incorrect when using the Gnu parallel command, so the time reported below is now the real part returned by time. See this question for more information about the different times returned by time.
这是用于运行所有测试的脚本。它使用 Bash 的 time 命令记录每个脚本花费的时间。注意 time 命令会返回三个不同的时间：real、user 和 sys。起初我使用 user + sys，但后来意识到在使用 Gnu parallel 命令时这是不正确的，因此下面报告的时间现在是 time 返回的 real 部分。有关 time 返回的不同时间的更多信息，请参阅此问题。
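For reference, a minimal way to capture those three values for a single solution script is sketched below; run.sh stands for whichever solution is being timed:

# -p prints POSIX-format timing: real, user and sys on separate lines
{ time -p bash run.sh > out.txt; } 2> timing.txt
cat timing.txt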
The first test is run with file1.txt containing 5 lines, and file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.
第一个测试使用包含 5 行的 file1.txt 和包含 1000000 行的 file2.txt。下面是 run_all.pl 脚本的前 52 行，脚本的其余部分可在这里获得。
run_all.pl:
Results
结果
Here is the output from running the tests:
以下是运行测试的输出:
Summary
概括
[Results obtained by @Vasiliou are shown in the middle column.]
[@Vasiliou 获得的结果显示在中间列中。]
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#define BLOCK_SIZE 8192 /* how much to read from file each time */
static char read_buf[BLOCK_SIZE + 1];
/* reads a block from file, returns -1 on error, 0 on EOF,
else returns chars read, pointer to buf, and pointer to end of buf */
size_t read_block( int fd, char **ret_buf, char **end_buf ) {
int ret;
char *buf = read_buf;
size_t len = BLOCK_SIZE;
while (len != 0 && (ret = read(fd, buf, len)) != 0) {
if (ret == -1) {
if (errno == EINTR)
continue;
perror( "read" );
return ret;
}
len -= ret;
buf += ret;
}
*end_buf = buf;
*ret_buf = read_buf;
return (size_t) (*end_buf - *ret_buf);
}
/* updates the line buffer with the char pointed to by cur,
also updates cur
*/
int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) {
if ( *llen > max_line_len ) {
fprintf( stderr, "Too long line. Maximimum allowed line length is %ld\n",
max_line_len );
return 0;
}
**line = **cur;
(*line)++;
(*llen)++;
(*cur)++;
return 1;
}
/* search for first pipe on a line (or next line if this is empty),
assume line ptr points to beginning of line buffer.
return 1 on success
Return 0 if pipe could not be found for some reason, or if
line buffer length was exceeded */
int search_field_start(
int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len
) {
char *line_start = *line;
while (1) {
if ( *cur >= *end_buf ) {
size_t res = read_block( fd, cur, end_buf );
if (res <= 0) return 0;
}
if ( **cur == '|' ) break;
/* Currently we just ignore malformed lines ( lines that do not have a pipe,
and empty lines in the input */
if ( **cur == '\n' ) {
*line = line_start;
*llen = 0;
(*cur)++;
}
else {
if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
}
}
return 1;
}
/* assume cur points at starting pipe of field
return -1 on read error,
return 0 if field len was too large for buffer or line buffer length exceed,
else return 1
and field, and length of field
*/
int copy_field(
int fd, char **cur, char **end_buf, char *field,
size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len
) {
*flen = 0;
while( 1 ) {
if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
if ( *cur >= *end_buf ) {
size_t res = read_block( fd, cur, end_buf );
if (res <= 0) return -1;
}
if ( **cur == '|' ) break;
if ( *flen > max_field_len ) {
printf( "Field width too large. Maximum allowed field width: %ld\n",
max_field_len );
return 0;
}
*field++ = **cur;
(*flen)++;
}
/* It is really not necessary to null-terminate the field
since we return length of field and also field could
contain internal null characters as well
*/
//*field = '\0';
return 1;
}
/* search to beginning of next line,
return 0 on error,
else return 1 */
int search_eol(
int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len)
{
while (1) {
if ( *cur >= *end_buf ) {
size_t res = read_block( fd, cur, end_buf );
if (res <= 0) return 0;
}
if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
if ( *(*cur-1) == '\n' ) {
break;
}
}
//**line = '\0'; // not necessary
return 1;
}
#define MAX_FIELD_LEN 80 /* max number of characters allowed in a field */
#define MAX_LINE_LEN 80 /* max number of characters allowed on a line */
/*
Get next field ( i.e. field #2 on a line). Fields are
separated by pipes '|' in the input file.
Also get the line of the field.
Return 0 on error,
on success: Move internal pointer to beginning of next line
return 1 and the field.
*/
size_t get_field_and_line_fast(
int fd, char *field, size_t *flen, char *line, size_t *llen
) {
static char *cur = NULL;
static char *end_buf = NULL;
size_t res;
if (cur == NULL) {
res = read_block( fd, &cur, &end_buf );
if ( res <= 0 ) return 0;
}
*llen = 0;
if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0;
if ( (res = copy_field(
fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN
) ) <= 0)
return 0;
if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0;
return 1;
}
void search( char *filename, SV *href)
{
if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) {
croak( "Not a hash reference" );
}
int fd = open (filename, O_RDONLY);
if (fd == -1) {
croak( "Could not open file '%s'", filename );
}
char field[MAX_FIELD_LEN+1];
char line[MAX_LINE_LEN+1];
size_t flen, llen;
HV *hash = (HV *)SvRV( href );
while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) {
if( hv_exists( hash, field, flen ) )
fwrite( line, sizeof(char), llen, stdout);
}
if (close(fd) == -1)
croak( "Close failed" );
}
A more realistic test case
更真实的测试用例
I then created a more realistic case with file1.txt
having 100 words and file2.txt
having 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english
using shuf -n1000 /usr/share/dict/american-english > words.txt
then extracted 100 of these words into file1.txt
and then constructed file2.txt
the same way as described above for the first test case. Note that the dictionary file was UTF-8 encoded, and I stripped away all non-ASCII characters from the words.txt
.
然后我创建了一个更真实的案例,file1.txt
有 100 个单词和file2.txt
1000 万行(268Mb 文件大小)。我在/usr/share/dict/american-english
使用时从字典中提取了 1000 个随机单词,shuf -n1000 /usr/share/dict/american-english > words.txt
然后提取了其中的 100 个单词file1.txt
,然后file2.txt
以与上述第一个测试用例相同的方式构建。请注意,字典文件是 UTF-8 编码的,我从words.txt
.
Then I run the test without the three slowest methods from the previous case. I.e. inian1
, inian2
, and inian5
were left out. Here are the new results:
然后我在没有前一个案例中三种最慢的方法的情况下运行测试。即inian1
、inian2
和inian5
被排除在外。以下是新结果:
Note
笔记
The grep
based solutions were looking for a match on the whole line, so in this case they contained some false matches: the methods codeforester_orig
, BOC1
, BOC2
, gregory1
, inian4
, and oliv
extracted 1,087,609 lines out of 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from file2.txt
.
在grep
基础的解决方案正在寻找对全行的匹配,所以在这种情况下,它们包含一些虚假的比赛:方法codeforester_orig
,BOC1
,BOC2
,gregory1
,inian4
,和oliv
提取1087609个线条勾勒出1000万线,而其他方法从中提取正确的997993行file2.txt
。
Notes
笔记
I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
The whole benchmark study is available here.
我在我的 Ubuntu 16.10 笔记本电脑(Intel Core i7-7500U CPU @ 2.70GHz)上测试了这个
整个基准研究可在此处获得。
回答by Inian
Did you try Awk, that could speed up things a bit:
您是否尝试过 Awk？它可以稍微加快速度：

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) using the index() function in Awk, as suggested by comments from Benjamin W., below
（或）按照下面 Benjamin W. 评论的建议，使用 Awk 中的 index() 函数：

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt
(or) a more direct regex match as suggested by Ed Morton in comments,
（或）按照 Ed Morton 在评论中的建议，使用更直接的正则表达式匹配：

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt

is all you need. I'm guessing this will be faster but not exactly sure on files with million+ entries. Here the problem is with the possibility of a match anywhere along the line. Had the same been in any particular column (e.g. say $2 alone), a faster approach could be
是你所需要的全部。我猜这会更快，但对于有百万以上条目的文件并不确定。这里的问题在于匹配可能出现在行上的任何位置。如果匹配只限于某个特定列（例如只看 $2），更快的方法可以是

awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
Also you could speed things up by playing with the locale set in your system. Paraphrasing from this wonderful Stéphane Chazelas's answer on the subject, you could speed up things pretty quickly by passing the locale LC_ALL=C to the command locally being run.
你也可以通过调整系统的 locale 设置来加快速度。转述 Stéphane Chazelas 关于这个主题的精彩回答：把 LC_ALL=C 这个语言环境传给本地运行的命令，就能很快地提速。
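As a quick way to see what this changes, the sketch below first shows the current locale settings and then runs a single command under the C locale without touching the shell's own settings:

# show the current locale settings
locale

# run just this one command with the C locale
LC_ALL=C fgrep -F -f file1.txt file2.txt > file.matched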
On any GNU
based system, the defaults for the locale
在任何GNU
基于系统上,默认值locale
With one variable LC_ALL
, you can set all LC_
type variables at once to a specified locale
使用一个变量LC_ALL
,您可以LC_
一次将所有类型变量设置为指定的语言环境
So what does this impact?
那么这有什么影响呢?
Simply put, when using the locale C
it will default to the server's base Unix/Linux language of ASCII
. Basically when you grep
something, by default your locale is going to be internationalized and set to UTF-8
, which can represent every character in the Unicode character set to help display any of the world's writing systems, currently over more than 110,000
unique characters, whereas with ASCII
each character is encoded in a single byte sequence and its character set comprises of no longer than 128
unique characters.
简而言之,当使用 时,locale C
它将默认为服务器的基本 Unix/Linux 语言ASCII
. 基本上,当你grep
做某事时,默认情况下你的语言环境将被国际化并设置为UTF-8
,它可以代表 Unicode 字符集中的每个字符,以帮助显示世界上的任何书写系统,目前超过110,000
唯一字符,而ASCII
每个字符是以单字节序列编码,其字符集由不超过128
唯一字符组成。
So it translates to this: when using grep on a file encoded in the UTF-8 character-set, it needs to match each character against any of the hundred thousand unique characters, but just 128 in ASCII, so use your fgrep as
所以归结起来就是：在以 UTF-8 字符集编码的文件上使用 grep 时，它需要把每个字符与十几万个唯一字符中的任何一个进行匹配，而在 ASCII 下只有 128 个，所以这样使用你的 fgrep：

LC_ALL=C fgrep -F -f file1.txt file2.txt
Also, the same can be adapted to the Awk, since it uses a regex match with the match($0,i) call; setting the C locale could speed up the string match.
此外，同样的做法也适用于 Awk，因为它通过 match($0,i) 调用使用正则匹配，设置 C 语言环境可以加快字符串匹配：

LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
回答by gregory
Assumptions: 1. You want to run this search on just your local workstation. 2. You have multiple cores/cpus to take advantage of a parallel search.
假设: 1. 您只想在本地工作站上运行此搜索。2. 您有多个内核/CPU 以利用并行搜索。
parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt

Some further tweaks depending on the context:
A. Disable NLS with LANG=C (this is mentioned already in another answer)
B. Set a max number of matches with the -m flag.
根据上下文进行一些进一步的调整: A. 使用 LANG=C 禁用 NLS(这在另一个答案中已经提到) B. 使用 -m 标志设置最大匹配数。
Note: I'm guessing that file2 is ~4GB and the 10M block size is ok, but you may need to optimize the block size to get the fastest run.
注意:我猜 file2 是 ~4GB 并且 10M 块大小是可以的,但是您可能需要优化块大小以获得最快的运行。
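A sketch of the tweaked variant described above, combining both suggestions; the -m value of 100 is only an example, not a recommendation from the answer:

# disable NLS and stop each chunk after 100 matches
parallel --pipepart -a file2.txt --block 10M 'LANG=C fgrep -F -m 100 -f file1.txt'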
回答by ikegami
This Perl script (a) generates a regex pattern:
这个 Perl 脚本（a）生成一个正则表达式模式：

#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble qw( );

chomp( my @ids = <> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print("^[^|]*\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\|");
Here's how it can be used:
以下是它的使用方法：

$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2

Note the script uses Regexp::Assemble, so you may need to install it.
请注意该脚本使用 Regexp::Assemble，因此您可能需要安装它。
sudo su
cpan Regexp::Assemble

Notes:
笔记:
Unlike the solutions dubbed BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles
file1.txt
foo1
file2.txt
date1|foo12|number5

Mine should be better than the similar solution by @BOC because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in
file2.txt
, whereas the linked solution can fail.)I don't know how it compares to the split+dictionary solutions.
与称为 BOC1、BOC2、codeforester_orig、gregory1、inian2、inian4 和 oliv 的解决方案不同,我的解决方案正确处理
##代码##我的应该比@BOC的类似解决方案更好,因为该模式经过优化以减少回溯。(如果 中的字段超过三个
file2.txt
,我的也可以使用,而链接的解决方案可能会失败。)我不知道它与 split+dictionary 解决方案相比如何。
回答by H?kon H?gland
Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:
这是一个使用 Inline::C 来加速在大文件中搜索匹配字段的 Perl 解决方案：

use strict;
use warnings;
use Inline C => './search.c';

my $smallfile = 'file1.txt';
my $bigfile   = 'file2.txt';

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;
search( $bigfile, \%word );
The search()
sub routine is implemented in pure C using perlapi
to look up keys in the small file dictionary %words
:
该search()
子例程使用纯C语言实现perlapi
查找键在小文件词典%words
:
search.c:
搜索.c:
##代码##Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution (see method zdim2
in my other answer) presented here.
测试表明,它比此处提供的最快的纯 Perl 解决方案(请参阅zdim2
我的其他答案中的方法)快大约 3 倍。
回答by dawg
Here is a Python solution using sets -- roughly equivalent to a Perl key only hash or awk array in concept.
这是一个使用集合的 Python 解决方案——在概念上大致相当于一个 Perl 键只有散列或 awk 数组。
#!/usr/bin/python

import sys

with open(sys.argv[1]) as f:
    tgt={e.rstrip() for e in f}

with open(sys.argv[2]) as f:
    for line in f:
        cells=line.split("|")
        if cells[1] in tgt:
            print line.rstrip()

When I run this on files of similar size, it runs in about 8 seconds.
当我在类似大小的文件上运行它时,它运行大约 8 秒。
Same speed as:
与以下速度相同:
$ awk 'FNR==NR{arr[$1]; next} $2 in arr{print $0}' FS="|" /tmp/f1 /tmp/f2

Both the Python and awk solution here are full string match only; not a partial regex style match.
这里的 Python 和 awk 解决方案都只是全字符串匹配;不是部分正则表达式样式匹配。
Since the awk solution is fast and POSIX compliant, that is the better answer.
由于 awk 解决方案速度快且符合 POSIX 标准,因此这是更好的答案。
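A minimal invocation of the Python script above, assuming it has been saved as fgrep_sets.py (the file name is an assumption):

python fgrep_sets.py file1.txt file2.txt > file.matched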
回答by George Vasiliou
Though this thread is over, but all grep-alike methods between two files are gathered in this post, why not to add this awk alternative, similar (or even improved) to the bounty winning Inian's awk solution:
虽然这个线程已经结束,但是这篇文章中收集了两个文件之间所有类似 grep 的方法,为什么不添加这个 awk 替代方案,类似于(甚至改进)赢得赏金的 Inian 的 awk 解决方案:
awk 'NR==FNR{a[$1]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt   # For matches restricted on Field2 of datafile

This is equivalent to Inian awk $2 in hash
solution but it could be even faster due to the fact that we don't ask awk to check if the whole hash array contains $2 of file2 - we just check if a[$2] has a value or not.
这相当于 Inian awk$2 in hash
解决方案,但它可能更快,因为我们不要求 awk 检查整个哈希数组是否包含 file2 的 $2 - 我们只检查 a[$2] 是否有值。
While reading the first patterns file appart from creating the hash array we assign also a value.
在从创建哈希数组中读取第一个模式文件时,我们还分配了一个值。
If $2
of datafile had been found before in patterns file, then a[$2]
would have a value and thus will be printed because is not null.
如果$2
之前在模式文件中找到了数据文件,a[$2]
则将有一个值,因此将被打印,因为它不为空。
if a[$2]
of datafile returns no value(null) this is translated to false => no printing.
如果a[$2]
数据文件没有返回值(空),这将被转换为 false => 不打印。
Extension to match any of the three fields of datafile:
扩展以匹配数据文件的三个字段中的任何一个:
awk 'NR==FNR{a[$1]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt   # Printed if any of the three fields of datafile match pattern.

In both cases, applying LC_ALL=C in front of awk seems to speed things up.
在这两种情况下,在 awk 前面应用LC_ALL=C似乎都可以加快速度。
PS1: Of course this solution also has the pitfalls of all awk solutions. It is not pattern matching. It is a direct / fixed matching between the two files, like most of the solutions in here.
PS1:当然这个解决方案也有所有 awk 解决方案的缺陷。不是模式匹配。是两个文件之间的直接/固定匹配,就像这里的大多数解决方案一样。
PS2: In my poor machine benchmark using the small benchmark files of H?kon H?gland, i get about 20% better performance comparing to the awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
PS2:在我使用H?kon H?gland的小基准文件的糟糕机器基准测试中,与awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
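Following the LC_ALL=C note above, the field-2 variant would be prefixed like this (a sketch of the same command shown earlier, not a separately benchmarked form):

LC_ALL=C awk 'NR==FNR{a[$1]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt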
回答by George Vasiliou
Can you give a try to join? Files must be sorted though...
你能试试 join 吗？不过文件必须先排序……

$ cat d.txt
bar1
bar2
foo1
foo2
$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
$ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2
Small Update:
By using LC_ALL=C in front of join, things are really speed up as can be seen in the benchmark of H?kon H?gland
小更新:
通过在 join 前使用 LC_ALL=C,在H?kon H?gland的基准测试中可以看出速度确实加快了
PS1: I have my doubts if join can be faster than grep -f ...
PS1:我怀疑 join 是否比 grep -f 更快 ...