Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original URL and author information. Original StackOverflow question: http://stackoverflow.com/questions/24324350/

Date: 2020-09-10 00:53:37  Source: igfitidea

How to remove common lines between two files without sorting?

bash sorting optimization sed comm

Asked by harrison4

I have two unsorted files that have some lines in common.

file1.txt

Z
B
A
H
L

file2.txt

S
L
W
Q
A

This is how I currently remove the common lines:

sort -u file1.txt > file1_sorted.txt
sort -u file2.txt > file2_sorted.txt

comm -23 file1_sorted.txt file2_sorted.txt > file_final.txt

Output:

B
H
Z

The problem is that I want to keep the original order of file1.txt, that is:

Desired output:

Z
B
H

One solution I thought of is to loop over all the lines of file2.txt and, for each one, run:

sed -i "/^${line_file2}\$/d" file1.txt

But if the files are big, performance may be poor, since file1.txt is rewritten once for every line of file2.txt.

  • Do you like my idea?
  • Do you have an alternative way to do it?
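
As a rough sketch of the loop described in the question (not code from the original post), the following runs the sed deletion once per line of file2.txt. It assumes GNU sed for the -i option and assumes the lines contain no regex metacharacters or slashes. The names workdir and result.txt are made up for illustration.

```shell
# Hypothetical scratch directory holding the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' Z B A H L > "$workdir/file1.txt"
printf '%s\n' S L W Q A > "$workdir/file2.txt"

# Work on a copy so file1.txt itself is left untouched.
cp "$workdir/file1.txt" "$workdir/result.txt"

while IFS= read -r line_file2; do
    # Double quotes let the shell expand the variable; \$ keeps the
    # end-of-line anchor from being read as a shell expansion.
    sed -i "/^${line_file2}\$/d" "$workdir/result.txt"
done < "$workdir/file2.txt"

cat "$workdir/result.txt"
```

Each iteration rewrites the whole working file, so the cost grows with the number of lines in file2.txt times the size of file1.txt, which is the performance concern raised in the question.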

Answered by Kent

grep or awk:

awk 'NR==FNR{a[$0]=1;next}!a[$0]' file2 file1
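
For readers less familiar with awk, here is an expanded sketch of the same idea, run on the sample data from the question. The array name seen, the workdir directory, and the kept variable are illustrative choices, not part of the original answer.

```shell
# Hypothetical scratch directory holding the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' Z B A H L > "$workdir/file1.txt"
printf '%s\n' S L W Q A > "$workdir/file2.txt"

# NR==FNR holds only while awk reads its first file argument (file2.txt):
#   store every line of file2.txt as a key of the array "seen".
# For the second file (file1.txt), print a line only if it is not a key.
kept=$(awk 'NR == FNR { seen[$0] = 1; next }
            !($0 in seen)' "$workdir/file2.txt" "$workdir/file1.txt")

printf '%s\n' "$kept"
```

Because file1.txt is read in a single pass and its lines are printed as they are encountered, the original order is preserved. One caveat: if file2.txt is empty, NR==FNR stays true while file1.txt is read, so nothing is printed.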

Answered by perreal

You can use just grep (-v for invert, -f for file). Grep lines from input1 that do not match any line in input2:

grep -vf input2 input1

Gives:

Z
B
H

Answered by terdon

I've written a little Perl script that I use for this kind of thing. It can do more than what you ask for, but it can also do what you need:

#!/usr/bin/env perl -w
use strict;
use Getopt::Std;
my %opts;
getopts('hvfcmdk:', \%opts);
my $missing=$opts{m}||undef;
my $column=$opts{k}||undef;
my $common=$opts{c}||undef;
my $verbose=$opts{v}||undef;
my $fast=$opts{f}||undef;
my $dupes=$opts{d}||undef;
$missing=1 unless $common || $dupes;
&usage() unless $ARGV[1];
&usage() if $opts{h};
my (%found,%k,%fields);
if ($column) {
    die("The -k option only works in fast (-f) mode\n") unless $fast;
    $column--; ## So I don't need to count from 0
}

## Read FILE1 and record each line (or its first word in -f mode) as a key
open(my $F1,"$ARGV[0]")||die("Cannot open $ARGV[0]: $!\n");
while(<$F1>){
    chomp;
    if ($fast){
        my @aa=split(/\s+/,$_);
        $k{$aa[0]}++;
        $found{$aa[0]}++;
    }
    else {
        $k{$_}++;
        $found{$_}++;
    }
}
close($F1);
my $n=0;
## In verbose mode, count the lines of FILE2 first for the progress report
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");
my $size=0;
if($verbose){
    while(<F2>){
        $size++;
    }
}
close(F2);
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");

## Read FILE2 and bump the counter of every FILE1 key that it matches
while(<F2>){
    next if /^\s+$/;
    $n++;
    chomp;
    print STDERR "." if $verbose && $n % 10==0;
    print STDERR "[$n of $size lines]\n" if $verbose && $n % 800==0;
    if($fast){
        my @aa=split(/\s+/,$_);
        $k{$aa[0]}++ if defined($k{$aa[0]});
        $fields{$aa[0]}=\@aa if $column;
    }
    else{
        my @keys=keys(%k);
        foreach my $key(keys(%found)){
            if (/\Q$key/){
                $k{$key}++ ;
                $found{$key}=undef unless $dupes;
            }
        }
    }
}
close(F2);
print STDERR "[$n of $size lines]\n" if $verbose;

## A count of 1 means the key was seen only in FILE1; more means it matched in FILE2
if ($column) {
    $missing && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>2}keys(%k);
}
else {
    $missing && do map{print "$_\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{print "$_\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{print "$_\n" if $k{$_}>2}keys(%k);
}
sub usage{
    print STDERR <<EndOfHelp;

  USAGE: compare_lists.pl FILE1 FILE2

      This script will compare FILE1 and FILE2, searching for the
      contents of FILE1 in FILE2 (and NOT vice versa). FILE one must
      be one search pattern per line, the search pattern need only be
      contained within one of the lines of FILE2.

    OPTIONS:
      -c : Print patterns COMMON to both files
      -f : Search only the first characters of each line of FILE2
      for the search pattern given in FILE1
      -d : Print duplicate entries
      -m : Print patterns MISSING in FILE2 (default)
      -h : Print this help and exit
EndOfHelp
      exit(0);
}

In your case, you would run it as

list_compare.pl -cf file1.txt file2.txt

The -f option makes it compare only the first word (delimited by whitespace) of each line of file2 and greatly speeds things up. To compare entire lines, remove the -f.
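
A note on the grep approach mentioned in the answers: by default, grep treats each line of the -f file as a regular expression and matches it anywhere within a line. As a hedged sketch (the sample data and the names workdir, loose, and strict are made up for illustration), adding the standard -F (fixed strings) and -x (whole-line match) options makes grep compare complete lines literally:

```shell
# Hypothetical scratch directory with data chosen to expose the difference.
workdir=$(mktemp -d)
printf '%s\n' AB A a.c abc > "$workdir/input1"
printf '%s\n' A a.c        > "$workdir/input2"

# Regex, substring matching: "A" also hits "AB", and "a.c" also hits "abc",
# so every line of input1 is filtered out (|| true: grep exits 1 on no output).
loose=$(grep -vf "$workdir/input2" "$workdir/input1" || true)

# Literal, whole-line matching: only the exact lines "A" and "a.c" are removed.
strict=$(grep -Fxvf "$workdir/input2" "$workdir/input1")

printf 'loose: [%s]\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With single-letter lines like those in the question, both forms give the same result. The stricter form matters once lines can be substrings of one another or contain characters such as "." or "*".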