Linux: extracting unique values between 2 sets/files

Disclaimer: the content on this page is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4717250/

extracting unique values between 2 sets/files

linux, perl, bash, scripting, command-line

Asked by mark

Working in linux/shell env, how can I accomplish the following:

text file 1 contains:

1
2
3
4
5

text file 2 contains:

6
7
1
2
3
4

I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.

How do I do this from the command line?

many thanks!

Accepted answer by SiegeX

$ awk 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2
6
7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !a[$0] is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given
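
For reference, here is a minimal shell session applying the one-liner to the sample files from the question (the printf setup is just one way of creating the files and is only for illustration):

printf '%s\n' 1 2 3 4 5 > file1
printf '%s\n' 6 7 1 2 3 4 > file2
# print every line of file2 that was never seen while reading file1
awk 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2
# 6
# 7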

Answered by David Weiser

If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that.

However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be:

findUniqueValues(file1, file2){
    contents1 = array of values from file1
    contents2 = array of values from file2
    foreach(value2 in contents2){
        found=false
        foreach(value1 in contents1){
            if (value2 == value1) found=true
        }
        if(!found) print value2
    }
}

This isn't the most elegant way of doing this, since it has a O(n^2) time complexity, but it will do the job.
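
Since the rest of this page sticks to shell tools, here is a rough bash transcription of that pseudocode (my sketch, not part of the original answer; it keeps the same O(n^2) nested-loop shape purely for illustration):

find_unique_values() {
    # read both files into arrays (mapfile needs bash 4+)
    local -a contents1 contents2
    local value1 value2 found
    mapfile -t contents1 < "$1"
    mapfile -t contents2 < "$2"
    for value2 in "${contents2[@]}"; do
        found=false
        for value1 in "${contents1[@]}"; do
            [ "$value2" = "$value1" ] && found=true
        done
        "$found" || printf '%s\n' "$value2"
    done
}
find_unique_values file1 file2   # prints 6 and 7 for the sample files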

Answered by Daniel Gallagher

Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

This will output duplicates, so if there is 1 3 in file1, but 2 in file2, this will still output 1 3. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
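
As a side note (my annotation, not part of the answer): comm reads two sorted files and prints three columns, and the -1/-3 flags above simply hide two of them, which is why only the "new" lines of file2 remain:

comm file1.sorted file2.sorted
# column 1 (no indent) : lines only in file1.sorted -> 5
# column 2 (one tab)   : lines only in file2.sorted -> 6, 7
# column 3 (two tabs)  : lines in both files        -> 1, 2, 3, 4
comm -1 -3 file1.sorted file2.sorted
# 6
# 7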

Answered by sid_com

with grep:

grep -F -x -v -f file_1 file_2
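
The same command with its flags spelled out (annotations mine, not part of the original answer):

# -F        : treat the patterns as fixed strings, not regular expressions
# -x        : a pattern only counts if it matches a whole line
# -v        : invert the match, i.e. print the lines that do NOT match
# -f file_1 : read the patterns, one per line, from file_1
grep -F -x -v -f file_1 file_2
# on the sample data this prints 6 and 7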

Answered by ghostdog74

here's another awk solution

awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7

Answered by xebeche

I was wondering which of the following solutions was the "fastest" for "larger" files:

awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2   # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2            # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2

Results of my benchmarks in short:

  • Do not use grep -Fxf, it's much slower (2-4 times in my tests).
  • comm is slightly faster than join.
  • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (Of course, they do not assume sorted files.)
  • awk1 + awk2, supposedly, use more RAM and less CPU. Real run times are lower for comm, probably due to the fact that it uses more threads. CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
  321599   321599  8098710 file1
  321603   321603  8098794 file2

Typical results of fastest runs

awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004

BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
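
For anyone who wants to repeat such a benchmark, a rough recipe might look like the following (my sketch, not the author's actual setup; file sizes and the overlap are arbitrary):

# two large, partially overlapping test files
seq 1 321599      | shuf > file1
seq 150000 471602 | shuf > file2

time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null
time comm -13 <(sort file1) <(sort file2) > /dev/null
time join -v 2 <(sort file1) <(sort file2) > /dev/null
time grep -v -F -x -f file1 file2 > /dev/null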

Answered by Ivo

How about:

diff file_1 file_2 | grep '^>' | cut -c 3-

This would print the entries in file_2 which are not in file_1. For the opposite result one just has to replace '>' with '<'. 'cut' removes the first two characters added by 'diff', that are not part of the original content.

The files don't even need to be sorted.
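
Applied to the sample files from the question (saved here as file_1 and file_2), the pipeline looks roughly like this; the intermediate lines are what diff's normal output format should produce for this input:

diff file_1 file_2
# 0a1,2
# > 6
# > 7
# 5d6
# < 5
diff file_1 file_2 | grep '^>'              # keep only the lines added in file_2
diff file_1 file_2 | grep '^>' | cut -c 3-  # strip the leading "> "
# 6
# 7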