bash: comparing two files by lines and removing duplicates from the first file

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/37503186/



Tags: bash, unix, grep

Asked by Ankit Jain

Problem:

  1. Compare the two files,
  2. remove the duplicates from the first file,
  3. then append the lines of file1 to file2.

Illustration by example

Suppose the two files are test1 and test2.

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6

And test1 is

$ cat test1
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5

Comparing test1 to test2 and removing duplicates from test1.

Result Required:

$ cat test1
www.xyz.com/abc-1

and then appending this test1 data to test2

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1

Solutions Tried:

join -v1 -v2 <(sort test1) <(sort test2)

which resulted in this (the wrong output):

$ join -v1 -v2 <(sort test1) <(sort test2)
www.xyz.com/abc-1
www.xyz.com/abc-6
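
Note that the -v1 -v2 combination prints the unpairable lines from both files (a symmetric difference), which is why www.xyz.com/abc-6 appeared as well. Dropping -v2 would, as a sketch, give only the lines unique to test1:

$ join -v1 <(sort test1) <(sort test2)
www.xyz.com/abc-1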

Another solution I tried was:

fgrep -vf test1 test2

which produced no output.
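
Note: here the pattern file and the input file are swapped; -f test1 makes test1 the pattern list and filters test2. Reversing the arguments, and adding -x for whole-line matches, would print the wanted lines from test1:

$ fgrep -vxf test2 test1
www.xyz.com/abc-1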

Answer by andlrc

With awk:

% awk 'NR == FNR { a[$0] = 1; next } !a[$0]' test2 test1
www.xyz.com/abc-1

Breakdown:

NR == FNR { # Run for test2 only
  a[$0] = 1 # Store whole line as key in associative array
  next      # Skip next block
}
!a[$0]      # Print lines from test1 that are not in a
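
Putting the one-liner to work for all three steps, a minimal sketch (the temporary file avoids truncating test1 while awk is still reading it):

# keep only the test1 lines that are absent from test2
awk 'NR == FNR { a[$0] = 1; next } !a[$0]' test2 test1 > test1.tmp &&
    mv test1.tmp test1
# then append the deduplicated test1 to test2
cat test1 >> test2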

Answer by John1024

Remove lines from test1 because they are in test2:

$ grep -vxFf test2 test1
www.xyz.com/abc-1

To overwrite test1:

grep -vxFf test2 test1 >test1.tmp && mv test1.tmp test1

To append the new test1 to the end of test2:

$ cat test1 >> test2
$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1

The grep options

grep normally prints matching lines. -v tells grep to do the reverse: it prints only lines that do not match.

-x tells grep to do whole-line matches.

-F tells grep that we are using fixed strings, not regular expressions.

-f test2 tells grep to read those fixed strings, one per line, from the file test2.
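
To see why -x matters, suppose (hypothetically) test2 also contained the shorter line www.xyz.com/abc, which is a prefix of every test1 line. Since -F matches fixed strings anywhere in a line:

grep -vFf test2 test1    # without -x: every test1 line contains a pattern, so nothing is printed
grep -vxFf test2 test1   # with -x: only exact whole-line duplicates are suppressed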

Answer by sumitya

Solution to problems 1 and 2:

diff test1 test2 | grep '^<' | sed 's/^< //' > test1.tmp && mv test1.tmp test1

Here is the output:

$ cat test1
www.xyz.com/abc-1

Solution to problem 3:

cat test1 >> test2

Here is the output:

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1
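
One caveat with the diff approach: diff compares the files line by line, in order, so if the two files are not sorted the same way, lines common to both can still be reported as differences. Sorting both inputs first, as a sketch, avoids that:

diff <(sort test1) <(sort test2) | grep '^<' | sed 's/^< //' > test1.tmp && mv test1.tmp test1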

Answer by Ed Morton

If the lines in each file are unique, as shown in your sample input, then (since you are already sorting the input files in your attempted solutions, sorted output must be OK) this is all you need:

$ sort -u test1 test2
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
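
If the merged result should also replace test2 (step 3), sort's -o option can safely write to one of its own input files, because sort reads all of its input before opening the output:

sort -u test1 test2 -o test2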

If you need something else then edit your question to clarify your requirements and provide sample input/output that would cause this to break.