bash - comparing two files by lines and removing duplicates from the first file
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license, keep the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/37503186/
comparing two files by lines and removing duplicates from first file
Asked by Ankit Jain
Problem:
- Need to compare two files,
- remove the duplicates from the first file,
- then append the lines of file1 to file2.
Illustration by example
Suppose the two files are test1 and test2:
$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
And test1 is:
$ cat test1
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
Comparing test1 to test2 and removing the duplicates from test1:
Result Required:
$ cat test1
www.xyz.com/abc-1
and then appending this test1 data to test2:
$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1
Solutions Tried:
join -v1 -v2 <(sort test1) <(sort test2)
which resulted in this wrong output:
$ join -v1 -v2 <(sort test1) <(sort test2)
www.xyz.com/abc-1
www.xyz.com/abc-6
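The reason this was wrong: join -v1 -v2 prints the unpairable lines from both files, so www.xyz.com/abc-6 (present only in test2) is printed too. A corrected sketch along the same lines (not part of the original post) would ask join for only the lines unique to the first file:
$ join -v 1 <(sort test1) <(sort test2)
www.xyz.com/abc-1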
Another solution I tried was:
fgrep -vf test1 test2
which produced no output.
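Note that for the stated goal the file arguments are also reversed here: to keep the lines of test1 that are not in test2, the patterns should come from test2 and the lines to filter from test1. A plausible corrected form (essentially what the accepted answer below uses) would be:
grep -vxFf test2 test1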
Answered by andlrc
With awk:
% awk 'NR == FNR { a[$0] = 1; next } !a[$0]' test2 test1
www.xyz.com/abc-1
Breakdown:
NR == FNR {   # Run for test2 only
  a[$0] = 1   # Store whole line as key in associative array
  next        # Skip next block
}
!a[$0]        # Print lines from test1 that are not in a
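To carry out all three of the question's steps with this awk filter, a sketch (test1.tmp is an arbitrary scratch file name, not from the original answer):
awk 'NR == FNR { a[$0] = 1; next } !a[$0]' test2 test1 > test1.tmp &&
mv test1.tmp test1    # overwrite test1 without the duplicates
cat test1 >> test2    # then append the remaining lines to test2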
Answered by John1024
Remove lines from test1 because they are in test2:
$ grep -vxFf test2 test1
www.xyz.com/abc-1
To overwrite test1:
grep -vxFf test2 test1 > test1.tmp && mv test1.tmp test1
To append the new test1 to the end of test2:
cat test1 >> test2
$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1
The grep options
grep normally prints matching lines. -v tells grep to do the reverse: it prints only the lines that do not match.
-x tells grep to do whole-line matches.
-F tells grep that we are using fixed strings, not regular expressions.
-f test2 tells grep to read those fixed strings, one per line, from the file test2.
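A quick hypothetical illustration (not from the original answer) of why -x matters here: if a pattern line were only a substring of a data line, plain -F would still filter that line out, while -x keeps it:
$ printf 'www.xyz.com/abc-1\n' | grep -vFf <(printf 'abc-1\n')   # substring match, line suppressed
$ printf 'www.xyz.com/abc-1\n' | grep -vxFf <(printf 'abc-1\n')  # no whole-line match, line kept
www.xyz.com/abc-1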
Answered by sumitya
Solution to problems 1 and 2:
diff test1 test2 | grep "<" | sed 's/< \+//g' > test1.tmp && mv test1.tmp test1
here is the output
$ cat test1
www.xyz.com/abc-1
solution to problem 3:
cat test1 >> test2
here is the output
$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1
Answered by Ed Morton
If the lines in each file are unique, as shown in your sample input, then, since you are already sorting the input files in your attempted solutions and so sorted output must be acceptable, this is all you need:
$ sort -u test1 test2
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
If you need something else then edit your question to clarify your requirements and provide sample input/output that would cause this to break.
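To actually replace test2 with that merged result, a sketch (test2.tmp is an arbitrary scratch name, not from the original answer):
sort -u test1 test2 > test2.tmp && mv test2.tmp test2
Note this deduplicates and sorts the union of both files, so the new line ends up in sorted position rather than appended at the end as in the question's sample output.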