bash: sort and remove duplicates based on column
Original question: http://stackoverflow.com/questions/17847799/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Sort and remove duplicates based on column
Asked by Yang
I have a text file:
$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.
Approach 1
$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1
It is not sorting based on the first column.
Approach 2
$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
It removes the 542,9,1,418,1 line, but I'd like to keep one copy.
It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?
Accepted answer by jaypal singh
The problem is that when you provide a key to sort, uniqueness is judged on that field alone. Since the line 542,8,1,418,1 is kept, sort treats the other two lines starting with 542 as duplicates and filters them out.
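The behaviour described above can be reproduced with a small sketch. It assumes the sample data from the question, piped in from a shell variable (`data`, a name chosen here for illustration) instead of the file `text`, and forces `LC_ALL=C` so the result does not depend on the local locale settings.

```shell
# Sample lines from the question, held in a variable instead of the file "text".
data=$(printf '%s\n' 542,8,1,418,1 542,9,1,418,1 301,34,1,689070,1 542,9,1,418,1 199,7,1,419,10)

# With -u and a key limited to field 1, two lines count as duplicates
# whenever their first fields compare equal, whatever the other fields hold.
by_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -u -k1,1n)
printf '%s\n' "$by_key"

# Only one line per distinct first field is left: 199, 301 and 542.
printf '%s\n' "$by_key" | cut -d, -f1
```

Which of the 542 lines survives is up to the sort implementation; the output in the question shows 542,8,1,418,1 being kept.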
Your best bet would be to either sort all columns:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
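As a rough check of what this command produces, the sketch below runs it on the question's sample data. The data is piped from a variable (`data`, an illustrative name) rather than read from `text`, and `LC_ALL=C` is set as an assumption to keep the comparison locale-independent.

```shell
# Sample lines from the question.
data=$(printf '%s\n' 542,8,1,418,1 542,9,1,418,1 301,34,1,689070,1 542,9,1,418,1 199,7,1,419,10)

# Every column is a numeric key, so -u only drops a line when all five
# fields compare equal to those of another line.
all_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u)
printf '%s\n' "$all_keys"
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1
```

Both distinct 542 lines are kept, and the repeated 542,9,1,418,1 line appears once.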
or
use awk to filter duplicate lines and pipe the result to sort.
awk '!_[$0]++' text | sort -t, -nk1,1
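For readers unfamiliar with the awk idiom, here is the same pipeline as a sketch with a more descriptive array name. The name `seen` is chosen here for illustration (the answer uses `_`), the data is piped from a variable rather than read from `text`, and `LC_ALL=C` is an added assumption so the tie-breaking order between lines is predictable.

```shell
# Sample lines from the question.
data=$(printf '%s\n' 542,8,1,418,1 542,9,1,418,1 301,34,1,689070,1 542,9,1,418,1 199,7,1,419,10)

# seen[$0]++ evaluates to 0 the first time a whole line ($0) appears and to a
# positive count afterwards, so !seen[$0]++ is true only for first occurrences
# and awk's default action prints just those lines.
deduped=$(printf '%s\n' "$data" | awk '!seen[$0]++' | LC_ALL=C sort -t, -k1,1n)
printf '%s\n' "$deduped"
# 199,7,1,419,10
# 301,34,1,689070,1
# 542,8,1,418,1
# 542,9,1,418,1
```

Because awk removes duplicates by whole line before sort runs, sort no longer needs -u and the key can stay limited to the first column.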
Answer by choroba
When sorting on a key, you must provide the end of the key as well; otherwise sort uses all following fields as well, and the key extends to the end of the line.
The following should work:
sort -t, -u -k1,1n text
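The effect of leaving out the key end can be seen in a minimal sketch. The three input lines below are made up for illustration and are not from the question; `LC_ALL=C` is assumed so that keys are compared byte by byte.

```shell
# Hypothetical two-column input, not taken from the question.
pairs=$(printf '%s\n' b,2 b,1 a,3)

# -k1: the key starts at field 1 and runs to the end of the line,
# so all three lines have different keys and -u removes nothing.
open_key=$(printf '%s\n' "$pairs" | LC_ALL=C sort -t, -u -k1)

# -k1,1: the key is field 1 only, so the two "b" lines compare equal
# and -u keeps just one of them.
closed_key=$(printf '%s\n' "$pairs" | LC_ALL=C sort -t, -u -k1,1)

printf 'key to end of line: %s lines\n' "$(printf '%s\n' "$open_key" | awk 'END { print NR }')"
printf 'key on field 1:     %s lines\n' "$(printf '%s\n' "$closed_key" | awk 'END { print NR }')"
```

With the open-ended key all three lines are kept; with the key closed at field 1 only two remain.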