Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/12052633/

Time: 2020-09-18 03:05:25  Source: igfitidea

Output whole line once for each unique value of a column (Bash)

Tags: bash, shell, awk, uniq

Asked by Bede Constantinides

This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print a line for each distinct value of the peptides in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file | sort | uniq -u -f 4
# Altogether omits peptides which are not unique...
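
The second attempt loses every duplicated peptide because `uniq -u` prints only lines that are not repeated, whereas plain `uniq` keeps one copy of each run of identical adjacent lines. A minimal sketch of the difference on made-up input (the letters are placeholders, not data from the question):

```shell
# Toy input: "a" occurs twice, "b" once (illustrative values only).
input=$(printf 'a\na\nb\n')

# Plain uniq collapses each run of identical adjacent lines to one copy.
collapsed=$(printf '%s\n' "$input" | uniq)

# uniq -u prints only lines that are not repeated, so "a" vanishes entirely.
only_unique=$(printf '%s\n' "$input" | uniq -u)

printf 'uniq keeps:\n%s\n' "$collapsed"
printf 'uniq -u keeps:\n%s\n' "$only_unique"
```

Both commands also compare adjacent lines only, which is why the input is sorted first in the attempts above.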

One last thing: it will need to treat peptides that are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)
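
Comparing the whole of field 2 as a string already meets this requirement, since VSSILED and VSSILEDKILSR are different strings even though one is a prefix of the other. A small sketch using an awk associative array keyed on the second field (the two input lines and the array name `seen` are invented for illustration):

```shell
# Two invented lines: the first peptide is a prefix of the second.
sample=$(printf '%s\n' \
  'pep> VSSILED        1   genes X.1' \
  'pep> VSSILEDKILSR   3   genes X.2')

# Array keys are compared as whole strings, so both lines are kept.
kept=$(printf '%s\n' "$sample" | awk '!seen[$2]++')

printf '%s\n' "$kept"
```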

Answered by flolo

Just use sort:

sort -k 2,2 -u file

The -u removes duplicate entries (as you wanted), and the -k 2,2 makes field 2 the only sort key (and so ignores the rest of the line when checking for duplicates).
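
A quick check of this command on the question's sample data, as a sketch: the file name `peptides.txt` is an assumption. Note that the result comes out ordered by peptide rather than in input order, and which of several duplicate lines survives is left to the sort implementation, so only the counts are inspected here.

```shell
# Write the question's sample data to a scratch file (file name is an assumption).
cat > peptides.txt <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Keep one line per distinct value of field 2.
deduped=$(sort -k 2,2 -u peptides.txt)
printf '%s\n' "$deduped"

# Extract the surviving peptides to confirm that each appears once.
peptides=$(printf '%s\n' "$deduped" | awk '{print $2}')
```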

Answered by Steve

One way using awk:

awk '!array[$2]++' file.txt

Results:

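pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750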
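
The awk idiom `!array[$2]++` relies on awk's default action: a pattern with no action block prints the line whenever the pattern is true. The expression `array[$2]++` yields the count seen so far (zero on first sight) and then increments it, so its negation is true only the first time a peptide appears. The sketch below spells that out in long form; the array name `seen` and the three-line sample are my own choices.

```shell
# Three lines from the question's sample (AIQLTGK repeats).
sample=$(printf '%s\n' \
  'pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560' \
  'pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560' \
  'pep> KHEPPTEVDIEGR  5   genes ADUm.367')

# Long form of the idiom: print a line only on the first sighting of field 2.
first_seen=$(printf '%s\n' "$sample" | awk '{
  if (seen[$2] == 0) print   # count is still zero: first occurrence
  seen[$2]++                 # remember this peptide
}')

printf '%s\n' "$first_seen"
```

Unlike the sort-based approach, this keeps the first occurrence of each peptide and preserves input order.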

Answered by choroba

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The n switch works through the input line by line; the a switch splits each line into the @F array.

Answered by Vijay

awk '{if($2==temp){next;}else{print}temp=$2}' your_file

Tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
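
This approach compares each line only with the one directly before it, so a repeated peptide survives whenever its occurrences are not adjacent (the `AIQLTGK 10` line in the sample input is such a case). One possible workaround, sketched here under the assumption that reordering the output is acceptable, is to sort on field 2 first. The `-s` (stable) flag is a GNU/BSD sort option rather than a POSIX one; it keeps lines with equal keys in their input order.

```shell
# Sample lines from the question where AIQLTGK repeats non-adjacently.
sample=$(printf '%s\n' \
  'pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560' \
  'pep> KHEPPTEVDIEGR  5   genes ADUm.367' \
  'pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560')

# Adjacent-only comparison: nothing is dropped, because the repeats are not neighbours.
adjacent_only=$(printf '%s\n' "$sample" |
  awk '{if($2==temp){next;}else{print}temp=$2}')

# Sorting on field 2 first makes the repeats adjacent, so the same awk removes them.
sorted_first=$(printf '%s\n' "$sample" | sort -s -k 2,2 |
  awk '{if($2==temp){next;}else{print}temp=$2}')

printf '%s\n' "$sorted_first"
```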