Output whole line once for each unique value of a column (Bash)

Question

Asked by Bede Constantinides

This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line for each distinct peptide value in column 2, meaning the above input would become:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I've tried so far, but clearly neither does what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file | sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: it will need to treat peptides which are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)

Answer 1

Answered by flolo

Just use sort:

sort -k 2,2 -u file

The -u removes duplicate entries (as you wanted), and the -k 2,2 makes field 2 alone the sort key (so the rest of the line is ignored when checking for duplicates).
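
To see the effect of -u and -k 2,2 on a small case, here is a minimal sketch. The file name peptides.txt and the VSSILED line are made up for illustration, and LC_ALL=C is set only so that the ordering is predictable. Note that the result comes out ordered by column 2 rather than in input order, and which of several duplicate lines survives may depend on the sort implementation.

```shell
# Hypothetical sample file; the name peptides.txt and the VSSILED row are assumptions.
cat > peptides.txt <<'EOF'
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> VSSILED        4   genes ADUm.2146
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# -k 2,2 restricts the sort key to field 2; -u keeps one line per distinct key.
# LC_ALL=C makes the comparison byte-wise so the ordering is predictable.
sorted_unique=$(LC_ALL=C sort -k 2,2 -u peptides.txt)
printf '%s\n' "$sorted_unique"

# Count the surviving lines: one per distinct peptide.
unique_count=$(printf '%s\n' "$sorted_unique" | wc -l)
```

Here VSSILED and VSSILEDKILSR are kept as separate peptides, because the key is the whole of field 2 rather than a prefix of it.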

Answer 2

Answered by Steve

One way using awk:

awk '!array[$2]++' file.txt

Results:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
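
The awk one-liner is terse, so here is a sketch of what it appears to do when written out long-hand: an array keyed on column 2 counts how often each peptide has been seen, and a line is printed only while that count is still zero. The file name peptides.txt, the array name seen, and the sample rows are assumptions for illustration.

```shell
# Hypothetical sample file; the name peptides.txt and the VSSILED row are assumptions.
cat > peptides.txt <<'EOF'
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> VSSILED        4   genes ADUm.2146
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# Short form: the pattern is true only the first time a value of $2 is met.
short_form=$(awk '!array[$2]++' peptides.txt)

# Long-hand form of the same idea.
first_seen=$(awk '{
    if (seen[$2] == 0) {   # an unset array element compares equal to 0
        print              # first occurrence: print the whole line
    }
    seen[$2]++             # count this occurrence of the peptide
}' peptides.txt)

printf '%s\n' "$first_seen"
```

Unlike the sort approach, this keeps the lines in their original order and always keeps the first line seen for each peptide.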

Answer 3

Answered by choroba

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The n switch processes the input line by line, and the a switch splits each line into the @F array.

Answer 4

Answered by Vijay

awk '{if($2==temp){next;}else{print}temp=$2}' your_file

Tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
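
The tested output above still lists AIQLTGK twice, which suggests that comparing against only the previous line removes adjacent duplicates and nothing else. A minimal sketch of that behaviour follows, along with one possible workaround: group equal peptides with sort first, so that every duplicate becomes adjacent. The file name peptides.txt and the variable name prev are assumptions, and the workaround gives up the original line order.

```shell
# Hypothetical sample file with one non-adjacent duplicate (AIQLTGK on the last line).
cat > peptides.txt <<'EOF'
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
EOF

# Comparing only against the previous line: the non-adjacent AIQLTGK line survives.
adjacent_only=$(awk '{ if ($2 == prev) next; print; prev = $2 }' peptides.txt)

# Sorting on column 2 first makes every duplicate adjacent before the same filter runs.
grouped=$(LC_ALL=C sort -k 2,2 peptides.txt | awk '{ if ($2 == prev) next; print; prev = $2 }')

printf 'previous-line comparison only:\n%s\n' "$adjacent_only"
printf 'sorted first:\n%s\n' "$grouped"
```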
