Uniq in awk: removing duplicate values in a column using awk

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/2978361/

Date: 2020-09-17 22:10:39 · Source: igfitidea


Tags: bash, awk, unique

Asked by D W

I have a large datafile in the following format below:


ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:


ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code below but it doesn't seem to remove the duplicate values.


awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray, ",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?


Answered by Paused until further notice.

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.


The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves having to iterate over both arrays in a loop within a loop.


The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three, so I added an if to keep it from printing a null value, which would result in ",WDR78," being printed if the if weren't there.

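This split behavior is easy to check from the shell; a minimal sketch using any POSIX awk:

```shell
# A trailing comma makes split() report an extra, empty field:
# "WDR78,WDR78,WDR78," splits into 4 elements, the last one empty.
awk 'BEGIN {
    n = split("WDR78,WDR78,WDR78,", a, ",")
    print n                 # 4, not 3
    printf "[%s]\n", a[4]   # prints [] -- the empty trailing field
}'
```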

* In reality all arrays in AWK are associative.

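The index-versus-value distinction is worth seeing in isolation; this sketch shows that in matches only indices:

```shell
# The awk 'in' operator tests indices (keys), never stored values.
awk 'BEGIN {
    arr[0] = "WDR78"
    found_value = ("WDR78" in arr)   # 0: "WDR78" is a value, not an index
    found_index = (0 in arr)         # 1: 0 is an index
    print found_value, found_index
}'
```

This is why storing the values as indices (duplicateArray[valueArray[i]] = 1) makes the membership test work.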

awk '
BEGIN { FS="\t" } ;
{
  split($2, valueArray, ",");
  j=0;
  for (i in valueArray)
  { 
    if (!(valueArray[i] in duplicateArray))
    { 
      duplicateArray[valueArray[i]] = 1
    }
  };
  printf $1 "\t";
  for (j in duplicateArray)
  {
    if (j)    # prevents printing an extra comma
    {
      printf j ",";
    }
  }
  printf "\t";
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'

Answered by leonbloy

Sorry, I know you asked about awk... but Perl makes this much more simple:


$ perl -n -e ' @t = split(/\t/);
  %t2 = map { $_ => 1 } split(/,/,$t[1]);
  $t[1] = join(",",keys %t2);
  print join("\t",@t); ' knownGeneFromUCSC.txt

Answered by Dimitre Radoulov

Perl:


perl -F'\t' -lane'
  $F[1] = join ",", grep !$_{$_}++, split ",", $F[1]; 
  print join "\t", @F; %_ = ();
  ' infile  

awk:


awk -F'\t' '{
  n = split($2, t, ","); _2 = x
  split(x, _) # use delete _ if supported
  for (i = 0; ++i <= n;)
    _[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
  $2 = _2 
  }-3' OFS='\t' infile

Line 4 of the awk script preserves the original order of the values in the second field while filtering out duplicates.

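The order-preserving step can be sketched on its own. The test _[t[i]]++ (more commonly written seen[x]++) is zero, i.e. false, only the first time a value appears, so each value is appended exactly once, in first-seen order:

```shell
# First-seen-order dedup of a comma-separated list.
awk 'BEGIN {
    n = split("RERE,KIAA0458,RERE,KIAA0458", t, ",")
    for (i = 1; i <= n; i++)
        if (!seen[t[i]]++)                       # true only on first sight
            out = out ? out "," t[i] : t[i]
    print out   # RERE,KIAA0458
}'
```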

Answered by Fritz G. Mehner

Pure Bash 4.0 (one associative array):


declare -a part                            # parts of a line
declare -a part2                           # parts 2. column
declare -A check                           # used to remember items in part2

while read  line ; do
  part=( $line )                           # split line using whitespaces
  IFS=','                                  # separator is comma
  part2=( ${part[1]} )                     # split 2. column using comma
  if [ ${#part2[@]} -gt 1 ] ; then         # more than 1 field in 2. column?
    check=()                               # empty check array
    new2=''                                # empty new 2. column
    for item in ${part2[@]} ; do 
      (( check[$item]++ ))                 # remember items in 2. column
      if [ ${check[$item]} -eq 1 ] ; then  # not yet seen?
        new2=$new2,$item                   # add to new 2. column
      fi 
    done
    part[1]=${new2#,}                      # remove leading comma
  fi 
  IFS=$'\t'                                # separator for the output
  echo "${part[*]}"                        # rebuild line
done < "$infile"
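The counting trick in the loop above can be exercised on its own (requires Bash 4+ for declare -A; the inline list is a stand-in for one value of the second column):

```shell
#!/usr/bin/env bash
# Deduplicate one comma-separated list with an associative counter,
# keeping first-seen order, as in the loop above.
declare -A check
new2=''
IFS=',' read -r -a part2 <<< 'WDR78,WDR78,WDR78'
for item in "${part2[@]}"; do
    (( check[$item]++ ))                    # count occurrences
    if [ "${check[$item]}" -eq 1 ]; then    # first time seen?
        new2=$new2,$item
    fi
done
echo "${new2#,}"    # WDR78
```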