Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/4921879/

Time: 2020-09-09 20:10:06  Source: igfitidea

Getting the count of unique values in a column in bash

bash command-line frequency

Asked by sfactor

I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?

It can use any common command line language like awk, perl, python etc.

Answered by Paused until further notice.

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

  3 d
  2 r
  1 z
  1 m
  1 g
  1 b
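
For comparison, the same column-two count can be sketched in Python, which the question also allows. This is a minimal sketch and not part of the original answer: the function name column_frequencies and the inlined sample rows are illustrative assumptions. Ties are broken alphabetically here, so values with equal counts may come out in a different order than sort -nr prints them.

```python
from collections import Counter

def column_frequencies(lines, column=2, delimiter="\t"):
    """Count how often each value appears in a 1-based column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Assumption: lines with too few fields are skipped rather than
        # counted as an empty value (awk would print an empty string).
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    # Highest count first; equal counts are ordered alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# The nine rows of fileA.txt, fileB.txt and fileC.txt, tab-separated.
sample = [
    "z\tz\ta", "a\tb\tc", "w\td\te",
    "t\tr\te", "z\td\ta", "a\tg\tc",
    "z\tr\ta", "v\td\tc", "a\tm\tc",
]

for value, count in column_frequencies(sample):
    print(f"{count:>3} {value}")
```

To run it over real files, one option is to pass fileinput.input() from the standard library instead of sample; it iterates over the lines of every file named on the command line.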

Answered by Thedward

Here is a way to do it in the shell:

FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.

Answered by Adam Matan

The GNU site suggests this nice awk script, which prints both the words and their frequencies.

Possible changes:

  • You can pipe the output through sort -nr (after swapping word and freq[word] in the printf, so the count comes first) to see the result in descending order.
  • If you want a specific column, you can omit the for loop and simply write freq[$3]++ - replace 3 with the column number.

Here goes:

# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
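
As a rough Python counterpart of the wordfreq.awk logic (lowercase each line, strip punctuation, count every whitespace-separated field), something like the following could work. It is only a sketch: word_frequencies is a hypothetical name, and Python's \w is Unicode-aware, so the punctuation filter is close to, but not identical to, the awk character classes.

```python
import re
from collections import Counter

# Drop every character that is not a letter, digit, underscore or whitespace.
_PUNCTUATION = re.compile(r"[^\w\s]")

def word_frequencies(lines):
    """Return a Counter mapping each lowercased word to its frequency."""
    freq = Counter()
    for line in lines:
        cleaned = _PUNCTUATION.sub("", line.lower())
        freq.update(cleaned.split())   # split() splits on runs of whitespace
    return freq

# Illustrative input, not taken from the original answer.
text = ["The cat, the CAT!", "a_b the"]
for word, count in word_frequencies(text).most_common():
    print(f"{word}\t{count}")
```

Counter.most_common() already returns the entries with the highest count first, so no separate sort step is needed for descending order.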

Answered by Chris Koknat

Perl

This code counts the occurrences of values in all columns, and prints a sorted report for each column:

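# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a}  ||  $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}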

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Explanation

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure

In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
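
The two loops described above can also be sketched in Python, using a list of Counter objects in place of the Perl array of hashes. The names all_column_frequencies and report are illustrative assumptions, and str.split() is assumed to be close enough to the Perl split /\s+/ (the two differ on lines that begin with whitespace).

```python
from collections import Counter

def all_column_frequencies(lines, delimiter=None):
    """Return one Counter per column (0-based, like the Perl script)."""
    result = []                              # one Counter per column
    for line in lines:
        fields = line.split(delimiter)       # None: split on runs of whitespace
        for i, value in enumerate(fields):
            if i == len(result):             # first time this column is seen
                result.append(Counter())
            result[i][value] += 1
    return result

def report(result):
    """Format the counts: by count descending, then by value ascending."""
    out = []
    for j, counts in enumerate(result):
        out.append(f"column {j}:")
        for value, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            out.append(f" {value} {n}")
    return out

# The nine sample rows from the three input files, tab-separated.
rows = [
    "z\tz\ta", "a\tb\tc", "w\td\te",
    "t\tr\te", "z\td\ta", "a\tg\tc",
    "z\tr\ta", "v\td\tc", "a\tm\tc",
]
print("\n".join(report(all_column_frequencies(rows))))
```

The sort key (-count, value) plays the role of the Perl comparison $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b: the primary order is the count, highest first, and the value itself decides ties.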

Results based on the sample input files provided by @Dennis

column 0:
 a 3
 z 3
 t 1
 v 1
 w 1
column 1:
 d 3
 r 2
 b 1
 g 1
 m 1
 z 1
column 2:
 c 4
 a 3
 e 2

.csv input

If your input files are .csv, change /\s+/ to /,/
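
One caveat worth noting: splitting on a bare comma breaks when a quoted field itself contains a comma. A minimal Python sketch using the standard csv module, which handles that kind of quoting, might look like this. The function name csv_column_frequencies and the sample rows are made up for illustration.

```python
import csv
from collections import Counter

def csv_column_frequencies(lines, column=2):
    """Count the values of a 1-based column in CSV-formatted lines."""
    counts = Counter()
    for row in csv.reader(lines):            # accepts any iterable of strings
        if len(row) >= column:               # skip rows that are too short
            counts[row[column - 1]] += 1
    return counts

# The quoted field "x, y" contains a comma and must stay one value.
rows = ['a,"x, y",c', 'b,"x, y",d', 'c,z,e']
for value, count in csv_column_frequencies(rows).most_common():
    print(count, value)
```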

Obfuscation

In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:

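perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*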

Answered by kurumi

Ruby (1.9+)

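#!/usr/bin/env ruby
Dir["*"].each do |file|
    h=Hash.new(0)
    open(file).each do |row|
        row.chomp.split("\t").each do |w|
            h[ w ] += 1
        end
    end
    h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end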