bash: Getting the count of unique values in a column in bash

Question

Asked by sfactor

I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?

It can use any common command line language like awk, perl, python etc.

Answer 1

Answered by Paused until further notice.

To see a frequency count for column two (for example):

awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr

fileA.txt

z    z    a
a    b    c
w    d    e

fileB.txt

t    r    e
z    d    a
a    g    c

fileC.txt

z    r    a
v    d    c
a    m    c

Result:

      3 d
      2 r
      1 z
      1 m
      1 g
      1 b
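For comparison, the same count can be sketched in Python, one of the languages the question allows. This is a minimal sketch and not part of the original answer: the function name count_column and the in-memory sample list are assumptions made for illustration, standing in for the lines of the three sample files.

```python
from collections import Counter


def count_column(lines, column=2, sep="\t"):
    """Count how often each value appears in the given 1-based column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= column:  # skip lines that are too short
            counts[fields[column - 1]] += 1
    return counts


# Rows of fileA.txt, fileB.txt and fileC.txt, written with explicit tabs.
sample = [
    "z\tz\ta", "a\tb\tc", "w\td\te",
    "t\tr\te", "z\td\ta", "a\tg\tc",
    "z\tr\ta", "v\td\tc", "a\tm\tc",
]

counts = count_column(sample, column=2)
# Highest count first; ties are broken alphabetically by value.
for value, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{n:7d} {value}")
```

To run it over real files, the sample list could be replaced by fileinput.input(), which iterates over the lines of every file named on the command line.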

Answer 2

Answered by Thedward

Here is a way to do it in the shell:

FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr

This is the sort of thing bash is great at.

Answer 3

Answered by Adam Matan

The GNU site suggests this nice awk script, which prints both the words and their frequency.

Possible changes:

You can pipe through sort -nr (and swap word and freq[word] in the printf so the count comes first) to see the result in descending order.
If you want a specific column, you can omit the for loop and simply write freq[$3]++ (replace 3 with the column number).

Here goes:

 # wordfreq.awk --- print list of word frequencies

 {
     $0 = tolower($0)    # remove case distinctions
     # remove punctuation
     gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
     for (i = 1; i <= NF; i++)
         freq[$i]++
 }

 END {
     for (word in freq)
         printf "%s\t%d\n", word, freq[word]
 }
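
The same lowercase, strip-punctuation, count-every-word idea can be sketched in Python. This is only an illustration under stated assumptions: the function name word_frequencies and the two sample sentences are made up here, and the character class is ASCII-only, whereas awk's [:alnum:] follows the locale.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Count words after lowercasing and stripping punctuation."""
    freq = Counter()
    for line in lines:
        line = line.lower()                        # remove case distinctions
        line = re.sub(r"[^a-z0-9_ \t]", "", line)  # remove punctuation (ASCII only)
        freq.update(line.split())                  # one count per whitespace-separated word
    return freq


freq = word_frequencies(["The cat, the dog.", "A dog!"])
# most_common() yields (word, count) pairs, highest count first.
for word, count in freq.most_common():
    print(f"{word}\t{count}")
```

Using most_common() covers the descending-order variation without a separate sort step.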

Answer 4

Answered by Chris Koknat

Perl

This code computes the occurrences of all columns, and prints a sorted report for each of them:

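# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a}  ||  $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}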

Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*

Explanation

In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure

In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
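
The two loops described above can also be sketched in Python, with a list of Counter objects playing the role of the array of hashes. This is a hedged illustration, not the author's code: the names column_report and sorted_values are hypothetical, and the sample rows are the three sample files written with single spaces.

```python
from collections import Counter


def column_report(lines):
    """Return one Counter per column, built from whitespace-split lines."""
    result = []  # result[i] counts the values seen in column i
    for line in lines:
        for i, value in enumerate(line.split()):
            if i == len(result):  # first time this column index is seen
                result.append(Counter())
            result[i][value] += 1
    return result


def sorted_values(counter):
    """Sort by count (descending), then by value (ascending)."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))


sample = [
    "z z a", "a b c", "w d e",
    "t r e", "z d a", "a g c",
    "z r a", "v d c", "a m c",
]

for j, counter in enumerate(column_report(sample)):
    print(f"column {j}:")
    for value, n in sorted_values(counter):
        print(f" {value} {n}")
```

The sort key (-count, value) expresses the primary and secondary sort in one tuple, which corresponds to the <=> followed by cmp in the Perl comparison.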

Results based on the sample input files provided by @Dennis

column 0:
 a 3
 z 3
 t 1
 v 1
 w 1
column 1:
 d 3
 r 2
 b 1
 g 1
 m 1
 z 1
column 2:
 c 4
 a 3
 e 2

.csv input

If your input files are .csv, change /\s+/ to /,/
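
Splitting on a bare comma treats every comma as a separator, including commas inside quoted fields. Where that matters, a CSV parser is the safer choice. The following is a minimal sketch using Python's csv module; the function name count_csv_column and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_csv_column(text_stream, column=2):
    """Count values in a 1-based column of CSV data, honoring quoted fields."""
    counts = Counter()
    for row in csv.reader(text_stream):
        if len(row) >= column:  # skip rows that are too short
            counts[row[column - 1]] += 1
    return counts


# Hypothetical sample: the second field of the first two rows contains a comma.
data = io.StringIO('a,"x, y",c\nb,"x, y",d\nc,z,e\n')
counts = count_csv_column(data, column=2)
for value, n in counts.most_common():
    print(n, value)
```

For a real file, the StringIO object would be replaced by a file opened with open(path, newline=""), which is how the csv module documentation recommends opening CSV files.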

Obfuscation

In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:

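perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*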

Answer 5

Answered by kurumi

Ruby (1.9+)

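#!/usr/bin/env ruby
Dir["*"].each do |file|
    h=Hash.new(0)
    open(file).each do |row|
        row.chomp.split("\t").each do |w|
            h[ w ] += 1
        end
    end
    h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end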
