Command line utility to print statistics of numbers in Linux

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/9789806/

command line utility to print statistics of numbers in linux

linux command-line statistics

Asked by MK.

I often find myself with a file that has one number per line. I end up importing it in excel to view things like median, standard deviation and so forth.

Is there a command line utility in linux to do the same? I usually need to find the average, median, min, max and std deviation.

Accepted answer by Matt Parker

This is a breeze with R. For a file that looks like this:

1
2
3
4
5
6
7
8
9
10

Use this:

R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"

To get this:

       V1       
 Min.   : 1.00  
 1st Qu.: 3.25  
 Median : 5.50  
 Mean   : 5.50  
 3rd Qu.: 7.75  
 Max.   :10.00  
[1] 3.02765
  • The -q flag squelches R's startup licensing and help output
  • The -e flag tells R you'll be passing an expression from the terminal
  • x is a data.frame, basically a table. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
  • Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
  • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns (see the sketch after this list).
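
If you only need the handful of statistics from the question rather than the full summary() table, a minimal sketch in the same spirit (assuming the data is still one number per line in nums.txt, with no header) could be:

Rscript -e 'x <- scan("nums.txt", quiet = TRUE); cat("mean", mean(x), "median", median(x), "min", min(x), "max", max(x), "sd", sd(x), "\n")'

Because scan() returns a plain numeric vector rather than a data.frame, sd() can be applied to it directly, without the x[ , 1] indexing described above.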

Answer by Skippy le Grand Gourou

For the average, median & standard deviation you can use awk. This will generally be faster than R solutions. For instance, the following will print the average:

awk '{a+=$1} END{print a/NR}' myfile

(NR is an awk variable for the number of records; $1 means the first (space-separated) field of the line ($0 would be the whole line, which would also work here but is in principle less robust, although for this computation it would probably just take the first field anyway); and END means that the following commands are executed after the whole file has been processed (one could also have initialized a to 0 in a BEGIN{a=0} statement).)

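Written out with the explicit BEGIN/END blocks just mentioned, the same average computation would look like this (a sketch only; it does nothing beyond the one-liner above):

awk 'BEGIN {a = 0} {a += $1} END {print a/NR}' myfile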

Here is a simple awk script which provides more detailed statistics (it takes a CSV file as input; otherwise change FS):

#!/usr/bin/awk -f

BEGIN {
    FS=",";
}
{
   a += $1;
   b[++i] = $1;
}
END {
    m = a/NR; # mean
    for (i in b)
    {
        d += (b[i]-m)^2;
        e += (b[i]-m)^3;
        f += (b[i]-m)^4;
    }
    va = d/NR; # variance
    sd = sqrt(va); # standard deviation
    sk = (e/NR)/sd^3; # skewness
    ku = (f/NR)/sd^4-3; # standardized kurtosis
    print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
    print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
}

It is straightforward to add min/max to this script, but it is just as easy to pipe to sort and head/tail:

sort -n myfile | head -n1
sort -n myfile | tail -n1
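
If you prefer to avoid the extra sort passes, a sketch of tracking min and max within a single awk pass (again assuming the numbers are in the first column) might look like:

awk 'NR == 1 {min = max = $1}
     {if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1}
     END {print "min = " min ", max = " max ", mean = " sum/NR}' myfile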

Answer by bua

Yep, it's called perl, and here is a concise one-liner:

perl -e 'use List::Util qw(max min sum); @a=();while(<>){$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$a=$s/@a;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$a\nstd:$std\nmed:$med\nmax:$m\nmin:$mm";'

Example

$ cat tt
1
3
4
5
6.5
7.
2
3
4

And the command

cat tt | perl -e 'use List::Util qw(max min sum); @a=();while(<>){$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$a=$s/@a;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$a\nstd:$std\nmed:$med\nmax:$m\nmin:$mm";'
records:9
sum:35.5
avg:3.94444444444444
std:1.86256162380447
med:4
max:7.
min:1

Answer by tchrist

#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# [email protected]

use strict;
use warnings;

use List::Util qw< min max >;

#
my $number_rx = qr{

  # leading sign, positive or negative
    (?: [+-] ? )

  # mantissa
    (?= [0123456789.] )
    (?: 
        # "N" or "N." or "N.N"
        (?:
            (?: [0123456789] +     )
            (?:
                (?: [.] )
                (?: [0123456789] * )
            ) ?
      |
        # ".N", no leading digits
            (?:
                (?: [.] )
                (?: [0123456789] + )
            ) 
        )
    )

  # abscissa
    (?:
        (?: [Ee] )
        (?:
            (?: [+-] ? )
            (?: [0123456789] + )
        )
        |
    )
}x;

my $n = 0;
my $sum = 0;
my @values = ();

my %seen = ();

while (<>) {
    while (/($number_rx)/g) {
        $n++;
        my $num = 0 + $1;  # 0+ is so numbers in alternate form count as same
        $sum += $num;
        push @values, $num;
        $seen{$num}++;
    } 
} 

die "no values" if $n == 0;

my $mean = $sum / $n;

my $sqsum = 0;
for (@values) {
    $sqsum += ( $_ ** 2 );
} 
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
my $stdev = sqrt($sqsum);

my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

my $mode = @modes == 1 
            ? $modes[0] 
            : "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;

my @sorted = sort { $a <=> $b } @values;  # median requires numerically sorted values
my $median;
my $mid = int @sorted/2;
if (@sorted % 2) {
    $median = $sorted[ $mid ];
} else {
    $median = ($sorted[$mid-1] + $sorted[$mid])/2;
} 

my $min = min @values;
my $max = max @values;

printf "n is %d, min is %g, max is %d\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n", 
    $mode, $median, $mean, $stdev;
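
A hypothetical way to run it, assuming you have saved the script as stdev and made it executable (the filename here is just an example, not part of the original answer):

chmod +x stdev
./stdev < numbers.txt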

Answer by ghoti

Mean:

awk '{sum += $1} END {print "mean = " sum/NR}' filename

Median:

gawk -v max=128 '

    function median(c,v,    j) { 
       asort(v,j) 
       if (c % 2) return j[(c+1)/2]
       else return (j[c/2+1]+j[c/2])/2.0
    }

    { 
       count++
       values[count]=$1
       if (count >= max) { 
         print  median(count,values); count=0
       } 
    } 

    END { 
       print  "median = " median(count,values)
    }
    ' filename

Mode:

awk '{c[$1]++} END {for (i in c) {if (c[i] > best) {best = c[i]; mode = i}} print "mode = " mode}' filename

This mode calculation requires an even number of samples, but you see how it works...

Standard Deviation:

awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2)}' filename
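
Note that this is the population standard deviation (dividing by NR). A sketch of the sample standard deviation variant, under the same assumption that the numbers are in the first column, would divide by NR-1 instead:

awk '{sum += $1; sumsq += $1*$1} END {print "sample stdev = " sqrt((sumsq - sum*sum/NR)/(NR-1))}' filename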

Answer by Tommaso

Just in case, there's datastat, a simple program for Linux that computes simple statistics from the command line. For example,

cat file.dat | datastat

will output the average value across all rows for each column of file.dat. If you need to know the standard deviation, min, and max, you can add the --dev, --min and --max options, respectively.

datastat also has the ability to aggregate rows based on the value of one or more "key" columns. For example,

cat file.dat | datastat -k 1

will produce, for each distinct value found in the first column (the "key"), the average of every other column aggregated over all rows sharing that key value. You can use more columns as key fields (e.g., -k 1-3, -k 2,4, etc.).

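As a small illustration of the key-column aggregation described above (the file contents are made up for this example, and whitespace-separated columns are assumed): if file.dat contained the rows "a 1", "a 3" and "b 10", then

cat file.dat | datastat -k 1

would report one aggregated row for key a (the average of 1 and 3) and one for key b.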

It's written in C++, runs fast with a small memory footprint, and can be piped nicely with other tools such as cut, grep, sed, sort, awk, etc.

Answer by user2747481

Using "st" (https://github.com/nferraz/st)

$ st numbers.txt
N    min   max   sum   mean  stddev
10   1     10    55    5.5   3.02765

Or:

$ st numbers.txt --transpose
N      10
min    1
max    10
sum    55
mean   5.5
stddev 3.02765

(DISCLAIMER: I wrote this tool :))

Answer by Tom

There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, one would have to type one of:

r summary file.txt
r summary - < file.txt
cat file.txt | r summary -

For each of average, median, min, max and std deviation, the code would be:

seq 1 100 | r mean - 
seq 1 100 | r median -
seq 1 100 | r min -
seq 1 100 | r max -
seq 1 100 | r sd -

Doesn't get any simple-R!

Answer by Matt Parker

data_hacks is a Python command-line utility for basic statistics.

The first example from that page produces the desired results:

$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
    1.0000 -     1.9000 [     1]: *
    1.9000 -     2.8000 [     5]: *****
    2.8000 -     3.7000 [     8]: ********
    3.7000 -     4.6000 [     3]: ***
    4.6000 -     5.5000 [     4]: ****
    5.5000 -     6.4000 [     2]: **
    6.4000 -     7.3000 [     3]: ***
    7.3000 -     8.2000 [     1]: *
    8.2000 -     9.1000 [     1]: *
    9.1000 -    10.0000 [     1]: *

Answer by dpmcmlxxvi

You might also consider using clistats. It is a highly configurable command line interface tool to compute statistics for a stream of delimited input numbers.

I/O options

  • Input data can be from a file, standard input, or a pipe
  • Output can be written to a file, standard output, or a pipe
  • Output uses headers that start with "#" to enable piping to gnuplot

Parsing options

  • Signal, end-of-file, or blank line based detection to stop processing
  • Comment and delimiter character can be set
  • Columns can be filtered out from processing
  • Rows can be filtered out from processing based on numeric constraint
  • Rows can be filtered out from processing based on string constraint
  • Initial header rows can be skipped
  • Fixed number of rows can be processed
  • Duplicate delimiters can be ignored
  • Rows can be reshaped into columns
  • Strictly enforce that only rows of the same size are processed
  • A row containing column titles can be used to title output statistics

Statistics options

  • Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
  • Covariance
  • Correlation
  • Least squares offset
  • Least squares slope
  • Histogram
  • Raw data after filtering
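
A hypothetical invocation, relying only on the statement above that input can come from standard input or a pipe (the exact flags and output format are not shown in this answer):

seq 1 100 | clistats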

NOTE: I'm the author.
