Original question: http://stackoverflow.com/questions/9387751/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Calculate mean, variance and range using Bash script
Asked by Brian James
Given a file file.txt:
AAA 1 2 3 4 5 6 3 4 5 2 3
BBB 3 2 3 34 56 1
CCC 4 7 4 6 222 45
Does anyone have any ideas on how to calculate the mean, variance and range for each item (AAA, BBB and CCC) using a Bash script? Thanks.
Answered by Adam Liss
Here's a solution with awk, which calculates:
- minimum = smallest value on each line
- maximum = largest value on each line
- average = μ = sum of all values on each line, divided by the count of the numbers.
- variance = 1/n × [(Σx)² - Σ(x²)] where
n = number of values on the line = NF - 1 (in awk, NF = number of fields on the line)
(Σx)² = square of the sum of the values on the line
Σ(x²) = sum of the squares of the values on the line
awk '{
  min = max = sum = $2;       # Initialize to the first value (2nd field)
  sum2 = $2 * $2              # Running sum of squares
  for (n=3; n <= NF; n++) {   # Process each value on the line
    if ($n < min) min = $n    # Current minimum
    if ($n > max) max = $n    # Current maximum
    sum += $n;                # Running sum of values
    sum2 += $n * $n           # Running sum of squares
  }
  print $1 ": min=" min ", avg=" sum/(NF-1) ", max=" max ", var=" ((sum*sum) - sum2)/(NF-1);
}' filename
Output:
AAA: min=1, avg=3.45455, max=6, var=117.273
BBB: min=1, avg=16.5, max=56, var=914.333
CCC: min=4, avg=48, max=222, var=5253
Note that you can save the awk script (everything between, but not including, the single quotes) in a file, say one called script, and execute it with awk -f script filename.
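As a minimal sketch of the awk -f approach, the snippet below writes the sample data and a trimmed version of the program (minimum, average and maximum only) to files and runs it. The names stats.awk and workdir are illustrative assumptions, and the "+ 0" is added here only to force numeric comparison.

```shell
# Sketch only: stats.awk and the temporary directory are illustrative names.
workdir=$(mktemp -d)

# Sample data from the question.
cat > "$workdir/file.txt" <<'EOF'
AAA 1 2 3 4 5 6 3 4 5 2 3
BBB 3 2 3 34 56 1
CCC 4 7 4 6 222 45
EOF

# The awk program, stored without the surrounding single quotes.
cat > "$workdir/stats.awk" <<'EOF'
{
    min = max = sum = $2 + 0          # first value is the 2nd field
    for (n = 3; n <= NF; n++) {
        v = $n + 0                    # force numeric comparison
        if (v < min) min = v
        if (v > max) max = v
        sum += v
    }
    print $1 ": min=" min ", avg=" sum / (NF - 1) ", max=" max
}
EOF

stats_out=$(awk -f "$workdir/stats.awk" "$workdir/file.txt")
echo "$stats_out"
rm -rf "$workdir"
```

Keeping the program in its own file avoids shell quoting problems and makes it easier to extend.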
Answered by kev
You can use python:
$ AAA() { echo "$@" | python -c 'from sys import stdin; nums = [float(i) for i in stdin.read().split()]; print(sum(nums)/len(nums))'; }
$ AAA 1 2 3 4 5 6 3 4 5 2 3
3.45454545455
Answered by user unknown
Part 1 (mean):
mean () {
    len=$#
    echo $* | tr " " "\n" | sort -n | head -n $(((len+1)/2)) | tail -n 1
}
nMean () {
    echo -n "$1 "
    shift
    mean $*
}
mean usage:
nMean AAA 3 4 5 6 3 4 3 6 2 4
AAA 4
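The mean function in this answer sorts the values and picks the middle element, which is the median rather than the arithmetic mean. If the arithmetic mean is what you want, a minimal sketch in the same integer-only style could look like this (arith_mean is a hypothetical name, not part of the original answer):

```shell
# arith_mean is a hypothetical helper: integer arithmetic mean of its arguments.
arith_mean () {
    local sum=0 n
    [ $# -eq 0 ] && return 1          # nothing to average
    for n in "$@"; do
        sum=$((sum + n))              # running sum
    done
    echo $((sum / $#))                # integer division drops the fraction
}

arith_mean 1 5 9                      # prints 5
arith_mean 3 4 5 6 3 4 3 6 2 4        # 40 / 10, prints 4
```

For these two inputs the median and the arithmetic mean happen to coincide; for skewed data such as the BBB line they differ.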
Part 2 (variance):
variance () {
    count=$1
    avg=$2
    shift
    shift
    sum=0
    for n in $*
    do
        diff=$((avg-n))
        quad=$((diff*diff))
        sum=$((sum+quad))
    done
    echo $((sum/count))
}
sum () {
    form="$(echo $*)"
    formula=${form// /+}
    echo $((formula))
}
nVariance () {
    echo -n "$1 "
    shift
    count=$#
    s=$(sum $*)
    avg=$((s/$count))
    var=$(variance $count $avg $*)
    echo $var
}
usage:
nVariance AAA 3 4 5 6 3 4 3 6 2 4
AAA 1
Part 3 (range):
range () {
    min=$1
    max=$1
    for p in $* ; do
        (( $p < $min )) && min=$p
        (( $p > $max )) && max=$p
    done
    echo $min ":" $max
}
nRange () {
    echo -n "$1 "
    shift
    range $*
}
usage:
nRange AAA 1 2 3 4 5 6 3 4 5 2 3
AAA 1 : 6
nX is short for named X: named mean, named variance, and so on. Note that I use integer arithmetic, which is all the shell supports natively. For floating-point arithmetic you would use bc, for instance. With integers you lose precision, which might be acceptable for large natural numbers.
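To make the precision loss concrete, here is a small sketch comparing shell integer division with a floating-point division. The helper names int_avg and float_avg are hypothetical, and awk stands in for bc here only on the assumption that it is more likely to be installed; bc with scale=2 would serve the same purpose, although it truncates rather than rounds.

```shell
# int_avg and float_avg are hypothetical helper names for illustration.
int_avg () {            # shell integer arithmetic: the fraction is dropped
    echo $(( $1 / $2 ))
}
float_avg () {          # awk performs the division in floating point
    awk -v s="$1" -v c="$2" 'BEGIN { printf "%.2f\n", s / c }'
}

int_avg 38 11           # prints 3
float_avg 38 11         # prints 3.45
```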
Process all 3 commands for an input line:
processLine () {
    nVariance $*
    nMean $*
    nRange $*
}
Read the data from a file, line by line:
# data:
# AAA 1 2 3 4 5 6 3 4 5 2 3
# BBB 3 2 3 34 56 1
# CCC 4 7 4 6 222 45
while read line
do
processLine $line
done < data
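A variation on the read loop, shown only as a sketch: read can split the label from the values itself, and -r keeps backslashes literal. The function name summarize is an assumption for illustration; it reads "LABEL v1 v2 ..." lines from standard input.

```shell
# summarize is a hypothetical name; it reads "LABEL v1 v2 ..." lines from stdin.
summarize () {
    local name values v sum count
    while read -r name values; do
        [ -z "$name" ] && continue        # skip blank lines
        sum=0
        count=0
        for v in $values; do              # unquoted on purpose: split on spaces
            sum=$((sum + v))
            count=$((count + 1))
        done
        echo "$name count=$count sum=$sum"
    done
}

printf '%s\n' 'AAA 1 2 3 4 5 6 3 4 5 2 3' 'BBB 3 2 3 34 56 1' 'CCC 4 7 4 6 222 45' | summarize
# AAA count=11 sum=38
# BBB count=6 sum=99
# CCC count=6 sum=288
```

With a data file, the same function can be called as summarize < data.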
Update:
Contrary to my expectation, it doesn't seem easy to handle an unknown number of arguments with functions in bc, for example min (3, 4, 5, 2, 6).
But if the inputs are integers, the calls to bc can be reduced to two places. I used a precision of 2 ("scale=2"); you may change this to suit your needs.
variance () {
    count=$1
    avg=$2
    shift
    shift
    sum=0
    for n in $*
    do
        diff="($avg-$n)"
        quad="($diff*$diff)"
        sum="($sum+$quad)"
    done
    # echo "$sum/$count"
    echo "scale=2;$sum/$count" | bc
}
nVariance () {
    echo -n "$1 "
    shift
    count=$#
    s=$(sum $*)
    avg=$(echo "scale=2;$s/$count" | bc)
    var=$(variance $count $avg $*)
    echo $var
}
The rest of the code can stay the same. Please verify that the formula for the variance is correct - I used what I had in mind:
For the values (1, 5, 9), I sum them (15) and divide by the count (3) => 5. Then I take the difference from the average for each value (-4, 0, 4), square each difference (16, 0, 16), sum the squares (32) and divide by the count (3) => 10.66
Is this correct, or do I need a square root somewhere ;) ?
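The arithmetic described above is the population variance (dividing by the count); no square root is needed, since taking the root would give the standard deviation instead. As a check, the following sketch recomputes the example with awk using the same two-pass method. The function name pop_variance is hypothetical, and printf rounds where bc with scale=2 truncates, so the last digit differs from 10.66.

```shell
# pop_variance is a hypothetical helper: two-pass population variance of its arguments.
pop_variance () {
    printf '%s\n' "$@" | awk '
        { x[NR] = $1; sum += $1 }                 # first pass: store values, sum them
        END {
            mean = sum / NR
            for (i = 1; i <= NR; i++) {           # second pass: squared differences
                d = x[i] - mean
                ss += d * d
            }
            var = ss / NR
            printf "mean=%.2f var=%.2f sd=%.2f\n", mean, var, sqrt(var)
        }'
}

pop_variance 1 5 9      # prints mean=5.00 var=10.67 sd=3.27
```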
Note that I had to correct the mean calculation. For 1, 5, 9, the mean is 5, not 1 - am I right? It now uses sort -n (numeric) and (len+1)/2.
Answered by pbot
There is a typo in the accepted answer that causes the variance to be miscalculated. In the print statement:
", var=" ((sum*sum) - sum2)/(NF-1)
should be:
", var=" (sum2 - ((sum*sum)/NF))/(NF-1)
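When the formula is applied to file.txt, where the first field on each line is a label, the number of values is NF - 1 rather than NF. The sketch below makes that assumption explicit and computes the sample variance (dividing by n - 1). The function name sample_var is hypothetical.

```shell
# sample_var is a hypothetical helper; it reads "LABEL v1 v2 ..." lines from stdin.
sample_var () {
    awk '{
        n = NF - 1                     # number of values, excluding the label
        sum = sum2 = 0
        for (i = 2; i <= NF; i++) {
            sum  += $i                 # running sum of values
            sum2 += $i * $i            # running sum of squares
        }
        if (n > 1)                     # sample variance needs at least two values
            printf "%s: var=%.4f\n", $1, (sum2 - (sum * sum) / n) / (n - 1)
    }'
}

printf '%s\n' 'AAA 1 2 3 4 5 6 3 4 5 2 3' 'BBB 3 2 3 34 56 1' 'CCC 4 7 4 6 222 45' | sample_var
# AAA: var=2.2727
# BBB: var=536.3000
# CCC: var=7520.4000
```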
Also, it is better to use something like Welford's algorithm to calculate variance; the algorithm in the accepted answer is unstable when the variance is small relative to the mean:
foo="1 2 3 4 5 6 3 4 5 2 3";
awk '{
  M = 0;
  S = 0;
  for (k=1; k <= NF; k++) {
    x = $k;
    oldM = M;
    M = M + ((x - M)/k);
    S = S + (x - M)*(x - oldM);
  }
  var = S/(NF - 1);
  print " var=" var;
}' <<< "$foo"
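The example above assumes every field is a value. For lines in the format of file.txt, with a label in the first field, a sketch of the same Welford update might start at field 2 and count values separately. The function name welford_var is hypothetical; the second call uses values with a large mean and a small spread, the case where the sum-of-squares formula tends to lose precision.

```shell
# welford_var is a hypothetical helper; it reads "LABEL v1 v2 ..." lines from stdin.
welford_var () {
    awk '{
        M = 0; S = 0
        for (k = 2; k <= NF; k++) {
            x = $k
            count = k - 1                  # values seen so far on this line
            oldM = M
            M = M + (x - M) / count        # updated running mean
            S = S + (x - M) * (x - oldM)   # running sum of squared deviations
        }
        if (NF > 2)                        # sample variance needs two or more values
            printf "%s: var=%.4f\n", $1, S / (NF - 2)
    }'
}

printf 'AAA 1 2 3 4 5 6 3 4 5 2 3\n' | welford_var              # AAA: var=2.2727
printf 'X 1000000001 1000000002 1000000003\n' | welford_var    # X: var=1.0000
```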

