Median of a column with bash and awk

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/6166375/

Time: 2020-09-09 20:34:26  Source: igfitidea

median of column with awk

bash sed awk median

Asked by Nick

How can I use AWK to compute the median of a column of numerical data?

I can think of a simple algorithm but I can't seem to program it:

What I have so far is:

sort | awk 'END{print NR}' 

And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, then I round up to the nearest integer and that is the median; otherwise I take the average of (NR/2)+1 and (NR/2)-1.

Answered by maxschlepzig

With awk you have to store the values in an array and compute the median at the end. Assuming we look at the first column:

sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'

Sure, for real median computation do the rounding as described in the question:

sort -n file | awk ' { a[i++]=$1; }
    END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
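
The same idea can be extended to an arbitrary column. The following is only a sketch, not part of the original answer: the function name `median_of_column` and its arguments are made up for illustration, and it assumes whitespace-separated fields with a numeric value in the chosen column on every line.

```shell
# median_of_column FILE [COL]
# Hypothetical helper (name and arguments are assumptions for this sketch).
# COL defaults to 1. Assumes every line has a numeric value in that column.
median_of_column() {
    file=$1
    col=${2:-1}
    # 1) pull out the column, 2) sort it numerically, 3) pick the middle
    awk -v c="$col" '{ print $c }' "$file" |
        sort -n |
        awk '{ a[NR] = $1 }
             END {
                 if (NR == 0) exit                      # no input: print nothing
                 if (NR % 2) print a[(NR + 1) / 2]      # odd count: middle row
                 else print (a[NR / 2] + a[NR / 2 + 1]) / 2   # even: mean of the two middle rows
             }'
}
```

It would be called as `median_of_column data_file 2` to get the median of the second column.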

Answered by Johnsyweb

This awk program assumes one column of numerically sorted data:

#!/usr/bin/awk -f
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        print count[(NR + 1) / 2];
    } else {
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}

Sample usage:

sort -n data_file | awk -f median.awk
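
To see both branches of the script at work, here is a small check on made-up sample data. It is a sketch only: the temporary file stands in for median.awk, the input values are arbitrary, and the extra `else if (NR)` guard for empty input is an addition that is not in the script above.

```shell
# Write a copy of the script to a temporary file (the path is arbitrary).
script=$(mktemp)
cat > "$script" <<'EOF'
{ count[NR] = $1 }
END {
    if (NR % 2) print count[(NR + 1) / 2]
    else if (NR) print (count[NR / 2] + count[NR / 2 + 1]) / 2.0
}
EOF

# Three values: the middle one of the sorted input is printed.
odd=$(printf '%s\n' 9 1 5 | sort -n | awk -f "$script")
# Four values: the mean of the two middle values is printed.
even=$(printf '%s\n' 9 1 5 2 | sort -n | awk -f "$script")

echo "odd: $odd, even: $even"   # prints "odd: 5, even: 3.5"
rm -f "$script"
```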

Answered by Vinicius Placco

OK, just saw this topic and thought I could add my two cents, since I looked for something similar in the past. Even though the title says awk, all the answers make use of sort as well. Calculating the median for a column of data can be easily accomplished with datamash:

> seq 10 | datamash median 1
5.5

Note that sort is not needed, even if you have an unsorted column:

> seq 10 | gshuf | datamash median 1
5.5

The documentation gives all the functions it can perform, and good examples as well for files with many columns. Anyway, it has nothing to do with awk, but I think datamash is of great help in cases like this, and could also be used in conjunction with awk. Hope it helps somebody!
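
If neither sort nor datamash is wanted, the sorting can also be done inside awk itself. The following is a rough sketch rather than anything from the answers on this page: `awk_median` is a made-up name, it reads the first column from standard input, and it uses a plain insertion sort, so it is only reasonable for small inputs.

```shell
# awk_median: hypothetical helper that needs no external sort.
# Reads numbers (first column) from stdin and prints their median.
awk_median() {
    awk '
    { v[NR] = $1 + 0 }                  # "+ 0" forces a numeric value
    END {
        if (NR == 0) exit               # no input: print nothing
        # Insertion sort, O(n^2): acceptable for small inputs only.
        for (i = 2; i <= NR; i++) {
            x = v[i]
            for (j = i - 1; j >= 1 && v[j] > x; j--)
                v[j + 1] = v[j]
            v[j + 1] = x
        }
        if (NR % 2) print v[(NR + 1) / 2]
        else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

seq 10 | awk_median    # prints 5.5, same as the datamash example
```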

Answered by Brad Parks

This AWK-based answer to a similar question on unix.stackexchange.com gives the same results as Excel for calculating the median.

Answered by arenaq

If the numbers are in a bash array, the median can be computed like this (using a one-line version of Johnsyweb's solution):

array=(5 6 4 2 7 9 3 1 8) # numbers 1-9, unsorted
IFS=$'\n' # make "${array[*]}" expand to one number per line
median=$(sort -n <<< "${array[*]}" | awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}')
unset IFS
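
A possible variation that avoids touching IFS is to let printf emit one value per line. This is only a sketch under that assumption, and the function name `median_of` is invented for the example:

```shell
# median_of N1 N2 ...: hypothetical helper, prints the median of its arguments.
median_of() {
    [ "$#" -gt 0 ] || return 1          # nothing to do without values
    # printf repeats its format for every argument: one number per line.
    printf '%s\n' "$@" |
        sort -n |
        awk '{ arr[NR] = $1 }
             END {
                 if (NR % 2 == 1) print arr[(NR + 1) / 2]
                 else print (arr[NR / 2] + arr[NR / 2 + 1]) / 2
             }'
}

median=$(median_of 5 6 4 2 7 9 3 1 8)
echo "$median"    # prints 5
```

With a bash array it would be called as `median_of "${array[@]}"`.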