Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/41413544/

Time: 2020-11-03 05:51:25  Source: igfitidea

Calculate percentile from a long array?

Tags: java, math, statistics, apache-commons, percentile

Asked by user7358693

Given a long array of latencies in milliseconds, I want to calculate percentiles from them. I have the method below, which does the work, but I am not sure how to verify that it gives an accurate result.

  public static long[] percentiles(long[] latencies, double... percentiles) {
    Arrays.sort(latencies, 0, latencies.length);
    long[] values = new long[percentiles.length];
    for (int i = 0; i < percentiles.length; i++) {
      int index = (int) (percentiles[i] * latencies.length);
      values[i] = latencies[index];
    }
    return values;
  }

I would like to get the 50th, 95th, 99th and 99.9th percentiles from the latencies array.

long[] percs = percentiles(latencies, 0.5, 0.95, 0.99, 0.999);

Is this the right way to get percentiles from a long array of latencies? I am working with Java 7.

Answered by user7358693

This is what you are looking for:

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        List<long> latencies = new List<long>() { 3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20 };

        Console.WriteLine(Percentile(latencies, 25));
        Console.WriteLine(Percentile(latencies, 50));
        Console.WriteLine(Percentile(latencies, 75));
        Console.WriteLine(Percentile(latencies, 100));

        Console.ReadLine();
    }

    // Nearest-rank percentile; percentile is expected in the range (0, 100].
    public static long Percentile(List<long> latencies, double percentile)
    {
        latencies.Sort();
        // 1-based ordinal rank: ceiling(P / 100 * N)
        int index = (int)Math.Ceiling(percentile / 100.0 * latencies.Count);
        // Convert the 1-based rank to a 0-based list index
        return latencies[index - 1];
    }
}
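
Since the question targets Java 7, the same nearest-rank approach might look like the sketch below. The class name PercentileDemo and the method name percentile are illustrative and not taken from the answer. It assumes a non-empty list and a percent value in the range (0, 100], and unlike the snippet above it sorts a copy rather than the caller's list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative Java 7 port of the snippet above; all names are assumptions.
class PercentileDemo {
    // Nearest-rank percentile; assumes a non-empty list and 0 < percent <= 100.
    static long percentile(List<Long> latencies, double percent) {
        List<Long> sorted = new ArrayList<Long>(latencies); // copy, leave the input as-is
        Collections.sort(sorted);
        // 1-based ordinal rank: ceiling(P / 100 * N)
        int rank = (int) Math.ceil(percent / 100.0 * sorted.size());
        return sorted.get(rank - 1); // shift to a 0-based index
    }

    public static void main(String[] args) {
        List<Long> latencies = Arrays.asList(3L, 6L, 7L, 8L, 8L, 9L, 10L, 13L, 15L, 16L, 20L);
        System.out.println(percentile(latencies, 25));
        System.out.println(percentile(latencies, 50));
        System.out.println(percentile(latencies, 75));
        System.out.println(percentile(latencies, 100));
    }
}
```

For this sample list the four calls should select the 3rd, 6th, 9th and 11th sorted values.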

Answered by ajb

According to Wikipedia, there is no standard definition of percentile; however, they give a few possible definitions. The code you've posted appears to be closest to the Nearest Rank Method, but it's not quite the same.

The formula they give is

n = ceiling((P / 100) x N)

where N is the length of the list, P is the percentile, and n will be the ordinal rank. You've already done the division by 100. Looking at the examples they give, it's clear that the "ordinal rank" is the index in the list, but it's 1-based. Thus, to get an index into a Java array, you have to subtract 1. Therefore, the correct formula should be

n = ceiling(percentile * N) - 1

Using the variables in your code, the Java equivalent would be

(int) Math.ceil(percentiles[i] * latencies.length) - 1

This is not quite the code you've written. When you cast a double to an int, the result is rounded toward 0, which for non-negative values is the equivalent of the "floor" function. So your code computes

floor(percentiles[i] * latencies.length)

If percentiles[i] * latencies.length is not an integer, the result is the same either way. However, if it is an integer, so that "floor" and "ceiling" are the same value, then the ceiling-based index is one less than the floor-based one and the result will be different.
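
To make the two cases concrete, here is a small sketch that computes both indexes for a hypothetical array length of 10. The class and method names are made up for illustration; only the two expressions come from the discussion above.

```java
// Illustrative comparison; the class and method names are assumptions.
class FloorVsCeil {
    // Index produced by the cast in the question (truncates toward zero).
    static int castIndex(double fraction, int n) {
        return (int) (fraction * n);
    }

    // Zero-based index from the nearest-rank formula.
    static int nearestRankIndex(double fraction, int n) {
        return (int) Math.ceil(fraction * n) - 1;
    }

    public static void main(String[] args) {
        int n = 10;
        // 0.25 * 10 = 2.5 is not an integer: both expressions give index 2
        System.out.println(castIndex(0.25, n) + " " + nearestRankIndex(0.25, n));
        // 0.5 * 10 = 5.0 is an integer: the cast gives 5, the formula gives 4
        System.out.println(castIndex(0.5, n) + " " + nearestRankIndex(0.5, n));
    }
}
```

One caveat worth keeping in mind: the product is a floating-point value, so rounding noise could in principle push it slightly above or below an integer and change which case applies.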

An example from Wikipedia is to compute the 40th percentile when the list is {15, 20, 35, 40, 50}. Their answer is to find the second item in the list, i.e. 20, because 0.40 * 5 = 2.0, and ceiling(2.0) = 2.0.
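
As a rough way to verify a result against this definition, one could count how many values fall below the candidate. The sketch below assumes the nearest-rank reading that the percentile is the value for which fewer than P percent of the data are strictly smaller and at least P percent are less than or equal to it. The class and method names are hypothetical, and the fraction is assumed to lie in (0, 1].

```java
// Hypothetical checker; the class and method names are illustrative.
class NearestRankCheck {
    // True if candidate satisfies the nearest-rank conditions for fraction p (0 < p <= 1).
    static boolean isNearestRank(long[] data, double p, long candidate) {
        int below = 0;      // values strictly less than candidate
        int atOrBelow = 0;  // values less than or equal to candidate
        for (long x : data) {
            if (x < candidate) below++;
            if (x <= candidate) atOrBelow++;
        }
        double threshold = p * data.length;
        return below < threshold && atOrBelow >= threshold;
    }

    public static void main(String[] args) {
        long[] data = {15, 20, 35, 40, 50};
        System.out.println(isNearestRank(data, 0.40, 20)); // second item: true
        System.out.println(isNearestRank(data, 0.40, 35)); // third item: false
    }
}
```

The check does not need the array to be sorted, so it can serve as an independent test of whatever a sorting-based method returns.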

However, your code:

int index = (int) (percentiles[i] * latencies.length);

will result in index being 2, which isn't what you want, because that will give you the third item in the list (35) instead of the second (20).

So in order to match the Wikipedia definition, your computation of the index will need to be modified a little. (On the other hand, I wouldn't be surprised if someone comes along and says your computation is correct and Wikipedia is wrong. We'll see...)
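
One way the modification could look is sketched here. It keeps the shape of the method from the question but swaps in the ceiling-based index. The class name NearestRank, the clamping of the index, and the choice to sort a copy are assumptions added for illustration, and the array is assumed to be non-empty.

```java
import java.util.Arrays;

// Hypothetical helper class; the name NearestRank is not from the original post.
class NearestRank {
    // Returns nearest-rank percentile values for fractions such as 0.5 or 0.999.
    static long[] percentiles(long[] latencies, double... fractions) {
        long[] sorted = latencies.clone(); // sort a copy so the caller's array is untouched
        Arrays.sort(sorted);
        long[] values = new long[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            // 1-based ordinal rank from the formula, shifted to a 0-based index
            int index = (int) Math.ceil(fractions[i] * sorted.length) - 1;
            // Assumption: clamp so a fraction of 0 or slightly above 1 stays in bounds
            index = Math.max(0, Math.min(index, sorted.length - 1));
            values[i] = sorted[index];
        }
        return values;
    }

    public static void main(String[] args) {
        long[] latencies = {15, 20, 35, 40, 50};
        // The 40th percentile picks the second item, as in the Wikipedia example
        System.out.println(Arrays.toString(percentiles(latencies, 0.4, 0.5, 1.0)));
    }
}
```

A call such as percentiles(latencies, 0.5, 0.95, 0.99, 0.999) from the question would work unchanged with this signature.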
