Original URL: http://stackoverflow.com/questions/27883999/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
How to calculate percentile in Postgres
Asked by lampdev
I have a table called timings where we are storing 1 million response timings for load testing. Now we need to divide this data into 100 groups (i.e. the first 500 records as one group, and so on) and calculate the percentile of each group, rather than the average.
So far I have tried this query:
    SELECT quartile
         , avg(data)
         , max(data)
    FROM (
        SELECT data
             , ntile(500) OVER (ORDER BY data) AS quartile
        FROM data
    ) x
    GROUP BY quartile
    ORDER BY quartile
But how do I find the percentile?
Answered by Paul A Jungwirth
Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:
    =# select v,
              ntile(100) OVER (ORDER BY v),
              cume_dist() OVER (ORDER BY v)
       FROM (VALUES (1), (2), (4), (4)) x(v);
     v | ntile | cume_dist
    ---+-------+-----------
     1 |     1 |      0.25
     2 |     2 |       0.5
     4 |     3 |         1
     4 |     4 |         1
You can see that ntile only uses the first 4 out of 100 buckets, whereas cume_dist always gives you a number from 0 to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
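As a rough sketch of that cume_dist approach, the 99th percentile could be pulled out as shown here. The timings CTE and its response_ms column are made-up sample data generated with generate_series, not the asker's actual schema:

```sql
-- Hypothetical sample data: 1,000 response times (1..1000 ms).
-- Swap this CTE for your real table and column.
WITH timings(response_ms) AS (
    SELECT g FROM generate_series(1, 1000) AS g
),
ranked AS (
    SELECT response_ms,
           cume_dist() OVER (ORDER BY response_ms) AS cd
    FROM timings
)
-- Discard everything below the 0.99 mark and keep the smallest survivor.
SELECT min(response_ms) AS p99
FROM ranked
WHERE cd >= 0.99;
```

Applied to the four-row example (1, 2, 4, 4), the same filter keeps only the two rows where v = 4, so the result would be 4.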
If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
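For illustration, this is roughly what those ordered-set aggregates look like in use on 9.4+. The timings CTE, its id and response_ms columns, and the generated values are all assumptions standing in for real data. The second query is one possible reading of the question (one percentile per group of rows, grouped by insertion order), not necessarily what the asker intended:

```sql
-- Whole-table percentile. percentile_cont interpolates between
-- neighbouring values; percentile_disc returns a value that
-- actually occurs in the input.
WITH timings(id, response_ms) AS (
    SELECT g, (g * 37) % 1000 FROM generate_series(1, 1000) AS g
)
SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY response_ms) AS p99_interpolated,
       percentile_disc(0.99) WITHIN GROUP (ORDER BY response_ms) AS p99_observed
FROM timings;

-- One percentile per group: split rows into 100 groups by the
-- hypothetical id column, then aggregate within each group.
WITH timings(id, response_ms) AS (
    SELECT g, (g * 37) % 1000 FROM generate_series(1, 1000) AS g
)
SELECT grp,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95
FROM (
    SELECT response_ms,
           ntile(100) OVER (ORDER BY id) AS grp
    FROM timings
) bucketed
GROUP BY grp
ORDER BY grp;
```

Here ntile is only used to form the groups; the percentile itself comes from the aggregate, so the small-input quirk described earlier does not affect the result.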
Answered by khampson
Edit:
Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this: see percentile_disc and percentile_cont. These were introduced in 9.4.
Original Answer:
ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).
ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.
For percentile, you would set the number of buckets to be 100.
I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and the groups would have more than 500 entries.
If you don't care about avg or max, you can drop a bunch from your query. So it would look something like this:
    SELECT data, ntile(100) OVER (ORDER BY data) AS percentile
    FROM data
    ORDER BY data
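To see how large the buckets actually get, a quick check along these lines may help. It uses generate_series as stand-in data for the million timings (an assumption, not the asker's table) and counts the rows that land in each ntile(100) bucket:

```sql
-- Stand-in data: one million integer "timings" from generate_series.
SELECT percentile,
       count(*)  AS rows_in_bucket,
       min(data) AS lowest,
       max(data) AS highest
FROM (
    SELECT data,
           ntile(100) OVER (ORDER BY data) AS percentile
    FROM generate_series(1, 1000000) AS t(data)
) x
GROUP BY percentile
ORDER BY percentile;
```

With a million rows, each of the 100 buckets holds 10,000 rows, and the lowest and highest columns show the value range that each percentile bucket covers.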