Original question: http://stackoverflow.com/questions/1057027/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Calculating percentile rank in MySQL
Asked by lhahne
I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. Sure, I could just brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.
Accepted answer by TheJacobTaylor
This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.
One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.
Create an index on your number. total = select count(*); less_equal = select count(*) where value <= indexed_number;
The percentage would be something like: less_equal / total or (total - less_equal)/total
Make sure that both of them are using the index that you created. If they are not, tweak them until they are. The EXPLAIN output should show "Using index" in the right-hand column. In the case of the select count(*), it should be using the index for InnoDB and something like const for MyISAM. MyISAM knows this value at any time without having to calculate it.
If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.
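The counting idea above can be sketched outside the database. This is only a minimal Python illustration, not the answer's SQL: a sorted list stands in for the index, and the helper name `percent_at_or_below` and the sample values are made up for the example.

```python
from bisect import bisect_right


def percent_at_or_below(sorted_values, x):
    """Fraction of values <= x; sorted_values must be sorted ascending."""
    total = len(sorted_values)                    # plays the role of SELECT COUNT(*)
    if total == 0:
        raise ValueError("no values")
    less_equal = bisect_right(sorted_values, x)   # COUNT(*) ... WHERE value <= x
    return less_equal / total


values = sorted([12, 7, 7, 30, 18])               # hypothetical measurements
for v in values:
    print(v, percent_at_or_below(values, v))
```

Because the list is sorted, each lookup is a binary search rather than a scan, which is the same reason the answer insists that both queries use the index.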
Does this help?
Jacob
Answer by mattstuehler
Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)
In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):
SELECT
    id,
    @prev := @curr AS prev,
    @curr := measure AS curr,
    -- rank = number of rows with a strictly greater measure
    -- (`rank` is quoted because RANK is a reserved word in MySQL 8)
    @rank := IF(@prev > @curr, @rank + @ties, @rank) AS `rank`,
    -- ties = length of the current run of equal measures
    @ties := IF(@prev = @curr, @ties + 1, 1) AS ties,
    (1 - @rank / @total) AS percentrank
FROM
    mytable,
    (SELECT
        @curr := NULL,
        @prev := NULL,
        @rank := 0,
        @ties := 1,
        @total := COUNT(*)
     FROM mytable
     WHERE measure IS NOT NULL
    ) b
WHERE
    measure IS NOT NULL
ORDER BY
    measure DESC
Credit for this method goes to Shlomi Noach. He writes about it in detail here:
http://code.openark.org/blog/mysql/sql-ranking-without-self-join
I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
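To make the bookkeeping in that query easier to follow, here is a rough Python sketch of the same single pass. It is an illustration only, not Shlomi Noach's code: the function name `percent_ranks_desc` and the sample data are assumptions, and the variables mirror @prev, @curr, @rank, @ties and @total.

```python
def percent_ranks_desc(measures):
    """One pass over the non-null values in descending order, tracking rank and ties."""
    ordered = sorted((m for m in measures if m is not None), reverse=True)
    total = len(ordered)
    prev = curr = None
    rank, ties = 0, 1
    result = []
    for m in ordered:
        prev, curr = curr, m
        if prev is not None and prev > curr:
            rank += ties                      # step past the whole group of tied rows
        ties = ties + 1 if prev == curr else 1
        result.append((m, 1 - rank / total))  # same formula as percentrank in the query
    return result


print(percent_ranks_desc([40, 10, None, 50, 40]))
# prints [(50, 1.0), (40, 0.75), (40, 0.75), (10, 0.25)]
```

Tied values share a percent rank, and the rank only jumps by the size of the tie group once a strictly smaller value appears.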
Answer by Conor
SELECT
    c.id,
    c.score,
    ROUND(((@rank - `rank`) / @rank) * 100, 2) AS percentile_rank
FROM
    (SELECT
        *,
        @prev := @curr,
        @curr := a.score,
        -- dense rank; `rank` is quoted because RANK is a reserved word in MySQL 8
        @rank := IF(@prev = @curr, @rank, @rank + 1) AS `rank`
     FROM
        (SELECT id, score FROM mytable) AS a,
        (SELECT @curr := NULL, @prev := NULL, @rank := 0) AS b
     ORDER BY score DESC) AS c;
Answer by Nir Levy
There is no easy way to do this. See http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html
Answer by Pete
If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down the excess block times of flights into an airport by percentile. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting):
// $con is assumed to be an open mysqli connection created elsewhere.
$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort = 'JFK';
// Same filter for both queries, so the offsets match the counted rows.
$where = "depdt >= '$startDt' AND depdt <= '$endDt' AND ArrPort = '$arrPort'";
$strSQL = "SELECT COUNT(*) AS TotFlights FROM FIDS WHERE $where";
if (!($queryResult = mysqli_query($con, $strSQL))) {
    echo $strSQL . " FAILED\n";
    echo mysqli_error($con);
    exit(1);
}
$totFlights = 0;
while ($fltRow = mysqli_fetch_assoc($queryResult)) {
    echo "Total Flights into " . $arrPort . " = " . $fltRow['TotFlights'] . "\n";
    $totFlights = (int) $fltRow['TotFlights'];
    /* 1906 flights. Percentile 90 = int(0.9 * 1906). */
    for ($x = 1; $x <= 10; $x++) {
        // Row offset of this percentile in the descending list.
        $pctlPosn = $totFlights - intval(($x / 10) * $totFlights);
        echo "PCTL POSN for " . ($x * 10) . " IS " . $pctlPosn . "\t";
        $pctlSQL = "SELECT (ablk - sblk) AS ExcessBlk FROM FIDS WHERE $where"
                 . " ORDER BY ExcessBlk DESC LIMIT " . $pctlPosn . ",1";
        if (!($query2Result = mysqli_query($con, $pctlSQL))) {
            echo $pctlSQL . " FAILED\n";
            echo mysqli_error($con);
            exit(1);
        }
        while ($pctlRow = mysqli_fetch_assoc($query2Result)) {
            echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
        }
    }
}
Answer by Lukas Eder
MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:
SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col
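For readers who want to check what that query returns, PERCENT_RANK() is defined as (rank - 1) / (rows - 1), where rank is the ordinary RANK() with gaps after ties. The Python sketch below reproduces that definition; the function name and the sample values are illustrative assumptions, not part of the answer.

```python
def percent_rank(values):
    """Return (value, percent_rank) pairs in ascending order; ties share a rank."""
    ordered = sorted(values)
    n = len(ordered)
    out = []
    rank = 0
    for i, v in enumerate(ordered, start=1):
        if i == 1 or v != ordered[i - 2]:
            rank = i                          # RANK(): position of the first row of a tie group
        out.append((v, (rank - 1) / (n - 1) if n > 1 else 0.0))
    return out


print(percent_rank([10, 20, 20, 40, 50]))
# prints [(10, 0.0), (20, 0.25), (20, 0.25), (40, 0.75), (50, 1.0)]
```

Note that the lowest value always gets 0 and the highest gets 1, which differs from the "fraction of rows at or below" convention used in some of the other answers.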
Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RDBMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, you can emulate PERCENTILE_DISC, however, again using the PERCENT_RANK and FIRST_VALUE window functions.
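As a reference point for any emulation, PERCENTILE_DISC(p) returns the first value, in ascending order, whose cumulative position in the ordering reaches the fraction p. The sketch below implements that definition directly in Python. It is not the window-function emulation the answer alludes to, and the function name and sample data are assumptions made for illustration.

```python
def percentile_disc(values, p):
    """First value in ascending order whose cumulative fraction of rows is >= p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no values")
    for i, v in enumerate(ordered, start=1):
        if i / n >= p:                        # i / n = fraction of rows at or before position i
            return v
    return ordered[-1]


data = [10, 20, 30, 40, 50]
print(percentile_disc(data, 0.5))             # prints 30
print(percentile_disc(data, 0.9))             # prints 50
```

Unlike PERCENTILE_CONT, this never interpolates: the result is always one of the input values.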
Answer by Ishaan Kulshrestha
Suppose we have a sales table like:
user_id, units
Then the following query gives the percentile of each user:
select a.user_id, a.units,
       (sum(case when a.units >= b.units then 1 else 0 end) * 100) / count(1) as percentile
from sales a
cross join sales b
group by a.user_id, a.units;
Note that this performs a cross join, so it has O(n²) complexity and should be considered an unoptimized solution. It is simple, though, given that MySQL versions before 8 have no built-in function for this.
Answer by Ishaan Kulshrestha
To get the rank, I'd say you need to left outer join the table to itself, something like:
select t1.name, t1.value, count(distinct t2.value) as lower_values
from mytable t1
left join mytable t2
  on t1.value > t2.value
group by t1.name, t1.value
For each row, you will count how many (if any) rows of the same table have an inferior value.
Note that I'm more familiar with SQL Server, so the syntax might not be right. Also, the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then, to get the real percentile rank, you will need to first store the number of values in a variable (or the number of distinct values, depending on the convention you want to follow) and compute the percentile rank from the rank given above.