Calculating percentile rank in MySQL

Question

Asked by lhahne

I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each and every one of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. Sure, I could just brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.

Answer 1

Accepted answer by TheJacobTaylor

This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.

One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can count either the greater or the lesser rows and take the complement (one minus the fraction) as necessary.

Create an index on your number column. total = select count(*); less_equal = select count(*) where value <= indexed_number;

The percentage would be something like: less_equal / total or (total - less_equal)/total

Make sure that both queries are using the index that you created. If they are not, tweak them until they are. The EXPLAIN output should show "Using index" in the right-hand column. In the case of the select count(*) it should be using the index for InnoDB and something like const for MyISAM. MyISAM will know this value at any time without having to calculate it.

If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.
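As a rough illustration of the two counts described above, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the table name `measurements`, the column `value`, the index name, and the sample data are all assumptions made up for the example. The correlated subquery plays the role of the inner select, and the total is computed once and passed in as a constant.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table, column, and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.execute("CREATE INDEX idx_measurements_value ON measurements (value)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(v,) for v in (10, 20, 20, 30, 40)],
)

# First query: the total row count, computed once and reused as a constant.
total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]

# Second query as an inner select: rows whose value is <= the current row's value.
percent_ranks = conn.execute(
    """
    SELECT m.id,
           m.value,
           (SELECT COUNT(*) FROM measurements AS x WHERE x.value <= m.value) * 1.0 / ?
               AS fraction_less_equal
    FROM measurements AS m
    ORDER BY m.id
    """,
    (total,),
).fetchall()

for row in percent_ranks:
    print(row)
```

On a large table the inner count runs once per row, so the index on the value column matters a great deal; whether this is fast enough for 200,000 rows would need to be measured.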

Does this help?

Jacob

Answer 2

Answer by mattstuehler

Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)

In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):

SELECT
    id,
    @prev := @curr as prev,
    @curr := measure as curr,
    @rank := IF(@prev > @curr, @rank+@ties, @rank) AS `rank`,  -- quoted: RANK is a reserved word in MySQL 8.0
    @ties := IF(@prev = @curr, @ties+1, 1) AS ties,
    (1-@rank/@total) as percentrank
FROM
    mytable,
    (SELECT
        @curr := null,
        @prev := null,
        @rank := 0,
        @ties := 1,
        @total := count(*) from mytable where measure is not null
    ) b
WHERE
    measure is not null
ORDER BY
    measure DESC

Credit for this method goes to Shlomi Noach. He writes about it in detail here:

http://code.openark.org/blog/mysql/sql-ranking-without-self-join

I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
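To make the variable bookkeeping in the query above easier to follow, here is a small Python sketch of the same single pass over a descending sort. The function name `percent_rank_desc` and the sample data are hypothetical; it only mirrors the @prev/@curr/@rank/@ties logic and is not a substitute for running the query.

```python
def percent_rank_desc(measures):
    """Mirror the @prev/@curr/@rank/@ties bookkeeping over a descending sort."""
    # WHERE measure IS NOT NULL ... ORDER BY measure DESC
    values = sorted((m for m in measures if m is not None), reverse=True)
    total = len(values)
    rank, ties, curr = 0, 1, None
    result = []
    for measure in values:
        prev, curr = curr, measure
        if prev is not None and prev > curr:
            rank += ties  # step past the whole block of tied rows
        ties = ties + 1 if prev == curr else 1
        result.append((measure, rank, 1 - rank / total))
    return result


for row in percent_rank_desc([10, 20, 20, 30, 40, None]):
    print(row)
```

Here rank ends up as the number of rows with a strictly greater value, so 1 - rank/total is the fraction of rows less than or equal to the current one. Note that MySQL's documentation describes the evaluation order of user-variable assignments within a single statement as undefined, so the query depends on behavior that happens to hold rather than on a guarantee.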

Answer 3

Answer by Conor

SELECT
    c.id, c.score, ROUND(((@rank - c.`rank`) / @rank) * 100, 2) AS percentile_rank
FROM
    (SELECT
        *,
        @prev := @curr,
        @curr := a.score,
        -- Dense rank over descending scores; quoted because RANK is reserved in MySQL 8.0
        @rank := IF(@prev = @curr, @rank, @rank + 1) AS `rank`
    FROM
        (SELECT id, score FROM mytable) AS a,
        (SELECT @curr := null, @prev := null, @rank := 0) AS b
    ORDER BY score DESC) AS c;

Answer 4

Answer by Nir Levy

There is no easy way to do this. See http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html

Answer 5

Answer by Pete

If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down the excess block times of flights into an airport into percentiles. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting):

// Assumes $con is an already open mysqli connection (the old mysql_* functions were removed in PHP 7).
$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort = 'JFK';

// Count the flights in the date range; percentile positions are derived from this total.
$strSQL = "SELECT COUNT(*) AS TotFlights FROM FIDS WHERE depdt >= '$startDt' AND depdt <= '$endDt' AND ArrPort='$arrPort'";
if (!($queryResult = mysqli_query($con, $strSQL))) {
    echo $strSQL . " FAILED\n"; echo mysqli_error($con);
    exit(0);
}
$totFlights = 0;
while ($fltRow = mysqli_fetch_array($queryResult)) {
    echo "Total Flights into " . $arrPort . " = " . $fltRow['TotFlights'] . "\n";
    $totFlights = $fltRow['TotFlights'];

    /* 1906 flights. Percentile 90 = int(0.9 * 1906). */
    for ($x = 1; $x <= 10; $x++) {
        // Row offset of the (10 * $x)th percentile in the descending sort.
        $pctlPosn = $totFlights - intval(($x / 10) * $totFlights);
        echo "PCTL POSN for " . $x * 10 . " IS " . $pctlPosn . "\t";
        // Same filters as the count query, so the offset refers to the same set of rows.
        $pctlSQL = "SELECT (ablk-sblk) AS ExcessBlk FROM FIDS WHERE depdt >= '$startDt' AND depdt <= '$endDt' AND ArrPort='$arrPort' ORDER BY ExcessBlk DESC LIMIT " . $pctlPosn . ",1;";
        if (!($query2Result = mysqli_query($con, $pctlSQL))) {
            echo $pctlSQL . " FAILED\n";
            echo mysqli_error($con);
            exit(0);
        }
        while ($pctlRow = mysqli_fetch_array($query2Result)) {
            echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
        }
    }
}
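The offset arithmetic behind the LIMIT x,y trick can be checked without a database. The following Python sketch is only an illustration under assumed data: `decile_values` is a hypothetical helper, and the list of block times is invented. It sorts descending, as the ORDER BY does, and picks the row at the computed offset, as LIMIT offset,1 does.

```python
def decile_values(values):
    """Return {percentile: value} for the 10th through 100th percentiles."""
    ordered = sorted(values, reverse=True)  # ORDER BY ... DESC
    total = len(ordered)
    result = {}
    for x in range(1, 11):
        # Integer form of total - intval((x / 10) * total), avoiding float rounding.
        offset = total - (x * total) // 10
        if offset < total:  # LIMIT offset,1 returns no row past the end
            result[x * 10] = ordered[offset]
    return result


excess_block = list(range(1, 21))  # hypothetical sample: 1 to 20 minutes
print(decile_values(excess_block))
```

Unlike the loop above, which issues one query per percentile, this reads all values once; for a very large table the per-percentile LIMIT queries may still be the more practical choice.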

Answer 6

Answer by Lukas Eder

MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:

SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col

Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RDBMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, however, you can emulate PERCENTILE_DISC, again using the PERCENT_RANK and FIRST_VALUE window functions.
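For readers who want to see what PERCENT_RANK() computes, it is defined as (rank - 1) / (rows - 1), where tied values share the lowest rank of their group. The Python sketch below reproduces that formula on made-up data; the function name `percent_rank` is hypothetical and the code does not touch MySQL at all.

```python
from bisect import bisect_left


def percent_rank(values):
    """Compute (rank - 1) / (rows - 1) for each value, in input order."""
    ordered = sorted(values)
    n = len(ordered)
    if n <= 1:
        return [0.0 for _ in values]  # a single row has percent rank 0
    # bisect_left counts the values strictly smaller than v, which equals rank - 1.
    return [bisect_left(ordered, v) / (n - 1) for v in values]


print(percent_rank([10, 20, 20, 30, 40]))
```

The result differs from the "rows less than or equal, divided by total" convention used in some of the other answers: here the smallest value always gets 0 and the largest always gets 1.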

Answer 7

Answer by Ishaan Kulshrestha

Suppose we have a sales table like:

user_id,units

Then the following query will give the percentile of each user:

select a.user_id, a.units,
(sum(case when a.units >= b.units then 1 else 0 end)*100)/count(1) as percentile
from sales a join sales b
group by a.user_id, a.units;

Note that this performs a cross join, which results in O(n^2) complexity, so it can be considered an unoptimized solution, but it is simple given that MySQL versions without window functions have no built-in function for this.

Answer 8

Answer by Ishaan Kulshrestha

To get the rank, I'd say you need to (left) outer join the table to itself, something like:

-- mytable is a placeholder name; COUNT ignores the NULLs produced by unmatched left-join rows
select t1.name, t1.value, count(distinct t2.value)
from mytable t1
left join mytable t2
on t1.value > t2.value
group by t1.name, t1.value

For each row, you will count how many (if any) rows of the same table have an inferior value.

Note that I'm more familiar with SQL Server, so the syntax might not be right. Also, the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.

Then, to get the real percentile rank, you will need to first store the number of values in a variable (or the number of distinct values, depending on the convention you want to follow) and compute the percentile rank using the rank given above.
