Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1291152/

Time: 2020-08-31 13:54:03. Source: igfitidea.

Simple way to calculate median with MySQL

Tags: sql, mysql, statistics, median

Asked by davr

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP, doing a sort, and then picking the middle row, but surely there must be some simple way of doing it in a single MySQL query.

Example data:

id | val
--------
 1    4
 2    7
 3    2
 4    2
 5    9
 6    8
 7    3

Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val), which is 5.
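
For reference, the expected result can be checked outside the database. This is only a sketch in Python (not part of the original question); the list literal copies the val column from the example table.

```python
from statistics import mean, median

# The val column from the example table above.
vals = [4, 7, 2, 2, 9, 8, 3]

print(sorted(vals))  # the values in sorted order
print(median(vals))  # middle value of the sorted list
print(mean(vals))    # arithmetic mean, which is what AVG(val) computes
```

Any SQL solution to the question should agree with median(vals) here, not with mean(vals).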

Answered by velcrow

In MariaDB / MySQL:

SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
  FROM data d, (SELECT @rownum:=0) r
  WHERE d.val is NOT NULL
  -- put some where clause here
  ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );

Steve Cohen points out that after the first pass, @rownum will contain the total number of rows. This can be used to determine the median, so no second pass or join is needed.

Also, AVG(dd.val) and dd.row_number IN (...) are used to correctly produce a median when there is an even number of records. Reasoning:

SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3
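
The same arithmetic can be traced outside MySQL. The Python sketch below is an illustration only: the helper names middle_rows and median_by_rows are made up here, and Python's integer division stands in for FLOOR.

```python
def middle_rows(total_rows):
    """1-based row numbers selected by the IN (...) list in the query above."""
    lower = (total_rows + 1) // 2  # FLOOR((@total_rows+1)/2)
    upper = (total_rows + 2) // 2  # FLOOR((@total_rows+2)/2)
    return lower, upper


def median_by_rows(values):
    """Average the value(s) found at the selected rows of the sorted list."""
    ordered = sorted(values)
    lower, upper = middle_rows(len(ordered))
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


for n in range(1, 7):
    print(n, middle_rows(n))
```

For an odd row count both positions coincide, so the average is just the middle value; for an even count they are the two middle rows.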

Finally, MariaDB 10.3.3+ contains a MEDIAN function.

Answered by TheJacobTaylor

I just found another answer online in the comments:

For medians in almost any SQL:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

Make sure your columns are well indexed and that the index is used for filtering and sorting. Verify with the EXPLAIN plans.

select count(*) from table -- find the number of rows

Calculate the "median" row number. Maybe use: median_row = floor(count / 2).

Then pick it out of the list:

select val from table order by val asc limit median_row,1

This should return you one row with just the value you want.
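
A rough Python rendering of the count-then-offset steps above, for illustration only (the function name median_row_value is made up). It relies on the LIMIT offset being 0-based, which is why floor(count / 2) lands on the middle row of an odd-sized list.

```python
def median_row_value(values):
    ordered = sorted(values)        # ORDER BY val ASC
    median_row = len(ordered) // 2  # median_row = floor(count / 2)
    return ordered[median_row]      # LIMIT median_row, 1 (the offset is 0-based)


print(median_row_value([4, 7, 2, 2, 9, 8, 3]))
```

With an even number of rows this picks the upper of the two middle values rather than their average.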

Jacob

Answered by zookatron

I found the accepted solution didn't work on my MySQL install, returning an empty set, but this query worked for me in all situations that I tested it on:

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
LIMIT 1
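
To see what the HAVING clause tests: SIGN(1-SIGN(y.val-x.val)) is 1 when y.val <= x.val and 0 otherwise, so the ratio is the fraction of rows at or below x.val. The Python sketch below mirrors that logic. It is an illustration, not the answer's code, and it assumes the groups come back in ascending order of val, which the query does not request with an ORDER BY.

```python
def median_by_self_join(values):
    n = len(values)
    for x in sorted(set(values)):  # one group per distinct x.val
        # SUM(SIGN(1-SIGN(y.val-x.val))) / COUNT(*): share of rows with y.val <= x.val
        at_or_below = sum(1 for y in values if y <= x)
        if at_or_below / n > 0.5:  # HAVING ... > .5
            return x               # LIMIT 1
    return None                    # empty input: no group qualifies


print(median_by_self_join([4, 7, 2, 2, 9, 8, 3]))
```

On the question's data the first value whose share exceeds one half is 4; for an even-sized list such as 1, 2, 3, 4 it yields the upper middle value, 3.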

Answered by bob

Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.

velcrow's answer above is close, but it does not calculate correctly for result sets with an even number of rows. The median is defined as either 1) the middle number of a set with an odd number of values, or 2) the average of the two middle numbers of a set with an even number of values.

So, here's velcrow's solution patched to handle sets with either an odd or an even number of values:

SELECT AVG(middle_values) AS 'median' FROM (
  SELECT t1.median_column AS 'middle_values' FROM
    (
      SELECT @row:=@row+1 as `row`, x.median_column
      FROM median_table AS x, (SELECT @row:=0) AS r
      WHERE 1
      -- put some where clause here
      ORDER BY x.median_column
    ) AS t1,
    (
      SELECT COUNT(*) as 'count'
      FROM median_table x
      WHERE 1
      -- put same where clause here
    ) AS t2
    -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
    WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;
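
The final WHERE condition can be checked by hand with a small Python sketch. This is an illustration with a made-up function name, assuming 1-based row numbers as produced by @row:=@row+1 and a non-empty input.

```python
def median_by_range(values):
    ordered = sorted(values)
    count = len(ordered)
    # Keep the 1-based rows that satisfy: row >= count/2 AND row <= count/2 + 1
    middle = [v for row, v in enumerate(ordered, start=1)
              if count / 2 <= row <= count / 2 + 1]
    return sum(middle) / len(middle)  # AVG(middle_values)


print(median_by_range([4, 7, 2, 2, 9, 8, 3]))
print(median_by_range([1, 2, 3, 4]))
```

With 7 rows only row 4 falls between 3.5 and 4.5; with 4 rows, rows 2 and 3 fall between 2 and 3, so their average is returned.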

To use this, follow these 3 easy steps:

  1. Replace "median_table" (2 occurrences) in the above code with the name of your table
  2. Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
  3. If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition

Answered by Reggie Edwards

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;
-- @middlevalue stands for the count from the first query; LIMIT needs a literal or a prepared-statement ? placeholder.

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.
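
In Python terms, the two queries amount to the sketch below. It is illustrative only (the function name lower_median is made up) and assumes a non-empty input.

```python
import math


def lower_median(values):
    middle = math.ceil(len(values) / 2)   # CEIL(COUNT(*)/2)
    first_half = sorted(values)[:middle]  # ORDER BY val LIMIT middle
    return max(first_half)                # MAX(val) of that first half


print(lower_median([4, 7, 2, 2, 9, 8, 3]))
```

For an even number of rows this returns the lower of the two middle values, not their average.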

Answered by Sebastian Paaske Tørholm

A comment on this page in the MySQL documentation has the following suggestion:

-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:

DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;


INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);



-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
    id,
    AVG(val) AS MEDIAN
FROM (
    SELECT
        id,
        val
    FROM (
        SELECT
            -- Create an index n for every id
            @a := (@a + 1) mod o.c AS shifted_n,
            IF(@a mod o.c=0, o.c, @a) AS n,
            o.id,
            o.val,
            -- the number of elements for every id
            o.c
        FROM (
            SELECT
                t_o.id,
                val,
                c
            FROM
                table_median t_o INNER JOIN
                (SELECT
                    id,
                    COUNT(1) AS c
                FROM
                    table_median
                GROUP BY
                    id
                ) t2
                ON (t2.id = t_o.id)
            ORDER BY
                t_o.id,val
        ) o
    ) a
    WHERE
        IF(
            -- if there is an even number of elements
            -- take the lower and the upper median
            -- and use AVG(lower,upper)
            c MOD 2 = 0,
            n = c DIV 2 OR n = (c DIV 2)+1,

            -- if its an odd number of elements
            -- take the first if its only one element
            -- or take the one in the middle
            IF(
                c = 1,
                n = 1,
                n = c DIV 2 + 1
            )
        )
) a
GROUP BY
    id;

-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1

-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4


-- from there we can select the n-th element on the position: count div 2 + 1 
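
The per-group selection rule can be cross-checked against the sample rows with a short Python sketch. This is an illustration, not Peter Hlavac's code: the function name is made up, and the positions are translated from the 1-based n in the SQL to 0-based list indexes.

```python
from collections import defaultdict

# The rows inserted into table_median above, as (id, val) pairs.
rows = [(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
        (2, 4),
        (3, 5), (3, 2),
        (4, 5), (4, 12), (4, 1), (4, 7)]


def median_per_group(rows):
    groups = defaultdict(list)
    for group_id, val in rows:
        groups[group_id].append(val)
    result = {}
    for group_id, vals in groups.items():
        vals.sort()
        c = len(vals)
        if c % 2 == 0:
            # even count: n = c DIV 2 and n = (c DIV 2)+1, then AVG
            picked = [vals[c // 2 - 1], vals[c // 2]]
        else:
            # odd count: n = c DIV 2 + 1
            picked = [vals[c // 2]]
        result[group_id] = sum(picked) / len(picked)
    return result


print(median_per_group(rows))
```

For the sample data this gives 5 for id 1, 4 for id 2, 3.5 for id 3 and 6 for id 4.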

Answered by Leonardo Nicolas

Install and use these MySQL statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/

After that, calculating the median is easy:

SELECT median(val) FROM data;

Answered by Prashant Srivastav

I found the code below on HackerRank; it is pretty simple and works in each and every case.

SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE  
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL ) = 
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL );
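
The condition keeps every row that has as many strictly smaller values as strictly larger ones. A Python sketch of that test (illustrative only, with a made-up function name) makes it easy to try on your own data before relying on the query:

```python
def balanced_values(values):
    """Values with as many strictly smaller entries as strictly larger ones."""
    return [m for m in values
            if sum(1 for v in values if v < m) == sum(1 for v in values if v > m)]


print(balanced_values([4, 7, 2, 2, 9, 8, 3]))
print(balanced_values([1, 2, 3, 4]))
```

On the question's data this selects 4. For four distinct values such as 1, 2, 3, 4 no entry is balanced, so the list comes back empty, and a duplicated middle value is returned once per row.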

Answered by Nico

Most of the solutions above work only for one field of the table; you might need to get the median (50th percentile) for many fields in the query.

I use this:

SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
 GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
  ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
FROM table_name;

You can replace the "50" in the example above with any percentile; it is very efficient.

Just make sure the GROUP_CONCAT length limit is large enough for your data; you can change it with:

SET group_concat_max_len = 10485760; #10MB max length

More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/

Answered by Doug

Building off of velcrow's answer, for those of you who have to compute a median over something that is grouped by another parameter:

SELECT grp_field, t1.val FROM (
   SELECT grp_field, @rownum:=IF(@s = grp_field, @rownum + 1, 1) AS `row_number`,
   @s:=IF(@s = grp_field, @s, grp_field) AS sec, d.val
  FROM data d,  (SELECT @rownum:=0, @s:=0) r
  ORDER BY grp_field, d.val
) as t1 JOIN (
  SELECT grp_field, count(*) as total_rows
  FROM data d
  GROUP BY grp_field
) as t2
ON t1.grp_field = t2.grp_field
WHERE t1.row_number=floor(total_rows/2)+1;
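
The intended per-group selection can be sketched in Python as follows. This is an illustration under one assumption: row numbers within each group start at 1, so the 1-based position floor(total_rows/2)+1 corresponds to the 0-based index total_rows // 2. The function name and the sample rows are made up.

```python
from itertools import groupby


def grouped_middle(rows):
    """rows: (grp_field, val) pairs; returns {grp_field: value at the middle position}."""
    result = {}
    ordered = sorted(rows)  # ORDER BY grp_field, d.val
    for grp, members in groupby(ordered, key=lambda r: r[0]):
        vals = [val for _, val in members]
        total_rows = len(vals)
        # 1-based row floor(total_rows/2)+1 is 0-based index total_rows // 2
        result[grp] = vals[total_rows // 2]
    return result


sample = [("a", 4), ("a", 7), ("a", 2), ("b", 1), ("b", 3), ("b", 5), ("b", 9)]
print(grouped_middle(sample))
```

For a group with an even number of rows this picks the upper of the two middle values instead of averaging them.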