Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5401158/

Time: 2020-08-31 19:14:34  Source: igfitidea

Calculating the Median with Mysql

Tags: mysql, statistics, median

Asked by Tim

I'm having trouble with calculating the median of a list of values, not the average.

I found this article: "Simple way to calculate median with MySQL".

It has a reference to the following query which I don't understand properly.

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

If I have a time column and I want to calculate the median value, what do the x and y columns refer to?

Accepted answer by Krab

val is your time column; x and y are two references to the data table (you can write data AS x, data AS y).
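
To make the roles of x and y concrete, here is a minimal Python sketch (not MySQL) that mimics the self-join query. The function names sign and median_by_self_join are made up for illustration. For each candidate x.val it walks over every y.val, sums SIGN(1-SIGN(y.val-x.val)), which counts the y values that are less than or equal to x, and keeps the candidate whose sum equals (COUNT(*)+1)/2.

```python
def sign(v):
    """Return -1, 0 or 1, like the SQL SIGN function."""
    return (v > 0) - (v < 0)


def median_by_self_join(values):
    """Mimic GROUP BY x.val / HAVING over the cross join of values with itself."""
    results = []
    for x in sorted(set(values)):
        # Rows of the cross join that land in the group for this x.val.
        pairs = [(xv, yv) for xv in values if xv == x for yv in values]
        # Each pair contributes 1 when yv <= xv and 0 otherwise.
        total = sum(sign(1 - sign(yv - xv)) for xv, yv in pairs)
        if total == (len(pairs) + 1) / 2:
            results.append(x)
    return results


print(median_by_self_join([7, 1, 9, 3, 0]))  # [3]
print(median_by_self_join([0, 1, 3, 7]))     # []
```

In this sketch an even number of values gives an empty result, because (COUNT(*)+1)/2 is then not a whole number and no sum can equal it.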

EDIT: To avoid computing your sums twice, you can store the intermediate results.

CREATE TEMPORARY TABLE average_user_total_time 
      (SELECT SUM(time) AS time_taken 
            FROM scores 
            WHERE created_at >= '2010-10-10' 
                    and created_at <= '2010-11-11' 
            GROUP BY user_id);

Then you can compute the median over these values, which are now in a named table.

EDIT: A temporary table won't work here, because MySQL cannot refer to a TEMPORARY table more than once in the same query and the median query joins the table to itself. You could try a regular table with the MEMORY storage engine, or simply repeat the subquery that computes the values in both places of the median query. Apart from this, I don't see another solution. That doesn't mean there isn't a better way; maybe somebody else will come up with an idea.

Answer by Reggie Edwards

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) INTO @middlevalue FROM data;

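Then take the middle value in a sorted subquery, where @middlevalue stands for the row count computed in the previous step. MySQL accepts a user variable in LIMIT only inside a prepared statement, so either substitute the literal number or run the query through PREPARE and EXECUTE: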

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

This will find an arbitrary percentile if you replace COUNT(*)/2 with COUNT(*)*n, where n is the percentile as a fraction (.5 for the median, .75 for the 75th percentile, etc.).
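
The count-then-limit idea can be sketched in Python as follows. This is only an illustration of the logic, not the MySQL query itself; the name percentile_by_limit is hypothetical, and the sketch assumes a non-empty list and 0 < n <= 1.

```python
import math


def percentile_by_limit(values, n=0.5):
    """Sort, keep the first CEIL(COUNT(*) * n) rows, return the largest of them."""
    limit = math.ceil(len(values) * n)    # SELECT CEIL(COUNT(*) * n)
    first_rows = sorted(values)[:limit]   # ORDER BY val LIMIT limit
    return max(first_rows)                # SELECT MAX(val)


data = [0, 1, 3, 7, 9, 10, 11]
print(percentile_by_limit(data))        # 7  (median)
print(percentile_by_limit(data, 0.75))  # 10 (75th percentile)
```

Note that with an even number of rows this approach returns the lower of the two middle values rather than their average.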

Answer by Escualo

First try to understand what the median is: it is the middle value in the sorted list of values.

Once you understand that, the approach is two steps:

  1. sort the values in either order
  2. pick the middle value (if not an odd number of values, pick the average of the two middle values)

Example:

Median of 0 1 3 7 9 10: 5 (because (7+3)/2=5)
Median of 0 1 3 7 9 10 11: 7 (because 7 is the middle value)
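
As a minimal sketch, the two steps can be written in Python like this (the function name median is just for illustration, and a non-empty list is assumed):

```python
def median(values):
    """Sort the values, then pick the middle one or average the two middle ones."""
    ordered = sorted(values)              # step 1: sort
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average


print(median([0, 1, 3, 7, 9, 10]))      # 5.0
print(median([0, 1, 3, 7, 9, 10, 11]))  # 7
```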

So, to sort dates you need a numerical value; you can get their time stamp (as seconds elapsed from epoch) and use the definition of median.
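
For dates, a rough Python sketch of the same idea converts each value to seconds since the epoch, takes the median of those numbers, and converts back. The name median_datetime and the sample dates are assumptions made for illustration; UTC is used only to keep the example deterministic.

```python
from datetime import datetime, timezone


def median_datetime(moments):
    """Median of datetimes, computed on their epoch timestamps."""
    stamps = sorted(m.timestamp() for m in moments)
    n = len(stamps)
    mid = n // 2
    if n % 2 == 1:
        middle = stamps[mid]
    else:
        middle = (stamps[mid - 1] + stamps[mid]) / 2
    return datetime.fromtimestamp(middle, tz=timezone.utc)


moments = [
    datetime(2010, 10, 10, tzinfo=timezone.utc),
    datetime(2010, 10, 12, tzinfo=timezone.utc),
    datetime(2010, 11, 11, tzinfo=timezone.utc),
    datetime(2010, 10, 20, tzinfo=timezone.utc),
]
# The two middle dates are 2010-10-12 and 2010-10-20, so this prints 2010-10-16.
print(median_datetime(moments))
```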

Answer by subrat mishra

Finding the median in MySQL using group_concat

Query:

SELECT
    IF(count%2=1,
       SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1),
       (SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1) 
         + SUBSTRING_INDEX(substring_index(data_str,",",pos+1),",",-1))/2) 
    as median 
FROM (SELECT group_concat(val order by val) data_str,
      CEILING(count(*)/2) pos,
      count(*) as count from data)temp;

Explanation:

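Sorting is done with ORDER BY inside the group_concat function. Note that the result of group_concat is truncated at group_concat_max_len (1024 by default), so as written this only suits small tables.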

The position (pos) and the total number of elements (count) are identified. Using CEILING to compute the position lets us use the substring_index function in the steps below.

Based on count, we decide whether there is an even or odd number of values.

  • Odd number of values: directly choose the element at pos using substring_index.
  • Even number of values: find the elements at pos and pos+1, then add them and divide by 2 to get the median.

Finally the median is calculated.
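
The string manipulation may be easier to follow in a small Python sketch. Here substring_index is a rough stand-in for the MySQL function (for a non-zero count only), and median_by_group_concat is a hypothetical name; neither is part of the original answer.

```python
import math


def substring_index(text, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX with a non-zero count."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    return delim.join(parts[count:])       # everything right of it, counting from the end


def median_by_group_concat(values):
    # GROUP_CONCAT(val ORDER BY val): one comma-separated, sorted string.
    data_str = ",".join(str(v) for v in sorted(values))
    count = len(values)
    pos = math.ceil(count / 2)             # CEILING(COUNT(*)/2)
    at_pos = float(substring_index(substring_index(data_str, ",", pos), ",", -1))
    if count % 2 == 1:
        return at_pos
    at_next = float(substring_index(substring_index(data_str, ",", pos + 1), ",", -1))
    return (at_pos + at_next) / 2


print(median_by_group_concat([0, 1, 3, 7, 9, 10]))      # 5.0
print(median_by_group_concat([0, 1, 3, 7, 9, 10, 11]))  # 7.0
```

The inner call cuts the string after the element at pos, and the outer call with -1 keeps only that last element.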

Answer by Nicolas De Bin

If you have a table R with a column named A, and you want the median of A, you can do as follows:

SELECT A FROM R R1
WHERE ( SELECT COUNT(A) FROM R R2 WHERE R2.A < R1.A ) = ( SELECT COUNT(A) FROM R R3 WHERE R3.A > R1.A )

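Note: This only works if there are no duplicate values in A and the number of rows is odd; with an even number of distinct values, no row has equally many smaller and larger values, so the query returns nothing. Also, null values are not allowed.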
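
The counting condition in the WHERE clause can be sketched in Python like this (median_by_counting is a hypothetical name, and the list stands in for column A):

```python
def median_by_counting(values):
    """Keep the values that have as many smaller values as larger ones."""
    return [
        a for a in values
        if sum(1 for b in values if b < a) == sum(1 for b in values if b > a)
    ]


print(median_by_counting([9, 0, 7, 1, 3]))  # [3]
print(median_by_counting([9, 0, 7, 1]))     # []
```

With five distinct values the middle one is returned; with four distinct values nothing satisfies the condition in this sketch.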

Answer by jayu chawla

The simplest way my friend and I have found... ENJOY!!

SELECT count(*) INTO @c from station;
select ROUND((@c+1)/2) into @final; 
SELECT round(lat_n,4) from station a where @final-1=(select count(lat_n) from station b where b.lat_n > a.lat_n);
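
As a rough Python sketch of what these three statements do: count the rows, compute the target rank, and pick the value that has exactly rank-1 larger values. The name median_by_rank is made up, the list stands in for station.lat_n, and the sketch assumes that ROUND rounds a half upward here, which is why it uses (c + 2) // 2.

```python
def median_by_rank(values):
    c = len(values)        # SELECT COUNT(*) INTO @c
    final = (c + 2) // 2   # ROUND((@c+1)/2), assuming halves round up
    return [
        round(a, 4)
        for a in values
        # Keep the value with exactly final-1 strictly larger values.
        if sum(1 for b in values if b > a) == final - 1
    ]


print(median_by_rank([9, 0, 7, 1, 3]))  # [3]
print(median_by_rank([9, 0, 7, 1]))     # [1]
```

For an even number of rows this sketch returns the lower middle value instead of averaging the two middle values.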

Answer by Amitrajit Bose

Here is a solution that is easy to understand. Just replace Your_Column and Your_Table as per your requirement.

SET @r = 0;

SELECT AVG(Your_Column)
FROM (SELECT (@r := @r + 1) AS r, Your_Column FROM Your_Table ORDER BY Your_Column) Temp
WHERE
    r = (SELECT CEIL(COUNT(*) / 2) FROM Your_Table) OR
    r = (SELECT FLOOR((COUNT(*) / 2) + 1) FROM Your_Table)

Originally adapted from this thread.
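
The row-numbering query above can be mimicked in Python as a minimal sketch. The name median_by_row_number is hypothetical, the list stands in for Your_Column, and a non-empty list is assumed.

```python
import math


def median_by_row_number(values):
    ordered = sorted(values)               # ORDER BY Your_Column
    count = len(ordered)
    lower = math.ceil(count / 2)           # CEIL(COUNT(*) / 2)
    upper = math.floor(count / 2 + 1)      # FLOOR((COUNT(*) / 2) + 1)
    # r starts at 1, like @r := @r + 1 with @r initialised to 0.
    picked = [v for r, v in enumerate(ordered, start=1) if r in (lower, upper)]
    return sum(picked) / len(picked)       # AVG(Your_Column)


print(median_by_row_number([0, 1, 3, 7, 9, 10]))      # 5.0
print(median_by_row_number([0, 1, 3, 7, 9, 10, 11]))  # 7.0
```

For an odd count both row numbers coincide, so the average is taken over a single row; for an even count they select the two middle rows.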