Original question: http://stackoverflow.com/questions/1342898/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Function to Calculate Median in SQL Server
Asked by Yaakov Ellis
According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the CREATE AGGREGATE function, a user-defined function, or some other method).
What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?
Accepted answer by Justin Grant
2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the performance of various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries, which may not be practical in all cases. It also requires SQL 2012 or later.
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
SELECT val FROM dbo.EvenRows
ORDER BY val
OFFSET (@c - 1) / 2 ROWS
FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
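To see why the OFFSET and FETCH expressions land on the middle row or rows, the same arithmetic can be replayed outside the database. This is only a minimal Python sketch, not part of the original answer: the function name `offset_fetch_median` is made up for illustration, and `//` and `%` stand in for T-SQL integer division and modulo on a non-negative row count.

```python
def offset_fetch_median(values):
    """Mimic ORDER BY val OFFSET (@c - 1) / 2 ROWS FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY."""
    ordered = sorted(values)              # ORDER BY val
    c = len(ordered)                      # @c = COUNT(*)
    if c == 0:
        return None                       # AVG over zero rows is NULL in SQL
    offset = (c - 1) // 2                 # rows skipped before the middle
    fetch = 1 + (1 - c % 2)               # 1 row for an odd count, 2 rows for an even count
    middle = ordered[offset:offset + fetch]
    return sum(1.0 * v for v in middle) / len(middle)   # AVG(1.0 * val)


odd_result = offset_fetch_median([7, 1, 5])       # only the middle row is fetched
even_result = offset_fetch_median([7, 1, 5, 3])   # the two middle rows are averaged
print(odd_result, even_result)                    # prints: 5.0 4.0
```

For an odd count the window is one row wide, for an even count it is two rows wide, which is why a single AVG covers both cases.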
Of course, one test on one schema in 2012 yielding great results is no guarantee: your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.
I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
ORIGINAL 2009 POST IS BELOW:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from the article "Medians, ROW_NUMBERs, and performance". It does especially well on the actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.
As with all database performance scenarios, always try to test a solution out with real data on real hardware. You never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally speedy solution slower.
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
-- SalesOrderId in the ORDER BY is a disambiguator to break ties
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
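The RowAsc / RowDesc filter in the query above can be checked with a small simulation. The following Python sketch is an illustration only and assumes the tie-breaker column is unique, so that the descending order is the exact reverse of the ascending one; the function name `median_by_row_numbers` and the sample rows are hypothetical.

```python
from collections import defaultdict


def median_by_row_numbers(rows):
    """rows: iterable of (customer_id, total_due, order_id); order_id must be unique."""
    groups = defaultdict(list)
    for customer_id, total_due, order_id in rows:
        groups[customer_id].append((total_due, order_id))    # PARTITION BY CustomerId

    medians = {}
    for customer_id, items in groups.items():
        ascending = sorted(items)          # ORDER BY TotalDue ASC, SalesOrderId ASC
        n = len(ascending)
        picked = []
        for index, (total_due, _order_id) in enumerate(ascending):
            row_asc = index + 1
            row_desc = n - index           # exact reverse, thanks to the unique tie-breaker
            if row_asc in (row_desc, row_desc - 1, row_desc + 1):
                picked.append(total_due)   # one row (odd n) or two rows (even n) survive
        medians[customer_id] = sum(picked) / len(picked)      # AVG(TotalDue)
    return medians


sample = [
    (1, 10.0, 101), (1, 30.0, 102), (1, 20.0, 103),                   # odd count
    (2, 5.0, 201), (2, 5.0, 202), (2, 9.0, 203), (2, 7.0, 204),       # even count with a tie
]
print(median_by_row_numbers(sample))       # prints: {1: 20.0, 2: 6.0}
```

Because RowAsc + RowDesc always equals n + 1 under that assumption, the two counters can only be equal or differ by one at the middle of the partition.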
Answer by Jeff Atwood
If you're using SQL 2005 or better, this is a nice, simple-ish median calculation for a single column in a table:
SELECT
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2.0 AS Median -- dividing by 2.0 avoids integer division when Score is an integer column
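The reason the two TOP 50 PERCENT halves meet at the median is that TOP ... PERCENT rounds a fractional row count up, so with an odd number of rows both halves contain the middle row. Here is a rough Python sketch of that idea; the function name `top_half_median` is hypothetical and the rounding is modelled with `math.ceil`.

```python
import math


def top_half_median(scores):
    ordered = sorted(scores)
    if not ordered:
        return None
    half = math.ceil(len(ordered) * 50 / 100)   # TOP 50 PERCENT: fractional row counts round up
    bottom_half = ordered[:half]                # ... ORDER BY Score
    top_half = ordered[::-1][:half]             # ... ORDER BY Score DESC
    # Python's / is true division, so the fractional part is kept here.
    return (max(bottom_half) + min(top_half)) / 2


print(top_half_median([3, 1, 2]))        # prints: 2.0 (both halves contain the middle row)
print(top_half_median([1, 2, 3, 4]))     # prints: 2.5 (halves meet between rows 2 and 3)
```

In T-SQL, whether the final division keeps its fractional part depends on the operand types, so it is worth checking the result type when the column is an integer.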
Answer by Simon_Weaver
In SQL Server 2012 you should use PERCENTILE_CONT:
SELECT SalesOrderID, OrderQty,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
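As I understand the documented behaviour, PERCENTILE_CONT interpolates linearly between the two rows that surround the requested position, so for 0.5 and an even row count it returns the average of the two middle values. The Python sketch below illustrates that interpolation under this assumption; the function name `percentile_cont` is reused only for readability and is not a real library call.

```python
import math


def percentile_cont(values, p=0.5):
    """Linear interpolation at position p * (n - 1) in the sorted values (zero-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return None
    position = p * (n - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])      # position falls exactly on one row
    # Weight each neighbour by its closeness to the requested position.
    return (upper - position) * ordered[lower] + (position - lower) * ordered[upper]


print(percentile_cont([3, 1, 2]))         # prints: 2.0
print(percentile_cont([1, 2, 3, 4]))      # prints: 2.5
```

Note that in the query above PERCENTILE_CONT is used with OVER, so the median is repeated on every row of its partition rather than collapsing the partition to one row.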
See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/
Answer by Sir Wobin
My original quick answer was:
select max(my_column) as [my_column], quartile
from (select my_column, ntile(4) over (order by my_column) as [quartile]
from my_table) i
--where quartile = 2
group by quartile
This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median, then uncomment the where clause.
When you stick that into an explain plan, 60% of the work is sorting the data, which is unavoidable when calculating position-dependent statistics like this.
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:
;with PartitionedData as
(select my_column, ntile(10) over (order by my_column) as [percentile]
from my_table),
MinimaAndMaxima as
(select min(my_column) as [low], max(my_column) as [high], percentile
from PartitionedData
group by percentile)
select
case
when b.percentile = 10 then cast(b.high as decimal(18,2))
else cast((a.low + b.high) as decimal(18,2)) / 2
end as [value], --b.high, a.low,
b.percentile
from MinimaAndMaxima a
join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5
This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.
Answer by l--''''''---------''''''''''''
Even better:
DECLARE @Median DECIMAL(12, 2); -- variable that receives the result
SELECT @Median = AVG(1.0 * val)
FROM
(
SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
FROM dbo.EvenRows AS o
CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
From the master Himself, Itzik Ben-Gan!
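The filter rn IN ((c + 1)/2, (c + 2)/2) relies on integer division: for an odd count both expressions give the same row number, and for an even count they give the two middle rows. A small Python sketch of that arithmetic follows; the helper names are hypothetical and `//` stands in for T-SQL integer division.

```python
def middle_row_numbers(c):
    """Row numbers (1-based) selected by rn IN ((c + 1)/2, (c + 2)/2)."""
    return {(c + 1) // 2, (c + 2) // 2}


def row_number_median(values):
    ordered = sorted(values)                      # ROW_NUMBER() OVER (ORDER BY val)
    wanted = middle_row_numbers(len(ordered))
    picked = [1.0 * val for rn, val in enumerate(ordered, start=1) if rn in wanted]
    return sum(picked) / len(picked) if picked else None    # AVG(1.0 * val)


print(sorted(middle_row_numbers(5)))      # prints: [3]
print(sorted(middle_row_numbers(4)))      # prints: [2, 3]
print(row_number_median([7, 1, 5, 3]))    # prints: 4.0
```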
Answer by enkryptor
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function, which computes a specific percentile for sorted values. PERCENTILE_DISC (0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
Answer by Tobbi
Simple, fast, accurate
SELECT x.Amount
FROM (SELECT amount,
Count(1) OVER (partition BY 'A') AS TotalRows,
Row_number() OVER (ORDER BY Amount ASC) AS AmountOrder
FROM facttransaction ft) x
WHERE x.AmountOrder = Round(x.TotalRows / 2.0, 0)
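One thing worth checking with this query is which row it picks when the row count is even. The Python sketch below replays the position arithmetic, assuming that ROUND on a positive numeric value rounds halves away from zero; the function name `rounded_position_pick` is made up for this example.

```python
import math


def rounded_position_pick(amounts):
    """Return the row whose 1-based position equals ROUND(total_rows / 2.0, 0)."""
    ordered = sorted(amounts)                     # Row_number() OVER (ORDER BY Amount ASC)
    total_rows = len(ordered)                     # Count(1) OVER (...) spans every row
    if total_rows == 0:
        return None
    position = math.floor(total_rows / 2.0 + 0.5)   # round half up (positive values only)
    return ordered[position - 1]


print(rounded_position_pick([10, 20, 30]))        # prints: 20
print(rounded_position_pick([10, 20, 30, 40]))    # prints: 20
```

Under that assumption, an even row count yields the lower of the two middle values rather than their average, which may or may not be the definition of median you want.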
Answer by Rono
If you want to use the CREATE AGGREGATE function in SQL Server, this is how to do it. Doing it this way has the benefit of being able to write clean queries. Note that this process could be adapted to calculate a percentile value fairly easily.
Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different in SQL 2012). Then create a class file and put in the following code, or the C# equivalent:
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
Implements IBinarySerialize
Private _items As List(Of Decimal)
Public Sub Init()
_items = New List(Of Decimal)()
End Sub
Public Sub Accumulate(value As SqlDecimal)
If Not value.IsNull Then
_items.Add(value.Value)
End If
End Sub
Public Sub Merge(other As Median)
If other._items IsNot Nothing Then
_items.AddRange(other._items)
End If
End Sub
Public Function Terminate() As SqlDecimal
If _items.Count <> 0 Then
Dim result As Decimal
_items = _items.OrderBy(Function(i) i).ToList()
If _items.Count Mod 2 = 0 Then
' \ is integer division, so each index stays an Integer
result = (_items((_items.Count \ 2) - 1) + _items(_items.Count \ 2)) / 2@
Else
result = _items((_items.Count - 1) \ 2)
End If
Return New SqlDecimal(result)
Else
Return New SqlDecimal()
End If
End Function
Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
'deserialize it from a string
Dim list = r.ReadString()
_items = New List(Of Decimal)
For Each value In list.Split(","c)
Dim number As Decimal
If Decimal.TryParse(value, number) Then
_items.Add(number)
End If
Next
End Sub
Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
'serialize the list to a string
Dim list = ""
For Each item In _items
If list <> "" Then
list += ","
End If
list += item.ToString()
Next
w.Write(list)
End Sub
End Class
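The class above follows the usual shape of a user-defined aggregate: collect values, merge partial states, then compute the result at the end, with the state serialized in between. The Python sketch below mirrors that lifecycle purely as an illustration; the class `MedianAggregate` and its method names are hypothetical and are not a SQL Server or .NET API.

```python
from decimal import Decimal


class MedianAggregate:
    def __init__(self):                   # Init: start with an empty state
        self.items = []

    def accumulate(self, value):          # Accumulate: NULLs (None) are ignored
        if value is not None:
            self.items.append(Decimal(value))

    def merge(self, other):               # Merge: combine a partial state from another worker
        self.items.extend(other.items)

    def terminate(self):                  # Terminate: sort once, then pick the middle
        if not self.items:
            return None
        ordered = sorted(self.items)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 0:
            return (ordered[mid - 1] + ordered[mid]) / 2
        return ordered[mid]

    def write(self):                      # Write: serialize the state as a comma-separated string
        return ",".join(str(item) for item in self.items)

    @classmethod
    def read(cls, text):                  # Read: rebuild the state from that string
        restored = cls()
        for part in text.split(","):
            if part:
                restored.items.append(Decimal(part))
        return restored


left, right = MedianAggregate(), MedianAggregate()
for value in ("1", "9", None):
    left.accumulate(value)
for value in ("5", "3"):
    right.accumulate(value)
left.merge(right)
print(left.terminate())                   # prints: 4
```

A comma-separated string is a simple serialization format, but it depends on the decimal values themselves never containing a comma when converted to text.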
Then compile it and copy the DLL and PDB file to your SQL Server machine and run the following command in SQL Server:
CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO
CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3)
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO
You can then write a query to calculate the median like this: SELECT dbo.Median(Field) FROM Table
Answer by Jeff Sisson
Although Justin Grant's solution appears solid, I found that when you have a number of duplicate values within a given partition key, the row numbers for the ASC duplicate values end up out of sequence, so they do not properly align.
Here is a fragment from my result:
KEY  VALUE  ROWA  ROWD
13   2      22    182
13   1      6     183
13   1      7     184
13   1      8     185
13   1      9     186
13   1      10    187
13   1      11    188
13   1      12    189
13   0      1     190
13   0      2     191
13   0      3     192
13   0      4     193
13   0      5     194
I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
Answer by brian
I just came across this page while looking for a set-based solution to median. After looking at some of the solutions here, I came up with the following. Hope it helps/works.
DECLARE @test TABLE(
i int identity(1,1),
id int,
score float
)
INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)
INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)
INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)
DECLARE @counts TABLE(
id int,
cnt int
)
INSERT INTO @counts (
id,
cnt
)
SELECT
id,
COUNT(*)
FROM
@test
GROUP BY
id
SELECT
drv.id,
drv.start,
AVG(t.score)
FROM
(
SELECT
MIN(t.i)-1 AS start,
t.id
FROM
@test t
GROUP BY
t.id
) drv
INNER JOIN @test t ON drv.id = t.id
INNER JOIN @counts c ON t.id = c.id
WHERE
t.i = ((c.cnt+1)/2)+drv.start
OR (
t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
)
GROUP BY
drv.id,
drv.start