Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/3077348/

Eliminating outliers by standard deviation in SQL Server

Tags: sql, sql-server, sql-server-2008, statistics

Asked by David Pfeffer

I am trying to eliminate outliers in SQL Server 2008 by standard deviation. I want only the records whose value in a specific column lies within +/- 1 standard deviation of that column's mean.

How can I accomplish this?

Answered by amelvin

If you are assuming a bell curve (normal) distribution of events, then only about 68% of values will fall within 1 standard deviation of the mean (about 95% are covered by 2 standard deviations).

I would load a variable with the standard deviation of your range (derived using the STDEV / STDEVP SQL functions) and then select the values that are within the appropriate number of standard deviations.

declare @stdtest table (colname varchar(20), colvalue int)

insert into @stdtest (colname, colvalue) values ('a', 2)
insert into @stdtest (colname, colvalue) values ('b', 4)
insert into @stdtest (colname, colvalue) values ('c', 4)
insert into @stdtest (colname, colvalue) values ('d', 4)
insert into @stdtest (colname, colvalue) values ('e', 5)
insert into @stdtest (colname, colvalue) values ('f', 5)
insert into @stdtest (colname, colvalue) values ('g', 7)
insert into @stdtest (colname, colvalue) values ('h', 9)

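-- float keeps the fractional part; a bare "decimal" defaults to decimal(18,0) and rounds
declare @std float
declare @mean float
declare @lower float
declare @higher float
declare @noofstds int

-- cast before averaging, because AVG over an int column returns an int
select @std = STDEV(colvalue), @mean = AVG(CAST(colvalue AS float)) from @stdtest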

--68%
set @noofstds = 1
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)

select @lower, @higher, * from @stdtest where colvalue between @lower and @higher

--returns rows b through g (colvalue 4, 4, 4, 5, 5, 7)

--95%
set @noofstds = 2
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)

select @lower, @higher, * from @stdtest where colvalue between @lower and @higher

--returns all eight rows (colvalue 2 through 9)
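
For reference, here is the arithmetic for the sample data above worked by hand. It assumes the sample standard deviation (dividing by n - 1), which is what STDEV computes; STDEVP divides by n instead and would give exactly 2 for this data. The bounds shown are the exact, unrounded values.

```latex
\begin{align*}
\bar{x} &= \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \\
\sum_{i=1}^{8} (x_i - \bar{x})^2 &= 9+1+1+1+0+0+4+16 = 32 \\
s &= \sqrt{\frac{32}{8-1}} = \sqrt{\frac{32}{7}} \approx 2.138 \\
\bar{x} \pm 1s &\approx [\,2.862,\ 7.138\,] \\
\bar{x} \pm 2s &\approx [\,0.724,\ 9.276\,]
\end{align*}
```

So one standard deviation keeps the values 4, 4, 4, 5, 5 and 7, and two standard deviations keep all eight values.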

Answered by Mike M.

There is an aggregate function called STDEV in SQL that will give you the standard deviation. That is the hard part; after that, just select the values that fall within the mean +/- one STDEV value.

This is one way you could go about doing it:

create table #test
(
    testNumber int
)

INSERT INTO #test (testNumber)
SELECT 2
UNION ALL
SELECT 4
UNION ALL
SELECT 4
UNION ALL
SELECT 4
UNION ALL
SELECT 5
UNION ALL
SELECT 5
UNION ALL
SELECT 7
UNION ALL
SELECT 9

-- keep only the rows within one standard deviation of the mean
SELECT t.testNumber
FROM #test t
JOIN (
    -- cast before averaging, because AVG over an int column returns an int
    SELECT STDEV(testNumber) AS stdDev, AVG(CAST(testNumber AS float)) AS mean
    FROM #test
) X ON t.testNumber >= X.mean - X.stdDev AND t.testNumber <= X.mean + X.stdDev
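
The same filter can probably also be written without the self-join by using windowed aggregates. This is only a sketch, not part of the original answer: the CTE names `samples` and `scored` and the inlined data are made up for illustration, and it assumes SQL Server 2008 or later (for the `VALUES` row constructor and `OVER ()` on aggregates).

```sql
-- Hypothetical sample data, inlined so the batch is self-contained
WITH samples AS (
    SELECT v.testNumber
    FROM (VALUES (2), (4), (4), (4), (5), (5), (7), (9)) AS v (testNumber)
),
scored AS (
    -- OVER () computes each aggregate across all rows and repeats it on every row
    SELECT testNumber,
           AVG(CAST(testNumber AS float)) OVER () AS mean,
           STDEV(testNumber) OVER () AS stdDev
    FROM samples
)
-- Window functions are not allowed in WHERE, so the filter runs on the CTE output
SELECT testNumber
FROM scored
WHERE testNumber BETWEEN mean - stdDev AND mean + stdDev;
```

For this data it should return the six values 4, 4, 4, 5, 5 and 7 (in no guaranteed order). Changing `OVER ()` to `OVER (PARTITION BY someColumn)` would compute the bounds per group, where `someColumn` is a placeholder for whatever grouping column your table has.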

Answered by duffymo

I'd be careful and think about what you're doing. Throwing away outliers might mean discarding information simply because it doesn't fit a preconceived world view, and that view could be quite wrong. Those outliers might be "black swans" that are rare, though not as rare as you'd think, and quite significant.

You give no context or explanation of what you're doing. It's easy to cite a function or technique that will fulfill the needs of your particular case, but I thought it appropriate to post the caution until additional information is supplied.
