Pythonic way of detecting outliers in one dimensional observation data

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/22354094/

Time: 2020-08-19 00:47:11  Source: igfitidea

python numpy matplotlib statistics statsmodels

Asked by

For the given data, I want to set the outlier values (defined by a 95% confidence level, a 95% quantile function, or whatever else is required) to nan. Below are my data and the code I am using right now. I would be glad if someone could explain this further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

Answered by CT Zhu

Use np.percentile, as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# select the values within the central 95% (use >=, <= to include the bounds)
data[(percentiles[0]<data) & (percentiles[1]>data)]

# set the outliers to np.nan
data[(percentiles[0]>data) | (percentiles[1]<data)] = np.nan

Answered by Joe Kington

The problem with using percentile is that the number of points identified as outliers is a function of your sample size: a fixed fraction of the data is always flagged, whether or not those points are unusual.
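
To make the sample-size effect concrete, here is a minimal sketch that applies a 95% percentile cut to clean, normally distributed data with no outliers added. The helper name percentile_flags and the seed are my own choices for illustration, not part of the original answers.

```python
import numpy as np

def percentile_flags(data, threshold=95):
    """Flag everything outside the central `threshold` percent of the data."""
    tail = (100 - threshold) / 2.0
    lo, hi = np.percentile(data, [tail, 100 - tail])
    return (data < lo) | (data > hi)

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
counts = {}
for n in (100, 1000, 10000):
    clean = rng.normal(0.0, 0.5, n)  # no outliers were added to this sample
    counts[n] = int(percentile_flags(clean).sum())
    print(n, counts[n])
```

Roughly 5% of the points are flagged at every sample size, so the count grows with n even though the data contain no real outliers.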

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
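
For the one-dimensional data in the question, the same test can be sketched more compactly and used to set the flagged values to nan. This is only an illustrative sketch: the name mad_outlier_mask, the two injected values, and the guard for a zero MAD are my assumptions, not part of the function above.

```python
import numpy as np

def mad_outlier_mask(values, thresh=3.5):
    """1-D modified z-score test based on the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    abs_dev = np.abs(values - np.median(values))
    mad = np.median(abs_dev)
    if mad == 0:
        # Assumption: with a zero MAD the score is undefined, so flag nothing.
        return np.zeros(values.shape, dtype=bool)
    return 0.6745 * abs_dev / mad > thresh

rng = np.random.default_rng(42)      # arbitrary seed
data = rng.random(1000) + 5.0        # same shape of data as in the question
data[[10, 500]] = [9.0, 0.5]         # two artificial outliers for the demo

mask = mad_outlier_mask(data)
cleaned = np.where(mask, np.nan, data)  # copy with outliers replaced by nan
print(np.flatnonzero(mask))
```

Using np.where leaves the original array untouched; assigning data[mask] = np.nan would modify it in place instead.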

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        # Density curve plus a rug of the individual observations
        sns.kdeplot(x, ax=ax)
        sns.rugplot(x, ax=ax)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()


[Four figures, one per sample size (n = 10, 50, 100, 1000), each titled "Comparing Outlier Tests with n=..." and showing percentile-based outliers in the top panel and MAD-based outliers in the bottom panel]

Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points as outliers the larger the sample size is, regardless of whether or not they are actually outliers.

Answered by sergeyf

I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but it makes the comparison apples-to-apples.

def doubleMADsfromMedian(y,thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when 
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
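
As a rough illustration of what the asymmetric version changes, the sketch below runs a symmetric MAD test and a double MAD test on a small skewed sample. Both helper names and the sample values are made up for this example; the double MAD helper restates the logic of the function above in compact form.

```python
import numpy as np

def symmetric_mad_mask(y, thresh=3.5):
    """Single MAD computed over all points (1-D case)."""
    y = np.asarray(y, dtype=float)
    abs_dev = np.abs(y - np.median(y))
    return 0.6745 * abs_dev / np.median(abs_dev) > thresh

def double_mad_mask(y, thresh=3.5):
    """Separate MADs for the points below and above the median."""
    y = np.asarray(y, dtype=float)
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = np.where(y > m, right_mad, left_mad)
    score = 0.6745 * abs_dev / y_mad
    score[y == m] = 0
    return score > thresh

# Illustrative right-skewed sample (not from the original thread)
y = np.array([1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30])
sym_flagged = y[symmetric_mad_mask(y)]
double_flagged = y[double_mad_mask(y)]
print(sym_flagged, double_flagged)
```

On this sample the symmetric test flags 16 and 30, while the double MAD test also flags 1, because the left side of the data is much tighter than the right side and is judged against its own, smaller MAD.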

Answered by shivangi dhakad

Detecting outliers in one-dimensional data depends on the distribution of the data.

1 - Normal Distribution:

  1. Data values are almost equally distributed over the expected range: in this case you can use any of the methods based on the mean, such as an interval of 2 or 3 standard deviations around the mean (covering about 95% or 99.7% of normally distributed data, respectively). It is a highly effective method. See the central limit theorem and the sampling distribution of the sample mean, explained in the Khan Academy statistics and probability library on sampling distributions.

     Another option is a prediction interval, if you want an interval for individual data points rather than for the mean.

  2. Data values are randomly distributed over a range: the mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical). The median is another way to measure the center of a numerical data set.

     Median Absolute Deviation: a method that measures the distance of all points from the median in terms of the median distance. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, and the method is shown in Joe Kington's answer above.

2 - Symmetric Distribution: again, Median Absolute Deviation is a good method if the z-score calculation and threshold are changed accordingly.

Explanation: http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3 - Asymmetric Distribution: Double MAD (Double Median Absolute Deviation). It is explained at the link above.

Attaching my Python code for reference:

def is_outlier_doubleMAD(self,points):
    """
    FOR ASYMMETRIC DISTRIBUTION
    Returns : filtered list excluding the outliers

    Parameters : the actual data Points array

    Calculates median to divide data into 2 halves.(skew conditions handled)
    Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer)
    Only difference being, the thresholds are now the median distance of the right and left median with the actual data median
    """

    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    # Integer division so the result can be used as a slice index
    medianIndex = points.size // 2

    # The split is by position, so the halves only lie below/above the
    # median if `points` is already sorted in ascending order
    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1),0.000001)
    med_abs_deviation2 = max(np.median(diff2),0.000001)

    threshold1 = ((median-median1)/med_abs_deviation1)*3
    threshold2 = ((median2-median)/med_abs_deviation2)*3

    # if a threshold is 0, disable it so that no points are removed on that side
    if threshold1==0:
        threshold1 = np.inf
    if threshold2==0:
        threshold2 = np.inf
    #multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    i = 0
    for data in modified_z_score1:
        if data < threshold1:
            filtered1.append(leftData[i])
        i += 1
    i = 0
    filtered2 = []
    for data in modified_z_score2:
        if data < threshold2:
            filtered2.append(rightData[i])
        i += 1

    filtered = filtered1 + filtered2
    return filtered

Answered by jimseeve

A simple solution can also be to remove anything that lies more than 2 standard deviations (or 1.96) from the mean:

import random
def outliers(tmp):
    """tmp is a list of numbers"""
    outs = []
    mean = sum(tmp)/(1.0*len(tmp))
    var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp))
    std = var**0.5
    outs = [tmp[i] for i in range(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std]
    return outs


lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
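
The same rule can be sketched with NumPy so that the flagged values are replaced by nan, as the question asks. The function name std_outlier_mask and the sample values are assumptions made for illustration; values.std() uses the population standard deviation (dividing by n), which matches the calculation above.

```python
import numpy as np

def std_outlier_mask(values, n_std=1.96):
    """Flag points further than n_std standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) > n_std * values.std()

values = np.r_[np.arange(20.0), 100.0]   # 0..19 plus one extreme value
mask = std_outlier_mask(values)
cleaned = np.where(mask, np.nan, values)  # copy with outliers set to nan
print(values[mask])
```

Keep in mind that the mean and standard deviation are themselves pulled by the outliers, which is the reason the median-based tests in the other answers are often preferred.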