pandas Python从数据中删除异常值

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/36864917/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 01:07:31  来源:igfitidea点击:

Python remove outliers from data

python-2.7numpypandasscipy

提问by chintan s

I have a data frame as following:

我有一个数据框如下:

ID Value
A   70
A   80
B   75
C   10
B   50
A   1000
C   60
B   2000
..  ..

I would like to group this data by ID, remove the outliers from the grouped data (the ones we see from the boxplot) and then calculate mean.

我想按 ID 对这些数据进行分组,从分组数据(我们从箱线图中看到的数据)中删除异常值,然后计算平均值。

So far

迄今为止

grouped = df.groupby('ID')

statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})

How can I find outliers, remove them and get the statistics.

我怎样才能找到异常值,删除它们并获取统计数据。

回答by Sam

I believe the method you're referring to is to remove values > 1.5 * the interquartile range away from the median. So first, calculate your initial statistics:

我相信您所指的方法是从中位数中删除 > 1.5 * 四分位距的值。因此,首先,计算您的初始统计数据:

statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' : grouped['Value'].quantile(.75)})

And then determine whether values in the original DF are outliers:

然后判断原始DF中的值是否为异常值:

def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + (1.5* iq_range)) or row.Value < (median - (1.5* iq_range)):
        return True
    else:
        return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis = 1)
#filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]

回答by B. M.

just do :

做就是了 :

In [187]: df[df<100].groupby('ID').agg(['mean','median','std'])
Out[187]: 
   Value                  
    mean median        std
ID                        
A   75.0   75.0   7.071068
B   62.5   62.5  17.677670
C   35.0   35.0  35.355339