pandas Python: removing outliers from data

Question

Asked by chintan s

I have a data frame as follows:

ID Value
A   70
A   80
B   75
C   10
B   50
A   1000
C   60
B   2000
..  ..

I would like to group this data by ID, remove the outliers from each group (the points a boxplot would show as outliers), and then calculate the mean.

So far I have:

grouped = df.groupby('ID')

statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})

How can I find the outliers, remove them, and then get the statistics?

Answer 1

Answered by Sam

I believe the method you're referring to is to remove values that lie more than 1.5 times the interquartile range (IQR) away from the median. So first, calculate the per-group statistics:

statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})

Then determine whether each value in the original DataFrame is an outlier:

def is_outlier(row):
    # Look up the statistics for this row's group
    stats = statBefore.loc[row.ID]
    iq_range = stats['q3'] - stats['q1']
    # Flag values more than 1.5 * IQR away from the group median
    return abs(row.Value - stats['median']) > 1.5 * iq_range

# Apply the function to the original df
df['outlier'] = df.apply(is_outlier, axis=1)
# Keep only the non-outliers
df_no_outliers = df[~df.outlier]
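
The row-by-row approach above measures distance from the median. A boxplot usually draws its whiskers from the quartiles instead (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), so the following is a minimal vectorized sketch of that quartile-based variant, not the method from the answer. The helper name `remove_outliers_iqr`, its parameters, and the sample values (extended from the question so that each group has enough rows for meaningful quartiles) are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample data: the question's rows plus extra hypothetical values per group
df = pd.DataFrame({
    'ID': ['A'] * 6 + ['B'] * 6,
    'Value': [70, 80, 75, 72, 78, 1000,
              75, 50, 60, 65, 55, 2000],
})

def remove_outliers_iqr(frame, group_col='ID', value_col='Value', k=1.5):
    """Hypothetical helper: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] of their group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = (frame[value_col] >= q1 - k * iqr) & (frame[value_col] <= q3 + k * iqr)
    return frame[keep]

df_no_outliers = remove_outliers_iqr(df)

# Statistics per ID after the outliers have been dropped
stat_after = df_no_outliers.groupby('ID')['Value'].agg(['mean', 'median', 'std'])
print(stat_after)
```

With very small groups the quartiles are themselves pulled toward the extreme value, so the fences can become too wide to exclude anything; it is worth checking how many rows each ID has before relying on this rule.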

Answer 2

Answered by B. M.

Just do this (it keeps only values below 100 before aggregating):

In [187]: df[df<100].groupby('ID').agg(['mean','median','std'])
Out[187]: 
   Value                  
    mean median        std
ID                        
A   75.0   75.0   7.071068
B   62.5   62.5  17.677670
C   35.0   35.0  35.355339
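
The expression `df[df<100]` compares every column with 100, including the string `ID` column, which can fail with a `TypeError` on recent pandas versions. The sketch below applies the same fixed-threshold idea to the `Value` column only, using the eight rows shown in the question; the name `threshold` is an assumption, and the cutoff of 100 is taken from the answer rather than derived from the data.

```python
import pandas as pd

# The eight rows shown in the question (the elided ".." rows are not included)
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'C', 'B', 'A', 'C', 'B'],
    'Value': [70, 80, 75, 10, 50, 1000, 60, 2000],
})

threshold = 100  # fixed cutoff from the answer, chosen by eye rather than computed

# Filter on the numeric column only, then aggregate per ID
stats = (
    df[df['Value'] < threshold]
    .groupby('ID')['Value']
    .agg(['mean', 'median', 'std'])
)
print(stats)
```

A fixed cutoff is simple, but it has to be picked by hand for each dataset, unlike the IQR-based rule in the first answer.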
