pandas 如何一次性检测和删除熊猫数据帧的每一列中的异常值?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/51879812/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 05:57:01  来源:igfitidea点击:

How to detect and remove outliers from each column of pandas dataframe at one go?

pythonpandasdataframe

提问by astroluv

I've a pandas data frame with six columns and i know there are some outliers in each column.So i have these two lines of code which is pretty much doing what i want to do. But it's removing outliers from only one column of the dataframe. so what if i want to remove outliers from each column together??

我有一个包含六列的 Pandas 数据框,我知道每列中有一些异常值。所以我有这两行代码几乎可以做我想做的事情。但它仅从数据帧的一列中删除异常值。那么如果我想从每一列中一起删除异常值怎么办?

df = pd.DataFrame({'stlines':np.random.normal(size=533)})
df = df[np.abs(df.stlines-df.stlines.mean()) <= (2*df.stlines.std())]

what would be the elegant way to do this?

这样做的优雅方式是什么?

回答by piRSquared

The problem is that your outliers in each column may happen for varying rows(records). I'd advise you be happy with substituting np.nan

问题是每列中的异常值可能发生在不同的行(记录)中。我建议你对替换感到满意np.nan

Setup

设置

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.normal(size=(20, 8)),
    columns=list('ABCDEFGH')
)

df

           A         B         C         D         E         F         G         H
0  -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955  0.316236  0.348782
1   0.715364  0.770763 -0.608208  0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2   0.746839 -1.504157  0.611362  0.400219 -0.959443  1.494226 -0.346508 -1.471558
3   1.063243  1.062997  0.591860  0.296212 -0.774732  0.831452  1.486976  0.256220
4  -0.899906  0.375085 -0.519501  0.050101  0.949959 -1.033773  0.948247  0.733776
5   1.236118  0.155475 -1.341267  0.162864  1.258253  0.778040  1.341599 -1.636039
6  -0.195368  0.131820  2.069013  0.048729 -1.500564  0.907342  0.029326  0.066119
7  -0.728821 -2.137846  1.402702 -0.017209 -0.071309 -0.533061  1.273899  0.348510
8  -0.920391  0.348579 -0.835074 -0.225377  0.206295 -0.582825 -1.511850  1.633570
9   0.403321  0.992765  0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203  0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11  0.068855 -0.525141  0.745524  0.911930 -0.277785 -0.866313  1.155518  1.421480
12  1.416653 -0.120607  1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13  2.222507  0.668158  0.463331 -0.302869  0.226355 -0.966131  1.015160 -0.329008
14 -1.070002  0.525867  0.616915  0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15  0.058956  0.242391 -0.660237 -0.081101  1.690625  0.296406 -0.938197  0.225710
16 -0.352254  0.170126 -0.943541  0.627847 -0.948773  0.126131  1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051  0.895515 -0.234507  1.005886 -1.350465 -0.959034
18  0.992524 -1.471428  0.270001 -1.197004 -0.324760 -1.383568  0.838075 -1.125205
19  0.024837  0.238895  0.350742 -0.541868 -0.730284  0.113695  0.068872 -0.032520


pandas.DataFrame.mask

pandas.DataFrame.mask

df.mask((df - df.mean()).abs() > 2 * df.std())

           A         B         C         D         E         F         G         H
0        NaN -1.268466       NaN       NaN -0.349286 -0.026955  0.316236  0.348782
1   0.715364  0.770763 -0.608208  0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2   0.746839 -1.504157  0.611362  0.400219 -0.959443       NaN -0.346508 -1.471558
3   1.063243  1.062997  0.591860  0.296212 -0.774732  0.831452  1.486976  0.256220
4  -0.899906  0.375085 -0.519501  0.050101  0.949959 -1.033773  0.948247  0.733776
5   1.236118  0.155475 -1.341267  0.162864  1.258253  0.778040  1.341599 -1.636039
6  -0.195368  0.131820  2.069013  0.048729 -1.500564  0.907342  0.029326  0.066119
7  -0.728821       NaN  1.402702 -0.017209 -0.071309 -0.533061  1.273899  0.348510
8  -0.920391  0.348579 -0.835074 -0.225377  0.206295 -0.582825 -1.511850       NaN
9   0.403321  0.992765  0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203  0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11  0.068855 -0.525141  0.745524  0.911930 -0.277785 -0.866313  1.155518  1.421480
12  1.416653 -0.120607  1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13       NaN  0.668158  0.463331 -0.302869  0.226355 -0.966131  1.015160 -0.329008
14 -1.070002  0.525867  0.616915  0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15  0.058956  0.242391 -0.660237 -0.081101       NaN  0.296406 -0.938197  0.225710
16 -0.352254  0.170126 -0.943541  0.627847 -0.948773  0.126131  1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051  0.895515 -0.234507  1.005886 -1.350465 -0.959034
18  0.992524 -1.471428  0.270001 -1.197004 -0.324760 -1.383568  0.838075 -1.125205
19  0.024837  0.238895  0.350742 -0.541868 -0.730284  0.113695  0.068872 -0.032520


+ dropna

+ dropna

If you only want rows for which no outliers exist for any column, you could follow up the above with dropna

如果您只想要任何列都不存在异常值的行,您可以按照上面的 dropna

df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()



      A         B         C         D         E         F         G         H
1   0.715364  0.770763 -0.608208  0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3   1.063243  1.062997  0.591860  0.296212 -0.774732  0.831452  1.486976  0.256220
4  -0.899906  0.375085 -0.519501  0.050101  0.949959 -1.033773  0.948247  0.733776
5   1.236118  0.155475 -1.341267  0.162864  1.258253  0.778040  1.341599 -1.636039
6  -0.195368  0.131820  2.069013  0.048729 -1.500564  0.907342  0.029326  0.066119
9   0.403321  0.992765  0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203  0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11  0.068855 -0.525141  0.745524  0.911930 -0.277785 -0.866313  1.155518  1.421480
12  1.416653 -0.120607  1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002  0.525867  0.616915  0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254  0.170126 -0.943541  0.627847 -0.948773  0.126131  1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051  0.895515 -0.234507  1.005886 -1.350465 -0.959034
18  0.992524 -1.471428  0.270001 -1.197004 -0.324760 -1.383568  0.838075 -1.125205
19  0.024837  0.238895  0.350742 -0.541868 -0.730284  0.113695  0.068872 -0.032520

回答by YOBEN_S

Assuming you have multiple columns using all

假设您有多个列使用 all

df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]

回答by PMende

I would do something like the following:

我会做类似以下的事情:

def filter_outliers(df, columns):
    masks = (make_mask(df, column) for column in columns)
    full_mask = np.logical_or.reduce(masks)
    return df[full_mask]

def make_mask(df, column):
    standardized = (df[column] - df[column].mean())/df[column].std()
    return standardized.abs() >= 2