pandas: How to detect and remove outliers from each column of a pandas DataFrame in one go?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/51879812/
How to detect and remove outliers from each column of pandas dataframe at one go?
Asked by astroluv
I have a pandas DataFrame with six columns, and I know there are some outliers in each column. The two lines of code below do pretty much what I want, but they only remove outliers from a single column of the DataFrame. What if I want to remove outliers from every column at once?
import numpy as np
import pandas as pd
df = pd.DataFrame({'stlines': np.random.normal(size=533)})
df = df[np.abs(df.stlines - df.stlines.mean()) <= (2 * df.stlines.std())]
What would be the elegant way to do this?
Answer by piRSquared
The problem is that the outliers in each column may occur in different rows (records), so filtering column by column keeps a different set of rows for each column (see the sketch after the Setup below). I'd advise you to be content with substituting np.nan.
Setup
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.normal(size=(20, 8)),
    columns=list('ABCDEFGH')
)
df
A B C D E F G H
0 -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 1.494226 -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 -2.137846 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 1.633570
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 2.222507 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 1.690625 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
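As a quick illustration of the point above, here is a minimal sketch (not part of the original answer) that applies the question's single-column filter to every column of the sample df and counts how many rows each column would keep on its own:
# Per-column filtering keeps a different, differently sized set of rows for
# every column, so the results cannot be lined up back into one DataFrame.
kept_index = {
    col: df.index[(df[col] - df[col].mean()).abs() <= 2 * df[col].std()]
    for col in df.columns
}
print({col: len(idx) for col, idx in kept_index.items()})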
pandas.DataFrame.mask
df.mask((df - df.mean()).abs() > 2 * df.std())
A B C D E F G H
0 NaN -1.268466 NaN NaN -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 NaN -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 NaN 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 NaN
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 NaN 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 NaN 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
+ dropna
If you only want rows for which no outliers exist in any column, you can follow up the above with dropna:
df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()
A B C D E F G H
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
Answer by YOBEN_S
Assuming you have multiple columns, use all:
df[df.apply(lambda x: (x - x.mean()).abs() < (2 * x.std())).all(1)]
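Since df.mean() and df.std() broadcast column-wise across the frame, the same filter can also be written without apply; a sketch of an equivalent vectorized form:
# Keep rows where every column is within 2 standard deviations of its own mean.
df[((df - df.mean()).abs() < (2 * df.std())).all(1)]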
Answer by PMende
I would do something like the following:
def filter_outliers(df, columns):
    # Build an outlier flag for every requested column (a list, not a
    # generator, so np.logical_or.reduce can stack them), then drop each
    # row that is flagged in at least one column.
    masks = [make_mask(df, column) for column in columns]
    full_mask = np.logical_or.reduce(masks)
    return df[~full_mask]

def make_mask(df, column):
    # Flag values lying 2 or more standard deviations from the column mean.
    standardized = (df[column] - df[column].mean()) / df[column].std()
    return standardized.abs() >= 2
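A possible usage sketch, assuming the sample df from the Setup section above and an illustrative column list:
# Drop every row that is an outlier in at least one of the listed columns.
filtered = filter_outliers(df, columns=list('ABCDEFGH'))
print(filtered.shape)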