pandas: How to remove outliers from a DataFrame using IQR?
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/50461349/
Asked by Imran Ahmad Ghazali
I have a DataFrame with many columns (around 100 features). I want to apply the interquartile range (IQR) method to remove the outliers from the DataFrame.
I am using this link: stackOverflow
But the problem is that none of the methods above works correctly.
This is what I am trying:
Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
It gives me this:
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]:
Day 0
Col1 0
Col2 0
col3 0
Col4 0
Step_Count 1179
dtype: int64
I just want to know what to do next so that all the outliers are removed from the DataFrame.
If I use this:
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

re_dat = remove_outlier(stepframe, stepframe.columns)
I get this error:
ValueError: Cannot index with multidimensional key
in this line:
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
Accepted answer by jezrael
You can use:
import numpy as np
import pandas as pd

# Build a reproducible sample and inflate some values so they act as outliers
np.random.seed(33454)
stepframe = pd.DataFrame({'a': np.random.randint(1, 200, 20),
                          'b': np.random.randint(1, 200, 20),
                          'c': np.random.randint(1, 200, 20)})
stepframe[stepframe > 150] *= 10
print(stepframe)

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows in which no column falls outside the IQR fences
df = stepframe[~((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).any(axis=1)]
print(df)
a b c
1 109 50 124
3 137 60 1990
4 19 138 100
5 86 83 143
6 55 23 58
7 78 145 18
8 132 39 65
9 37 146 1970
13 67 148 1880
15 124 102 21
16 93 61 56
17 84 21 25
19 34 52 126
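To make the fences concrete, here is a minimal sketch on a small hand-made frame. The name `toy`, its values, and the `fences` table are illustrative assumptions and not part of the original answer; the filtering expression is the same one used above.

```python
import pandas as pd

# Hypothetical toy data: column "x" contains one obvious outlier (100).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 20, 30, 40, 50]})

q1 = toy.quantile(0.25)   # per-column first quartile
q3 = toy.quantile(0.75)   # per-column third quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr    # x: 2 - 3 = -1,  y: 20 - 30 = -10
upper = q3 + 1.5 * iqr    # x: 4 + 3 = 7,   y: 40 + 30 = 70

fences = pd.DataFrame({"lower": lower, "upper": upper})
print(fences)

# Same filter as in the answer: drop rows with a value outside the fences.
cleaned = toy[~((toy < lower) | (toy > upper)).any(axis=1)]
print(cleaned)  # the row with x == 100 is removed
```

Comparing a DataFrame with a Series such as `lower` aligns the Series index with the column names, which is why each column is checked against its own fence.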
Details:
First, create a boolean DataFrame by chaining the two conditions with | (element-wise OR):
print (((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))))
a b c
0 False True False
1 False False False
2 True False False
3 False False False
4 False False False
5 False False False
6 False False False
7 False False False
8 False False False
9 False False False
10 True False False
11 False True False
12 False True False
13 False False False
14 False True False
15 False False False
16 False False False
17 False False False
18 False True False
19 False False False
Then use DataFrame.any to check for at least one True per row, and finally invert the boolean mask with ~:
print (~((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).any(axis=1))
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
12 False
13 True
14 False
15 True
16 True
17 True
18 False
19 True
dtype: bool
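The reductions applied to the boolean DataFrame can be seen in isolation on a small hand-written mask. The `flags` frame below is a made-up example, not data from the answer; it only illustrates what sum, any(axis=1) and ~ do.

```python
import pandas as pd

# Hypothetical mask: True marks a value flagged as an outlier.
flags = pd.DataFrame({
    "a": [False, True, False],
    "b": [False, False, False],
    "c": [True, False, False],
})

per_column = flags.sum()             # flagged values per column: a=1, b=0, c=1
row_has_outlier = flags.any(axis=1)  # True where a row has at least one flag
keep = ~row_has_outlier              # inverted mask: rows without any flag

print(per_column)
print(keep)
```

The per-column sum is the same kind of count the question prints, while the inverted row mask is what gets passed to stepframe[...] to keep the clean rows.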
Inverted solution with changed conditions: < becomes >= and > becomes <=, the two conditions are chained with & (AND), and the final filter uses all to check that every value in a row is True:
print (((stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1))
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 False
12 False
13 True
14 False
15 True
16 True
17 True
18 False
19 True
dtype: bool
df = stepframe[((stepframe >= (Q1 - 1.5 * IQR))& (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1)]
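As a sketch of how the two filters could be wrapped for reuse: the helper names `iqr_fences` and `drop_outlier_rows`, the parameter `k`, and the `demo` data are illustrative assumptions, not part of the original answer. Restricting the computation to numeric columns with select_dtypes is also an assumption, made so that text or date columns do not enter the quantile step. The example additionally shows one case where the two filters differ: a row containing NaN is kept by the ~...any(axis=1) form but dropped by the ...all(axis=1) form, because every comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd


def iqr_fences(df, k=1.5):
    """Return (lower, upper) fences per numeric column (hypothetical helper)."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outlier_rows(df, k=1.5):
    """Drop rows where any numeric value lies outside the fences (hypothetical helper)."""
    lower, upper = iqr_fences(df, k)
    num = df[lower.index]  # only the columns that have fences
    outlier = (num < lower) | (num > upper)
    return df[~outlier.any(axis=1)]


demo = pd.DataFrame({
    "label": list("abcdef"),
    "v": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
})

# Form 1: invert the "any outlier" mask. The NaN row is kept.
cleaned = drop_outlier_rows(demo)
print(cleaned)

# Form 2: require every value to be inside the fences. The NaN row is dropped.
lower, upper = iqr_fences(demo)
num = demo[lower.index]
inside = ((num >= lower) & (num <= upper)).all(axis=1)
strict = demo[inside]
print(strict)
```

If missing values are present, the choice between the two forms therefore also decides whether rows with NaN survive the filter.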