pandas: How to remove outliers from a DataFrame using IQR?

Question

Asked by Imran Ahmad Ghazali

I have a DataFrame with a lot of columns (around 100 features). I want to apply the interquartile range (IQR) method to remove the outliers from it.

I am following this Stack Overflow link.

But the problem is that none of the above methods works correctly.

This is what I am trying:

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()

It gives me this:

((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]: 
Day                      0
Col1                     0
Col2                     0
col3                     0
Col4                     0
Step_Count            1179
dtype: int64

I would like to know what to do next so that all the outliers are removed from the DataFrame.

If I use this:

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

re_dat = remove_outlier(stepframe, stepframe.columns)

I get this error:

ValueError: Cannot index with multidimensional key

on this line:

    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

Answer 1

Accepted answer by jezrael

You can use:

import numpy as np
import pandas as pd

np.random.seed(33454)
stepframe = pd.DataFrame({'a': np.random.randint(1, 200, 20), 
                          'b': np.random.randint(1, 200, 20),
                          'c': np.random.randint(1, 200, 20)})

stepframe[stepframe > 150] *= 10
print (stepframe)

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1

df = stepframe[~((stepframe < (Q1 - 1.5 * IQR)) |(stepframe > (Q3 + 1.5 * IQR))).any(axis=1)]

print (df)
      a    b     c
1   109   50   124
3   137   60  1990
4    19  138   100
5    86   83   143
6    55   23    58
7    78  145    18
8   132   39    65
9    37  146  1970
13   67  148  1880
15  124  102    21
16   93   61    56
17   84   21    25
19   34   52   126
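
To make the fences concrete, here is a minimal, hand-checkable sketch of the same filter. The small DataFrame `toy` and its values are made up for illustration (they are not the asker's data), and the quantiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical data: one obvious outlier per column, in different rows.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [-50, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})

q1 = toy.quantile(0.25)   # per-column first quartile
q3 = toy.quantile(0.75)   # per-column third quartile
iqr = q3 - q1
lower = q1 - 1.5 * iqr    # lower fence per column
upper = q3 + 1.5 * iqr    # upper fence per column

# True wherever a cell lies outside its column's fences.
is_outlier = (toy < lower) | (toy > upper)

# Keep only rows that have no outlier in any column.
clean = toy[~is_outlier.any(axis=1)]
print(lower, upper, clean, sep="\n\n")
```

With these numbers the fences are -3.5 and 14.5 for `a`, and 5.5 and 23.5 for `b`, so rows 0 and 9 are dropped and rows 1 to 8 remain.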

Details:

First create a boolean DataFrame by chaining the two conditions with | (OR):

print (((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))))
        a      b      c
0   False   True  False
1   False  False  False
2    True  False  False
3   False  False  False
4   False  False  False
5   False  False  False
6   False  False  False
7   False  False  False
8   False  False  False
9   False  False  False
10   True  False  False
11  False   True  False
12  False   True  False
13  False  False  False
14  False   True  False
15  False  False  False
16  False  False  False
17  False  False  False
18  False   True  False
19  False  False  False

Then use DataFrame.any to check for at least one True per row, and finally invert the boolean mask with ~:

print (~((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).any(axis=1))
0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13     True
14    False
15     True
16     True
17     True
18    False
19     True
dtype: bool
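
The `.sum()` in the question counts outlier cells per column, which is not the same as the number of rows that the `any(axis=1)` mask removes. The sketch below illustrates the difference on a made-up DataFrame (`demo` and its values are hypothetical) in which one row is an outlier in both columns.

```python
import pandas as pd

# Hypothetical data: row 9 is an outlier in both columns.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [11, 12, 13, 14, 15, 16, 17, 18, 19, -50],
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
outlier = (demo < (q1 - 1.5 * iqr)) | (demo > (q3 + 1.5 * iqr))

per_column = outlier.sum()               # outlier cells counted per column
rows_flagged = outlier.any(axis=1)       # True for rows with at least one outlier
n_rows_dropped = int(rows_flagged.sum())
keep = ~rows_flagged                     # mask used to select the rows to keep

print(per_column)
print("rows dropped:", n_rows_dropped)
```

Each column reports one outlier, but only one row is dropped, because both outliers sit in the same row.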

An inverted solution with changed conditions: < becomes >= and > becomes <=, the conditions are chained with & for AND, and the final filter uses all to check that every value in the row is True:

print (((stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1))
0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13     True
14    False
15     True
16     True
17     True
18    False
19     True
dtype: bool


df = stepframe[((stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1)]
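
The ValueError in the question comes from passing all of stepframe.columns as col_name: df_in[col_name] is then a DataFrame, so the comparison produces a two-dimensional boolean mask that .loc cannot use as a row indexer. One possible way to wrap the filter above in a reusable function is sketched below. The name `remove_outliers_iqr`, its parameters, the restriction to numeric columns, and the choice to keep missing values are all assumptions made for illustration, not part of the original answer. The column names Day and Step_Count come from the question's output, but the values are invented.

```python
import pandas as pd


def remove_outliers_iqr(df, columns=None, k=1.5):
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any checked column.

    `columns` must be list-like (or None); hypothetical helper, not a pandas API.
    """
    if columns is None:
        # Assumption: only numeric columns are tested for outliers.
        columns = df.select_dtypes(include="number").columns
    sub = df[columns]
    q1 = sub.quantile(0.25)
    q3 = sub.quantile(0.75)
    iqr = q3 - q1
    inside = (sub >= (q1 - k * iqr)) & (sub <= (q3 + k * iqr))
    # Assumption: missing values are not treated as outliers.
    # Remove "| sub.isna()" to drop rows that contain NaN in a checked column.
    keep = (inside | sub.isna()).all(axis=1)
    return df[keep]


# Made-up values; a text column such as Day is skipped by default.
steps = pd.DataFrame({
    "Day": list("abcdefghij"),
    "Step_Count": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})
cleaned = remove_outliers_iqr(steps)
print(cleaned)
```

The quartiles are computed once on the full data, as in the answer above. Filtering one column at a time in a loop would recompute them on already reduced data and can give a different result.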
