pandas: How to count outliers for all columns in Python?

Question

Asked by Chasen Li

I have a dataset with three columns in a Python notebook. There seem to be too many outliers beyond 1.5 times the IQR. How can I count the outliers for all columns?

If there are too many outliers, I may consider removing the points that are outliers for more than one feature. If so, how can I count them that way?

Thanks!

Answer 1

Answered by ayhan

Similar to Romain X.'s answer, but operates on the DataFrame instead of Series.

Random data:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df.iloc[::10] += np.random.randn() * 2  # shift every 10th row; this hopefully introduces some outliers
df.head()
Out: 
          A         B         C         D         E
0  2.529517  1.165622  1.744203  3.006358  2.633023
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Quartile calculations:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

And these are the outlier counts for each column:

((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
Out: 
A    1
B    0
C    0
D    1
E    2
dtype: int64
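The same count can be wrapped in a small helper that also reports the share of outliers per column, which may help in judging whether there are "too many". This is only a minimal sketch: the function name iqr_outlier_summary, the parameter k, and the toy data are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def iqr_outlier_summary(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column.

    Hypothetical helper for illustration; `k` is the IQR multiplier.
    """
    numeric = df.select_dtypes(include="number")  # quantiles need numeric data
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index
    # with the DataFrame's column labels.
    mask = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return pd.DataFrame({"count": mask.sum(), "share": mask.mean()})


# Toy data (an assumption for the example): only 100 in column "x" is extreme.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})
summary = iqr_outlier_summary(toy)
print(summary)
```

For the toy data, column x has Q1 = 2 and Q3 = 4, so the fences are -1 and 7 and only the value 100 is counted.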

These counts are in line with seaborn's calculations.

Note that the part before .sum(), namely (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)), is a boolean mask, so you can use it directly to remove outliers. For example, this sets them to NaN:

mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df[mask] = np.nan
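For the second part of the question, summing the mask across columns gives, for each row, the number of features in which that row is an outlier. The following is a minimal sketch, assuming the same random data as above; the names n_outlier_cols, multi, and cleaned are illustrative. It builds the mask on a fresh DataFrame, before any values have been replaced with NaN.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list("ABCDE"))
df.iloc[::10] += np.random.randn() * 2

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))

# Number of columns in which each row is flagged as an outlier
n_outlier_cols = mask.sum(axis=1)

# Rows that are outliers for more than one feature
multi = n_outlier_cols > 1
print(multi.sum())  # how many such rows there are

# Keep only the rows flagged in at most one column
cleaned = df[~multi]
```

Changing the threshold in n_outlier_cols > 1 adjusts how strict the row filter is; n_outlier_cols > 0 would drop every row that contains any outlier.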
