Original URL: http://stackoverflow.com/questions/39068214/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to count outliers for all columns in Python?
Asked by Chasen Li
I have a dataset with three columns in a Python notebook. There seem to be too many outliers beyond 1.5 times the IQR. How can I count the outliers for all columns?
If there are too many outliers, I may consider removing the points that are outliers for more than one feature. If so, how can I count them that way?
Thanks!
Answered by ayhan
Similar to Romain X.'s answer, but operates on the DataFrame instead of a Series.
Random data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df.iloc[::10] += np.random.randn() * 2  # this hopefully introduces some outliers
df.head()
Out:
          A         B         C         D         E
0  2.529517  1.165622  1.744203  3.006358  2.633023
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755
Quartile calculations:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
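To make the quartile step concrete, here is a minimal sketch on a small hypothetical frame (the name toy and its values are made up for illustration, not taken from the question). It shows that quantile() returns one value per column, so the lower and upper 1.5 * IQR fences are also computed per column:

```python
import pandas as pd

# Hypothetical toy data: column "x" contains one obvious outlier (100).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})

q1 = toy.quantile(0.25)   # Series indexed by column name
q3 = toy.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr    # lower fence for each column
upper = q3 + 1.5 * iqr    # upper fence for each column

# Collect the fences side by side for inspection.
fences = pd.DataFrame({"lower": lower, "upper": upper})
print(fences)
```

Because the fences are Series indexed by column name, comparing the DataFrame against them lines each column up with its own fence.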
And these are the outlier counts for each column:
((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
Out:
A 1
B 0
C 0
D 1
E 2
dtype: int64
This is in line with seaborn's calculations.
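The per-column count shown earlier can be wrapped in a small helper if it needs to be reused. The sketch below is one possible way to do it, not part of the original answer: the function name count_iqr_outliers and the parameter k are hypothetical, and restricting the calculation to numeric columns with select_dtypes is an added assumption for frames that also hold text columns.

```python
import pandas as pd


def count_iqr_outliers(frame, k=1.5):
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")  # skip non-numeric columns
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return is_outlier.sum()


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 100],
    "y": [10, 11, 12, 13, 14],
    "label": list("abcde"),  # non-numeric column, ignored by the helper
})
counts = count_iqr_outliers(toy)
print(counts)
```

With k=1.5 this uses the same rule as the expression above; a larger k flags fewer points.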
Note that the part before the sum, (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)), is a boolean mask, so you can use it directly to remove outliers. For example, this sets them to NaN:
mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df[mask] = np.nan
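The question also asks about points that are outliers for more than one feature. The answer does not cover that case, but the same boolean mask can be summed across columns to get a per-row count. The following is a minimal sketch on hypothetical toy data (the names toy, per_row, n_multi, and cleaned are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: row 4 is extreme in "a" and "b", row 1 only in "c".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100, 5],
    "b": [10, 11, 12, 13, 500, 14],
    "c": [7, -90, 8, 9, 10, 11],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

per_row = mask.sum(axis=1)           # number of outlying features in each row
n_multi = int((per_row > 1).sum())   # rows that are outliers in more than one column
cleaned = toy[per_row <= 1]          # drop only those rows

print(per_row.tolist())
print(n_multi)
```

Compute the mask before any values are set to NaN. If the original frame should stay untouched, toy.mask(mask) returns a copy with the flagged cells replaced by NaN instead of modifying the data in place.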