pandas: How to remove outliers in Python?
Original source: http://stackoverflow.com/questions/54398554/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to remove Outliers in Python?
Asked by Stanislav Jirák
I want to remove outliers from my dataset "train", and I've decided to use either the z-score or the IQR to do it.
I'm running Jupyter notebook on Microsoft Python Client for SQL Server.
I've tried for z-score:
import numpy as np
from scipy import stats
train[(np.abs(stats.zscore(train)) < 3).all(axis=1)]
for IQR:
Q1 = train.quantile(0.02)
Q3 = train.quantile(0.98)
IQR = Q3 - Q1
train = train[~((train < (Q1 - 1.5 * IQR)) | (train > (Q3 + 1.5 * IQR))).any(axis=1)]
...which returns...
for z-score:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
for IQR:
TypeError: unorderable types: str() < float()
My train dataset looks like:
# Number of each type of column
print('Training data shape: ', train.shape)
train.dtypes.value_counts()
Training data shape: (300000, 111)
int32      66
float64    30
object     15
dtype: int64
Help would be appreciated.
Answered by Sergey Bushmanov
Your code fails because you're trying to calculate z-scores on categorical (non-numeric) columns.
To avoid this, first split train into its numerical and categorical columns:
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])
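and only after that calculate a boolean index of the rows to keep, taking the absolute z-score so that unusually low values are caught as well as unusually high ones:
idx = np.all(np.abs(stats.zscore(num_train)) < 3, axis=1)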
and finally put the two pieces back together:
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
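For reference, here is a minimal end-to-end sketch of the z-score steps above. The toy frame and its column names (a, b, label) are made up for illustration and only stand in for train. Instead of concatenating the two halves, it applies the mask to the original frame, which should select the same rows while keeping the original column order.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for `train`: two numeric columns, one object column.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "label": ["x", "y"] * 100,
})
train.loc[0, "a"] = 50.0  # planted outlier

# Only numeric columns can be z-scored.
num_train = train.select_dtypes(include=["number"])

# Absolute z-score per column, computed on a plain NumPy array.
z = np.abs(stats.zscore(num_train.to_numpy(), axis=0))

# Keep a row only if every numeric value lies within 3 standard deviations.
keep = (z < 3).all(axis=1)

# The boolean mask filters the original frame, categorical columns included.
train_cleaned = train.loc[keep]
print(train.shape, "->", train_cleaned.shape)
```

Note that a missing value in a numeric column turns that column's z-scores into NaN, so every comparison fails and every row is dropped. It is worth checking for NaNs before relying on this mask.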
For the IQR part:
Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
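The IQR filter can also be wrapped in a small helper so the quantiles are easy to change. This is only a sketch: iqr_mask and the toy frame are hypothetical names, not part of pandas. The defaults below are the conventional quartiles (0.25 and 0.75), whereas the snippet above keeps the 0.02 and 0.98 quantiles from the question, which makes the fences much wider.

```python
import pandas as pd

def iqr_mask(num_df, lower_q=0.25, upper_q=0.75, k=1.5):
    """Return a boolean Series that is True for rows with no value outside the fences."""
    q_low = num_df.quantile(lower_q)
    q_high = num_df.quantile(upper_q)
    spread = q_high - q_low
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    outside = (num_df < (q_low - k * spread)) | (num_df > (q_high + k * spread))
    return ~outside.any(axis=1)

# Hypothetical toy data standing in for `train`.
train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10, 11, 12, 13, 14],
    "label": ["x", "y", "x", "y", "x"],
})
num_train = train.select_dtypes(include=["number"])

idx = iqr_mask(num_train)       # conventional quartiles
train_cleaned = train.loc[idx]  # the row holding 100.0 is dropped
print(train_cleaned)
```

On this toy frame, calling iqr_mask(num_train, 0.02, 0.98) keeps every row, including the one with 100.0, so the choice of quantiles has a large effect on what counts as an outlier.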
Please let us know if you have any further questions.
PS
You might also consider one more approach to dealing with outliers: pandas.DataFrame.clip, which caps each outlying value at a bound instead of dropping the whole row.
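A rough sketch of the clipping idea, assuming the same kind of fences as in the IQR part (here with the conventional 0.25 and 0.75 quartiles). The toy frame and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`.
train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
    "label": ["x", "y", "x", "y", "x"],
})
num_cols = train.select_dtypes(include=["number"]).columns

# Per-column fences built from the quartiles.
q1 = train[num_cols].quantile(0.25)
q3 = train[num_cols].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# axis=1 aligns the Series of bounds with the column labels.
train_clipped = train.copy()
train_clipped[num_cols] = train[num_cols].clip(lower=lower, upper=upper, axis=1)
print(train_clipped)
```

All rows survive. Only the out-of-range value in column a is pulled back to its upper fence, and the categorical column is left untouched.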