pandas: How to remove outliers in Python?

Question

Asked by Stanislav Jirák

I want to remove outliers from my dataset "train", and I've decided to use either z-score or IQR to do it.

I'm running Jupyter notebook on Microsoft Python Client for SQL Server.

I've tried for z-score:

from scipy import stats
train[(np.abs(stats.zscore(train)) < 3).all(axis=1)]

for IQR:

Q1 = train.quantile(0.02)
Q3 = train.quantile(0.98)
IQR = Q3 - Q1
train = train[~((train < (Q1 - 1.5 * IQR)) | (train > (Q3 + 1.5 * IQR))).any(axis=1)]

...which returns...

for z-score:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

for IQR:

TypeError: unorderable types: str() < float()

My train dataset looks like:

# Number of each type of column
print('Training data shape: ', train.shape)
train.dtypes.value_counts()

Training data shape: (300000, 111)
int32      66
float64    30
object     15
dtype: int64

Help would be appreciated.

Answer 1

Answered by Sergey Bushmanov

Your code fails because you're trying to calculate zscore on categorical (object) columns.

To avoid this, first split train into a part with numerical features and a part with categorical features:

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

and only then compute the index of rows to keep:

# take the absolute value so that large negative z-scores are dropped too
idx = np.all(np.abs(stats.zscore(num_train)) < 3, axis=1)

and finally add the two pieces together:

train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
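
The three steps above can be sketched end to end on a small frame. The data below is made up for illustration (the names toy, num_toy, cat_toy and keep are not from the question), and the sketch filters on the absolute z-score so that both tails are treated as outliers. It is one way to wire the steps together, not the only one.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 50 ordinary rows plus one row with an extreme value in "a".
toy = pd.DataFrame({
    "a": [float(i) for i in range(50)] + [1000.0],
    "b": np.linspace(0.0, 1.0, 51),
    "label": ["x"] * 51,          # object column that would break zscore
})

# Step 1: split into numerical and non-numerical parts.
num_toy = toy.select_dtypes(include=["number"])
cat_toy = toy.select_dtypes(exclude=["number"])

# Step 2: keep rows whose absolute z-score is below 3 in every numerical column.
# np.asarray guards against zscore returning either an ndarray or a DataFrame.
z = np.abs(np.asarray(stats.zscore(num_toy)))
keep = (z < 3).all(axis=1)

# Step 3: glue the two parts back together, using the same row mask for both.
toy_cleaned = pd.concat([num_toy.loc[keep], cat_toy.loc[keep]], axis=1)
print(toy_cleaned.shape)
```

Note that a column containing NaN makes every z-score in that column NaN, so such columns would need to be filled or excluded before applying this filter.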

For the IQR part:

Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
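
The code above keeps the asker's 0.02 and 0.98 quantiles. The conventional interquartile range uses the 0.25 and 0.75 quantiles, which gives much tighter fences. As a rough sketch, the same logic can be wrapped in a helper so that both choices are easy to compare; the function name iqr_keep_mask, its parameters, and the toy data are assumptions made for this example.

```python
import pandas as pd

def iqr_keep_mask(num_df, lower_q=0.25, upper_q=0.75, k=1.5):
    """Return a boolean Series, True for rows with no value outside the fences."""
    q_low = num_df.quantile(lower_q)
    q_high = num_df.quantile(upper_q)
    spread = q_high - q_low
    # Comparisons align the per-column quantiles with the DataFrame's columns.
    outside = (num_df < (q_low - k * spread)) | (num_df > (q_high + k * spread))
    return ~outside.any(axis=1)

# Hypothetical toy data: one extreme value in column "a".
toy = pd.DataFrame({
    "a": [float(i) for i in range(50)] + [1000.0],
    "b": [float(i) for i in range(51)],
    "label": ["x"] * 51,
})
num_toy = toy.select_dtypes(include=["number"])

keep = iqr_keep_mask(num_toy)                      # conventional quartiles
keep_wide = iqr_keep_mask(num_toy, 0.02, 0.98)     # quantiles used in the answer

# Indexing the full frame with the mask keeps the categorical columns as well.
toy_cleaned = toy.loc[keep]
print(len(toy), len(toy_cleaned))
```

Because the mask is a Series that shares the original index, it can be applied to the full frame directly, without splitting and concatenating.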

Please let us know if you have any further questions.

PS

You might also consider another approach to outliers, pandas.DataFrame.clip, which clips each outlying value to a threshold instead of dropping the whole row.
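
A minimal sketch of the clipping approach, assuming the 0.02 and 0.98 quantiles of each numerical column are used as the thresholds (that choice, like the toy data and variable names, is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data: one extreme value in column "a".
toy = pd.DataFrame({
    "a": [float(i) for i in range(50)] + [1000.0],
    "b": [float(i) for i in range(51)],
    "label": ["x"] * 51,
})
num_cols = toy.select_dtypes(include=["number"]).columns

# Per-column thresholds; each is a Series indexed by column name.
lower = toy[num_cols].quantile(0.02)
upper = toy[num_cols].quantile(0.98)

# axis=1 aligns the threshold Series with the columns, so every column
# is clipped to its own bounds. No rows are dropped.
toy_clipped = toy.copy()
toy_clipped[num_cols] = toy[num_cols].clip(lower=lower, upper=upper, axis=1)
print(toy_clipped[num_cols].agg(["min", "max"]))
```

Unlike the row filters, this keeps all rows, which can matter when outliers in one column would otherwise discard useful values in the others.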
