pandas Python: replacing outliers with the median

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/45386955/

Time: 2020-09-14 04:07:21  Source: igfitidea

Python: replacing outlier values with median values

python, pandas, numpy

Asked by user4943236

I have a pandas DataFrame that contains some outlier values. I would like to replace them with the median of the data, computed as if those outliers were not there.

id         Age
10236    766105
11993       288
9337        205
38189        88
35555        82
39443        75
10762        74
33847        72
21194        70
39450        70

So I want to replace all values > 75 with the median of the remaining data, i.e., the median of 70, 70, 72, 74, 75.

I'm trying to do the following:

  1. Replace all the values greater than 75 with 0.
  2. Replace the 0s with the median value.

But somehow, the code below is not working:

df['age'].replace(df.age>75,0,inplace=True)

Answered by Bharath

I think this is what you are looking for: you can use loc to assign the values, and then fill the NaN entries.

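import numpy as np

# Median of the non-outlier values only (Age <= 75, so 75 itself is kept)
median = df.loc[df['Age'] <= 75, 'Age'].median()
df.loc[df['Age'] > 75, 'Age'] = np.nan
df['Age'] = df['Age'].fillna(median)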
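
The two steps above can be tried end to end with the following minimal sketch. It rebuilds the sample data from the question and assumes that 75 itself counts as a normal value, as in the question's example (70, 70, 72, 74, 75). The `threshold` name and the cast to float are additions for illustration, not part of the original answer; the cast is there so the column can hold NaN without changing dtype partway through.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "id": [10236, 11993, 9337, 38189, 35555, 39443, 10762, 33847, 21194, 39450],
        "Age": [766105, 288, 205, 88, 82, 75, 74, 72, 70, 70],
    }
)

threshold = 75  # cut-off taken from the question (illustrative name)

# Cast to float first so the column can hold NaN.
df["Age"] = df["Age"].astype(float)

# Median of the values that are not outliers: 70, 70, 72, 74, 75.
median = df.loc[df["Age"] <= threshold, "Age"].median()

# Step 1: mark the outliers as missing. Step 2: fill them with the median.
df.loc[df["Age"] > threshold, "Age"] = np.nan
df["Age"] = df["Age"].fillna(median)

print(median)  # 72.0
print(df["Age"].tolist())
```

Filling only the `Age` column, rather than calling `fillna` on the whole DataFrame, avoids overwriting missing values in unrelated columns.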

You can also use np.where in one line:

df["Age"] = np.where(df["Age"] > 75, median, df["Age"])

You can also use .mask, i.e.:

df["Age"] = df["Age"].mask(df["Age"] > 75, median)
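
As a rough check that the np.where and .mask variants agree, here is a small self-contained sketch. The helper names `replace_outliers_np_where` and `replace_outliers_mask` are hypothetical and introduced only for this illustration, and the median is again taken over the values at or below the threshold, which is an assumption based on the question's example.

```python
import numpy as np
import pandas as pd


def replace_outliers_np_where(ages: pd.Series, threshold: float) -> pd.Series:
    """Replace values above threshold with the median of the rest, via np.where."""
    median = ages[ages <= threshold].median()
    # np.where returns a plain ndarray, so wrap it back into a Series.
    return pd.Series(
        np.where(ages > threshold, median, ages), index=ages.index, name=ages.name
    )


def replace_outliers_mask(ages: pd.Series, threshold: float) -> pd.Series:
    """Same replacement, using Series.mask."""
    median = ages[ages <= threshold].median()
    return ages.mask(ages > threshold, median)


# Age column from the question.
ages = pd.Series([766105, 288, 205, 88, 82, 75, 74, 72, 70, 70], name="Age")

via_where = replace_outliers_np_where(ages, 75)
via_mask = replace_outliers_mask(ages, 75)

print(via_where.tolist())
print((via_where == via_mask).all())  # True
```

Note that np.where returns a NumPy array rather than a Series, which is fine when assigning straight back to a DataFrame column but loses the index otherwise.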

Answered by behnamoh

A more general solution I've tried lately: instead of the hard-coded 75, use the median of the whole column as the threshold, and then follow a solution similar to what Bharath suggested:

median = float(df['Age'].median())
df["Age"] = np.where(df["Age"] > median, median, df['Age'])
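
To see what this variant does on the question's data, the sketch below applies it to the sample Age column. Because the threshold is the median of the whole column, every value above that median is replaced, so roughly the upper half of the data is capped rather than only the extreme values. The `capped` name is illustrative.

```python
import numpy as np
import pandas as pd

# Age column from the question.
ages = pd.Series([766105, 288, 205, 88, 82, 75, 74, 72, 70, 70], name="Age")

# Median of the whole column, outliers included: (75 + 82) / 2.
median = float(ages.median())

# Every value above the median is replaced by the median.
capped = pd.Series(np.where(ages > median, median, ages), index=ages.index, name="Age")

print(median)  # 78.5
print(capped.tolist())
```

On this sample, 82 and 88 are changed as well as the three extreme values, which differs from the result the question asks for (a replacement value of 72).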