pandas Python: replacing outlier values with the median
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45386955/
Asked by user4943236
I have a pandas DataFrame that contains some outlier values. I would like to replace them with the median of the data, computed as if those outliers were not there.
id Age
10236 766105
11993 288
9337 205
38189 88
35555 82
39443 75
10762 74
33847 72
21194 70
39450 70
So, I want to replace all values greater than 75 with the median of the remaining values, i.e., the median of 70, 70, 72, 74, 75.
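To make the target concrete, the expected result can be worked out by hand. The following is a minimal sketch using only the standard library; the names `ages`, `threshold`, `remaining`, `replacement`, and `expected` are illustrative and do not come from the question.

```python
from statistics import median

# Age column copied from the table in the question
ages = [766105, 288, 205, 88, 82, 75, 74, 72, 70, 70]
threshold = 75  # cutoff stated in the question

# Values that stay in the data once the outliers are set aside
remaining = [a for a in ages if a <= threshold]

# Median of the remaining values (75, 74, 72, 70, 70)
replacement = median(remaining)

# Every value above the threshold is swapped for that median
expected = [a if a <= threshold else replacement for a in ages]
print(expected)
```

The five values above 75 are each replaced by the same number, and the five remaining values are left untouched.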
I'm trying to do the following:
- Replace all values greater than 75 with 0.
- Replace the 0s with the median value.
But somehow, the code below is not working:
df['age'].replace(df.age>75,0,inplace=True)
Answered by Bharath
I think this is what you are looking for: you can use loc to assign NaN to the outliers, and then fill the NaN values with the median.
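import numpy as np
# Median of the non-outlier values; <= keeps 75 itself among the remaining data
median = df.loc[df['Age'] <= 75, 'Age'].median()
df.loc[df['Age'] > 75, 'Age'] = np.nan  # mark the outliers as missing
df['Age'] = df['Age'].fillna(median)  # fill only the Age column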
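The loc-then-fillna approach above can be tried end to end on the data from the question. This is a minimal sketch, not a definitive implementation: it assumes the `id` values are used as the index, and it converts `Age` to float up front (an added step, not part of the answer) so that NaN can be stored in the column directly. The names `threshold` and `is_outlier` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the id values are used as the index (assumption)
ids = [10236, 11993, 9337, 38189, 35555, 39443, 10762, 33847, 21194, 39450]
df = pd.DataFrame(
    {"Age": [766105, 288, 205, 88, 82, 75, 74, 72, 70, 70]},
    index=pd.Index(ids, name="id"),
)

threshold = 75  # cutoff taken from the question

# Store Age as float so NaN fits in the column without a dtype change
df["Age"] = df["Age"].astype(float)

is_outlier = df["Age"] > threshold
median = df.loc[~is_outlier, "Age"].median()  # median of 75, 74, 72, 70, 70

df.loc[is_outlier, "Age"] = np.nan        # mark outliers as missing
df["Age"] = df["Age"].fillna(median)      # replace them with the median

print(df)
```

Computing the median before the outliers are overwritten matters here: once the NaN values are filled, the original outlier positions can no longer be told apart from the rest.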
You can also use np.where in one line:
df["Age"] = np.where(df["Age"] > 75, median, df["Age"])
You can also use .mask, i.e.:
df["Age"] = df["Age"].mask(df["Age"] > 75, median)
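The np.where and .mask variants should give the same values, which can be checked side by side. The sketch below assumes the `Age` data from the question held as a float Series; the names `ages`, `is_outlier`, `via_where`, and `via_mask` are illustrative.

```python
import numpy as np
import pandas as pd

# Age values from the question, stored as float for a consistent dtype
ages = pd.Series(
    [766105, 288, 205, 88, 82, 75, 74, 72, 70, 70], dtype="float64", name="Age"
)

threshold = 75
is_outlier = ages > threshold
median = ages[~is_outlier].median()  # median of the non-outlier values

# np.where returns a plain NumPy array, so wrap it back into a Series
via_where = pd.Series(np.where(is_outlier, median, ages), index=ages.index, name="Age")

# Series.mask replaces the entries where the condition is True
via_mask = ages.mask(is_outlier, median)

print(via_where.equals(via_mask))
```

One practical difference: np.where drops the pandas index and returns an array, while .mask keeps the result as a Series with the original index.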
Answered by behnamoh
A more general solution I've tried lately: instead of the hard-coded 75, use the median of the whole column as the threshold, and then follow a solution similar to what Bharath suggested. Note that this replaces every value above the column median, not only the extreme ones:
median = float(df['Age'].median())  # median of the whole column, outliers included
df['Age'] = np.where(df['Age'] > median, median, df['Age'])  # cap values above the median
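To see what this variant does on the data from the question, here is a minimal sketch. It assumes the same ten `Age` values held as a float Series; the names `ages`, `whole_median`, and `capped` are illustrative.

```python
import numpy as np
import pandas as pd

# Age values from the question
ages = pd.Series(
    [766105, 288, 205, 88, 82, 75, 74, 72, 70, 70], dtype="float64", name="Age"
)

# Median of all ten values, outliers included: the midpoint of 75 and 82
whole_median = float(ages.median())

# Every value above the whole-column median is replaced by that median
capped = pd.Series(
    np.where(ages > whole_median, whole_median, ages), index=ages.index, name="Age"
)

print(whole_median)
print(capped.tolist())
```

On this sample the replacement value is the whole-column median rather than the median of the values at or below 75, so the result differs from the one asked for in the question. Because the threshold is the median itself, roughly the upper half of any column is overwritten, which may or may not be what is wanted.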