Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/20887194/
python pandas How to remove outliers from a dataframe and replace with an average value of preceding records
Asked by IcemanBerlin
I have a dataframe with 16k records, covering multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules.
For example, in the data below, how could I identify the skewed points (any value greater than 1) and replace them with the average of the next two records, or with the previous record if there are no later records in that group?
So in the dataframe below I would like to replace the Bill%4 value of 1.21 for IT week1 with the average of week2 and week3 for IT, which is 0.81.
Any tricks for this?
Country  Week   Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT       week1  0.94    0.88    0.85    1.21    0.77    0.75
IT       week2  0.93    0.88    1.25    0.80    0.77    0.72
IT       week3  0.94    1.33    0.85    0.82    0.76    0.76
IT       week4  1.39    0.89    0.86    0.80    0.80    0.76
FR       week1  0.92    0.86    0.82    1.18    0.75    0.73
FR       week2  0.91    0.86    1.22    0.78    0.75    0.71
FR       week3  0.92    1.29    0.83    0.80    0.75    0.75
FR       week4  1.35    0.87    0.84    0.78    0.78    0.74
Answered by tnknepp
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col_pos, col in enumerate(df.columns):
    # Identify index labels of the values above the cutoff
    outliers = df[col][df[col] > cutoff]

    # Walk through the outliers and average according to row position
    for idx in outliers.index:
        # Get the integer position of this label
        loc = df.index.get_loc(idx)

        # If not one of the last two rows, average the next two values
        if loc < df.shape[0] - 2:
            df.iloc[loc, col_pos] = df.iloc[loc + 1:loc + 3, col_pos].mean()
        # Otherwise average the two preceding values
        else:
            df.iloc[loc, col_pos] = df.iloc[loc - 2:loc, col_pos].mean()
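For the grouped data in the question, the same idea could be adapted along the lines below. This is only a sketch under some assumptions that are not in the original posts: the helper name replace_outliers_by_group is made up for illustration, rows are assumed to be already sorted by week within each country, the index is assumed to be unique, neighbouring values that are themselves outliers are ignored when averaging, and when only one later record exists that single record is used.

```python
import io

import pandas as pd

# Sample data copied from the question.
raw = """Country Week Bill%1 Bill%2 Bill%3 Bill%4 Bill%5 Bill%6
IT week1 0.94 0.88 0.85 1.21 0.77 0.75
IT week2 0.93 0.88 1.25 0.80 0.77 0.72
IT week3 0.94 1.33 0.85 0.82 0.76 0.76
IT week4 1.39 0.89 0.86 0.80 0.80 0.76
FR week1 0.92 0.86 0.82 1.18 0.75 0.73
FR week2 0.91 0.86 1.22 0.78 0.75 0.71
FR week3 0.92 1.29 0.83 0.80 0.75 0.75
FR week4 1.35 0.87 0.84 0.78 0.78 0.74
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
value_cols = [c for c in df.columns if c.startswith("Bill%")]


def replace_outliers_by_group(frame, group_col, cols, cutoff=1.0):
    """Hypothetical helper: replace values above `cutoff` per group.

    Each outlier becomes the mean of the next two records in its group,
    or the previous record when the group has no later records.
    """
    vals = frame[cols]
    # Mask outliers as NaN so they never feed into a replacement value
    # (an assumption; the question does not cover adjacent outliers).
    valid = vals.where(vals <= cutoff)
    grouped = valid.groupby(frame[group_col])

    # Shifting inside each group keeps one country from leaking into another.
    next1 = grouped.shift(-1)
    next2 = grouped.shift(-2)
    prev1 = grouped.shift(1)

    # Stack both "next" frames and average per original row label;
    # NaNs are skipped, and rows with no later records stay NaN.
    later_mean = pd.concat([next1, next2]).groupby(level=0).mean()

    # Fall back to the previous record where there was nothing later.
    replacement = later_mean.fillna(prev1)

    out = frame.copy()
    out[cols] = vals.where(vals <= cutoff, replacement)
    return out


cleaned = replace_outliers_by_group(df, "Country", value_cols, cutoff=1.0)
print(cleaned)
```

With the sample data, Bill%4 for IT week1 comes out as the mean of 0.80 and 0.82, matching the 0.81 asked for in the question, and Bill%1 for IT week4 falls back to the week3 value because IT has no later records.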

