python pandas: How to remove outliers from a DataFrame and replace them with the average of neighboring records

Question

Asked by IcemanBerlin

I have a DataFrame of 16k records with multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules.

For example, in the data below, how could I identify the skewed points (any value greater than 1) and replace each one with the average of the next two records in that group, or with the previous record if there are no later records?

So in the DataFrame below, I would like to replace the Bill%4 value of 1.21 for IT week1 with the average of week2 and week3 for IT, which is 0.81.

Any tricks for this?

Country Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT     week1    0.94    0.88    0.85    1.21    0.77    0.75
IT     week2    0.93    0.88    1.25    0.80    0.77    0.72
IT     week3    0.94    1.33    0.85    0.82    0.76    0.76
IT     week4    1.39    0.89    0.86    0.80    0.80    0.76
FR     week1    0.92    0.86    0.82    1.18    0.75    0.73
FR     week2    0.91    0.86    1.22    0.78    0.75    0.71
FR     week3    0.92    1.29    0.83    0.80    0.75    0.75
FR     week4    1.35    0.87    0.84    0.78    0.78    0.74

Answer 1

Answered by tnknepp

I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?

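import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Positional index of this column, used with .iloc below
    col_pos = df.columns.get_loc(col)

    # Identify values above cutoff (keeps their index labels)
    outliers = df[col][df[col] > cutoff]

    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get positional location of this label
        loc = df.index.get_loc(idx)

        # .iloc writes directly into df and avoids chained assignment.
        # Note: the neighbours being averaged may themselves be outliers.
        if loc < df.shape[0] - 2:
            # Not one of the last two rows: average the next two rows
            df.iloc[loc, col_pos] = df.iloc[loc+1:loc+3, col_pos].mean()
        else:
            # One of the last two rows: average the previous two rows
            df.iloc[loc, col_pos] = df.iloc[loc-2:loc, col_pos].mean()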
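
The loop above works on the whole frame at once, so it would average across country boundaries. To apply the rule from the question within each country, something like the sketch below might work. It is only a sketch under a few assumptions: the rows are already sorted by week within each country; when only one later record exists, that single value is used (the question does not say what should happen in that case); and neighbours are taken from the original values, not from already-replaced ones. The names `replace_outliers` and `cleaned` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the table in the question
df = pd.DataFrame({
    "Country": ["IT"] * 4 + ["FR"] * 4,
    "Week": ["week1", "week2", "week3", "week4"] * 2,
    "Bill%1": [0.94, 0.93, 0.94, 1.39, 0.92, 0.91, 0.92, 1.35],
    "Bill%2": [0.88, 0.88, 1.33, 0.89, 0.86, 0.86, 1.29, 0.87],
    "Bill%3": [0.85, 1.25, 0.85, 0.86, 0.82, 1.22, 0.83, 0.84],
    "Bill%4": [1.21, 0.80, 0.82, 0.80, 1.18, 0.78, 0.80, 0.78],
    "Bill%5": [0.77, 0.77, 0.76, 0.80, 0.75, 0.75, 0.75, 0.78],
    "Bill%6": [0.75, 0.72, 0.76, 0.76, 0.73, 0.71, 0.75, 0.74],
})
value_cols = [c for c in df.columns if c.startswith("Bill%")]


def replace_outliers(series, cutoff=1.0):
    """Hypothetical helper: fix values above cutoff in one group's column."""
    values = series.to_numpy(dtype=float)
    result = values.copy()
    for i in np.flatnonzero(values > cutoff):
        following = values[i + 1:i + 3]      # up to two later records
        if following.size > 0:
            result[i] = following.mean()     # assumption: use what is available
        elif i > 0:
            result[i] = values[i - 1]        # no later records: previous record
    return pd.Series(result, index=series.index)


cleaned = df.copy()
for col in value_cols:
    # transform calls the helper once per country and keeps row order
    cleaned[col] = df.groupby("Country")[col].transform(replace_outliers)

print(cleaned.round(3))
```

With the data from the question, Bill%4 for IT week1 comes out as the mean of 0.80 and 0.82, i.e. 0.81, which matches what was asked for. A group whose only record is an outlier is left unchanged here, since there is nothing to average.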
