Filtering out outliers in a Pandas dataframe with a rolling median

Source: Stack Overflow, original question at http://stackoverflow.com/questions/46964363/. The content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.

Time: 2020-09-14 04:42:05  Source: igfitidea

Tags: pandas, median, outliers, rolling-computation

Asked by p0ps1c1e

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.

I'm trying to use df.rolling to compute a median and standard deviation for each window, and then remove a point if it lies more than 3 standard deviations from the median.

However, I can't figure out a way to loop through the column and compare each value with its rolling median.

Here is the code I have so far:

import pandas as pd
import numpy as np

def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its rolling median (this is the part I am stuck on)
        pass

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=['a', 'b'])

median_filter(df, 10)

How can I loop through the column, compare each point with its rolling median, and remove the outliers?

Answered by DJK

Just filter the dataframe:

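window = 10  # rolling window size, as in the question

# rolling median and standard deviation of column 'b'
df['median'] = df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()

# keep only rows within 3 rolling standard deviations of the rolling median
df = df[(df.b <= df['median'] + 3 * df['std']) & (df.b >= df['median'] - 3 * df['std'])]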
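
To see how this kind of filter behaves, here is a minimal sketch on a small hand-made series with one obvious spike. The toy data and the helper name rolling_median_filter are assumptions made for illustration and are not part of the answer.

```python
import pandas as pd

def rolling_median_filter(df, column, window, num_std=3):
    """Keep rows whose value lies within num_std rolling standard
    deviations of the rolling median (hypothetical helper name)."""
    median = df[column].rolling(window).median()
    std = df[column].rolling(window).std()
    # NaN bounds (the first window - 1 rows) compare as False, so those rows are dropped
    mask = (df[column] - median).abs() <= num_std * std
    return df[mask]

# toy data (assumed): alternating 10/11 with a single spike at index 15
values = [10, 11] * 10
values[15] = 500
df = pd.DataFrame({'b': values})

filtered = rolling_median_filter(df, 'b', window=10, num_std=3)
print(filtered.index.tolist())
```

Two side effects are worth knowing. The first window - 1 rows are always dropped, because their rolling statistics are NaN. Also, the spike is part of its own window and inflates that window's standard deviation: for a single spike on an otherwise flat series it sits roughly sqrt(window) standard deviations from the median, so a 3-sigma threshold needs a window of about 10 or more to catch it.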

Answered by ako

There might well be a more pandastic way to do this. This is a bit of a hack, relying on a somewhat manual way of mapping the original df's index to each rolling window (I picked a window size of 6). The records up to and including row 6 are associated with the first window; row 7 is the second window, and so on.

n = 100
df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])

## set window size
window=6
std = 1  # I set it at just 1; with real data and larger windows, can be larger

## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})

bounds['upper']=bounds['median']+bounds['std']*std
bounds['lower']=bounds['median']-bounds['std']*std

## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))

## then we can assign the original 'b' value back to the bounds df
bounds['b']=df['b']

## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
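
The same bounds can also be applied as a boolean mask on the original df, which keeps column 'a' and does not need a window_id column. The following is only a sketch: the fixed random seed and the names lower, upper, mask and filtered are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
n, window, num_std = 100, 6, 1
df = pd.DataFrame(rng.integers(0, n, size=(n, 2)), columns=['a', 'b'])

# rolling bounds share df's index, so they align row by row with column 'b'
rolling = df['b'].rolling(window)
lower = rolling.median() - num_std * rolling.std()
upper = rolling.median() + num_std * rolling.std()

# strict comparison, mirroring lower < b < upper; NaN bounds compare as False
mask = (df['b'] > lower) & (df['b'] < upper)
filtered = df[mask]
```

Because the mask is index-aligned with df, the result keeps every original column for the rows that pass the test.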

Answered by Tomas Olsson

This is my take on creating a median filter:

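def median_filter(num_std=3):
    # returns a function for rolling().apply() that tests the last value of each window
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]  # last value in the window, i.e. the current point
        # keep the point if it is within num_std standard deviations of the window median
        return s if _median - num_std * _std <= s <= _median + num_std * _std else np.nan
    return _median_filter

# outliers come back as NaN; 'b' and window are the column and window size from the question
df['b'].rolling(window).apply(median_filter(num_std=3), raw=True)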
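
The rolling apply only marks outliers as NaN, so removing them takes one more step. A minimal sketch, assuming toy data with a single spike and the illustrative names b_filtered and cleaned:

```python
import numpy as np
import pandas as pd

def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)  # population standard deviation (ddof=0)
        s = x[-1]         # last value in the window
        return s if _median - num_std * _std <= s <= _median + num_std * _std else np.nan
    return _median_filter

# toy data (assumed): alternating 10/11 with a single spike at index 15
values = [10.0, 11.0] * 10
values[15] = 500.0
df = pd.DataFrame({'b': values})
window = 10

# NaN marks both outliers and the first window - 1 rows (incomplete windows)
df['b_filtered'] = df['b'].rolling(window).apply(median_filter(num_std=3), raw=True)

# dropping the NaN rows removes the outliers from the frame
cleaned = df.dropna(subset=['b_filtered'])
print(cleaned.index.tolist())
```

Note that np.std defaults to ddof=0, while pandas' rolling().std() uses ddof=1, so the thresholds here are slightly tighter than in the answers that use rolling().std().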