Filtering out outliers in a Pandas dataframe using a rolling median

Question

Asked by p0ps1c1e

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.

I'm trying to use df.rolling to compute a median and standard deviation for each window, and then remove a point if it lies more than 3 standard deviations from the median.

However, I can't figure out a way to loop through the column and compare each value with its rolling median.

Here is the code I have so far:

import pandas as pd
import numpy as np

def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its median
        pass  # not implemented yet

df = pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns = ['a', 'b'])

median_filter(df, 10)

How can I loop through the column, compare each point with its rolling median, and remove the outliers?

Answer 1

Answered by DJK

Just filter the dataframe:

df['median']= df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()

# keep only rows within 3 rolling standard deviations of the rolling median
df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
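
The same filter can be wrapped in a function that leaves the original dataframe untouched. This is only a sketch: the name rolling_median_filter, its parameters, and the synthetic data are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

def rolling_median_filter(df, column, window, num_std=3):
    """Return the rows of df whose `column` value lies within num_std
    rolling standard deviations of the rolling median (hypothetical helper)."""
    median = df[column].rolling(window).median()
    std = df[column].rolling(window).std()
    within = (df[column] - median).abs() <= num_std * std
    # Comparisons against NaN are False, so the first window-1 rows
    # (which have no rolling statistics yet) are dropped as well.
    return df[within]

# Synthetic data: noise around 50 with one obvious spike at row 60.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": np.arange(100), "b": rng.normal(50, 1, size=100)})
df.loc[60, "b"] = 500.0

filtered = rolling_median_filter(df, "b", window=20)
```

Note that an outlier also inflates the standard deviation of its own window, so with small windows a 3-standard-deviation threshold may rarely be exceeded; a larger window or a smaller num_std may be needed.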

Answer 2

Answered by ako

There may well be a more pandastic way to do this. This is a bit of a hack, relying on a somewhat manual way of mapping the original df's index to each rolling window (I picked a window size of 6). The records up to and including row 6 are associated with the first window; row 7 belongs to the second window, and so on.

import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])

## set window size
window=6
std = 1  # I set it at just 1; with real data and larger windows, can be larger

## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})

bounds['upper']=bounds['median']+bounds['std']*std
bounds['lower']=bounds['median']-bounds['std']*std

## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))

## then we can assign the original 'b' value back to the bounds df
bounds['b']=df['b']

## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
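
As a side note on the window-to-row mapping described above: rolling results are aligned to the original index, and by default each value summarises the trailing window that ends at that row. The sketch below checks this by hand on made-up data; the row position i is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"b": rng.integers(0, 100, size=30)})
window = 6

rolling_median = df["b"].rolling(window).median()

# The statistic stored at position i covers rows i-window+1 .. i.
i = 10  # arbitrary row, chosen only for this check
manual = df["b"].iloc[i - window + 1 : i + 1].median()
matches = rolling_median.iloc[i] == manual

# Rows before the first full window have no statistic yet (NaN).
warmup_is_nan = rolling_median.iloc[: window - 1].isna().all()
```

Because of this alignment, the bounds can be compared with df['b'] row by row without a separate window identifier.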

Answer 3

Answered by Tomas Olsson

This is my take on creating a median filter:

def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]
        return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
    return _median_filter

# outliers come back as NaN; 'b' is the column from the question, window is the window size
df['b'].rolling(window).apply(median_filter(num_std=3), raw=True)
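
A runnable variant of this approach on synthetic data might look as follows. It is a sketch under stated assumptions: the ddof parameter is an addition not present in the answer (np.std defaults to the population standard deviation, ddof=0, whereas Series.rolling(...).std() uses ddof=1), and the series s with its injected spike is made up for the demonstration.

```python
import numpy as np
import pandas as pd

def median_filter(num_std=3, ddof=1):
    """Build a function for rolling.apply that returns the last value of the
    window, or NaN if it is more than num_std standard deviations from the
    window median. ddof is an assumed extra parameter (see lead-in)."""
    def _median_filter(x):
        median = np.median(x)
        std = np.std(x, ddof=ddof)
        last = x[-1]
        return last if abs(last - median) <= num_std * std else np.nan
    return _median_filter

# Synthetic series: standard normal noise with one spike at position 150.
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(0, 1, size=200))
s.iloc[150] = 100.0

window = 20
cleaned = s.rolling(window).apply(median_filter(num_std=3), raw=True)

# Drop the rows that were flagged (or that had no full window yet).
kept = s[cleaned.notna()]
```

With raw=True each window is passed as a NumPy array, which is why x[-1] refers to the most recent value in the window.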
