Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/47164950/

Time: 2020-09-14 04:44:37  Source: igfitidea

Compute rolling z-score in pandas dataframe

Tags: python, pandas

Asked by user308827

Is there an open-source function to compute a moving z-score, like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the standard deviation, but I want to see whether it can be extended to compute rolling z-scores.

Answered by unutbu

rolling.apply with a custom function is significantly slower than the built-in rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z

According to the definition given on this page, the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
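
To make the effect of shift(1) concrete, here is a small sketch that checks one z-score by hand against the points preceding it. The sample values and the window size of 3 are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    return (x - m) / s

# Hypothetical sample data, not from the original answer.
x = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
z = zscore(x, window=3)

# The z-score at position 3 uses only the three points before it (positions 0..2).
prev = x.iloc[0:3].to_numpy()
by_hand = (x.iloc[3] - prev.mean()) / prev.std()  # NumPy's std defaults to ddof=0

# The first `window` entries are NaN because no full preceding window exists yet.
n_missing = int(z.isna().sum())

print(z)
print(by_hand, n_missing)
```

Without the shift, the window would include the current point itself, which pulls the mean toward the value being scored.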

Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in what is essentially a Python loop, the advantage of the Cythonized r.mean() and r.std() functions becomes more apparent as the loop grows. Thus, the speed advantage of zscore increases with the length of the Series.

In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 μs per loop

In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop
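
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The series length of 2000 and the repeat count of 3 are arbitrary assumptions, raw=True is passed so that each window reaches the custom function as a NumPy array, and the absolute timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def zscore(x, window):
    r = x.rolling(window=window)
    return (x - r.mean().shift(1)) / r.std(ddof=0).shift(1)

def zscore_using_apply(x, window):
    def zscore_func(w):
        # w holds `window` earlier points followed by the current point
        return (w[-1] - w[:-1].mean()) / w[:-1].std(ddof=0)
    return x.rolling(window=window + 1).apply(zscore_func, raw=True)

np.random.seed(2017)
N = 5
x = pd.Series((np.random.random(2000) - 0.5).cumsum())

# Total seconds for 3 calls of each implementation.
t_builtin = timeit.timeit(lambda: zscore(x, N), number=3)
t_apply = timeit.timeit(lambda: zscore_using_apply(x, N), number=3)
print(f"built-in rolling: {t_builtin:.4f} s, rolling.apply: {t_apply:.4f} s")

# Both approaches should agree wherever a full window is available.
same = np.allclose(zscore(x, N), zscore_using_apply(x, N), equal_nan=True)
print(same)
```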

This is the setup used for the benchmark:

import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z


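def zscore_using_apply(x, window):
    def zscore_func(w):
        # w holds `window` earlier points followed by the current point
        return (w[-1] - w[:-1].mean())/w[:-1].std(ddof=0)
    # raw=True passes each window as a NumPy array, so w[-1] is positional
    return x.rolling(window=window+1).apply(zscore_func, raw=True)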

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())

result = zscore(x, N)
alt = zscore_using_apply(x, N)

assert not ((result - alt).abs() > 1e-8).any()

Answered by Varun

Let us say you have a data frame called data (the original post showed its contents in an image, which is not reproduced here).

Then run the following code:

data_zscore=data.apply(lambda x: (x-x.expanding().mean())/x.expanding().std())

Note that the first row will always contain NaN values, because a standard deviation cannot be computed from a single observation. (The original post showed the resulting output in an image.)
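
Since the image of the data frame is not available, here is a minimal sketch of the same one-liner on a small stand-in frame. The column names and values are hypothetical. Note that expanding() grows the window from the first row up to and including the current row, so this yields an expanding z-score rather than a fixed-width rolling one.

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame shown in the original post.
data = pd.DataFrame({
    "a": [1.0, 2.0, 4.0, 7.0],
    "b": [10.0, 10.5, 9.5, 12.0],
})

# Expanding z-score per column: each row is scored against all rows up to and including it.
data_zscore = data.apply(lambda x: (x - x.expanding().mean()) / x.expanding().std())

# The first row is NaN in every column: the sample std of one value is undefined.
first_row_all_nan = bool(data_zscore.iloc[0].isna().all())

print(data_zscore)
print(first_row_all_nan)
```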

Answered by deltascience

You should use native functions of pandas:

 # Compute rolling zscore for column ="COL" and window=window
 col_mean = df["COL"].rolling(window=window).mean()
 col_std = df["COL"].rolling(window=window).std()

 df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std
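
The snippet above leaves df and window undefined, so here is a self-contained sketch with hypothetical values for both. Two differences from the first answer are worth keeping in mind: there is no shift(1), so each window includes the current point, and std() uses the pandas default of ddof=1 (sample standard deviation) rather than ddof=0.

```python
import pandas as pd

# Hypothetical data and window size, for illustration only.
df = pd.DataFrame({"COL": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]})
window = 3

col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()  # pandas default: ddof=1

df["COL_ZSCORE"] = (df["COL"] - col_mean) / col_std

# Only the first window - 1 rows are NaN, because the window includes the current row.
n_missing = int(df["COL_ZSCORE"].isna().sum())

print(df)
print(n_missing)
```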