Computing a rolling z-score in a pandas DataFrame

Question

Asked by user308827

Is there an open source function to compute a moving z-score, like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the std, but want to see if it can be extended to compute rolling z-scores.

Answer 1

Answered by unutbu

rolling.apply with a custom function is significantly slower than using built-in rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z

According to the definition given on this page, the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
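
To make the effect of shift(1) concrete, here is a minimal sketch on a small, made-up Series (the data and the window size of 3 are assumptions chosen for illustration). It checks by hand that the z-score of the last point is built only from the three values that precede it:

```python
import numpy as np
import pandas as pd

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)        # mean of the `window` points before the current one
    s = r.std(ddof=0).shift(1)   # population std of those same points
    return (x - m) / s

# Hypothetical data: a smooth ramp followed by a jump.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
z = zscore(x, window=3)

# Manual check for the last point: only the three preceding values are used.
prev = x.iloc[1:4].to_numpy()                       # [2.0, 3.0, 4.0]
expected = (x.iloc[4] - prev.mean()) / prev.std()   # ndarray.std defaults to ddof=0

print(z)
print(expected)
```

Because of the shift, the first `window` entries of the result are NaN: a full window of earlier points is not available until position `window`.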

Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases. Thus, as the length of the Series increases, the speed advantage of zscore increases.

In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 μs per loop

In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop

This is the setup used for the benchmark:

import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z


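def zscore_using_apply(x, window):
    def zscore_func(x):
        # x holds window+1 values: the previous `window` points, then the current one
        return (x[-1] - x[:-1].mean())/x[:-1].std(ddof=0)
    # raw=True passes each window as a NumPy array, so x[-1] is positional
    return x.rolling(window=window+1).apply(zscore_func, raw=True)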

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())

result = zscore(x, N)
alt = zscore_using_apply(x, N)

assert not ((result - alt).abs() > 1e-8).any()

Answer 2

Answered by Varun

Let us say you have a data frame called data, which looks like this:

[image: the example data frame]

then you run the following code,

data_zscore=data.apply(lambda x: (x-x.expanding().mean())/x.expanding().std())

Please note that the first row will always have NaN values, because the sample standard deviation of a single observation is undefined.
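
Since the original screenshots are not available, here is a small sketch of the same one-liner on a made-up data frame (the column names and values are assumptions for illustration). Note that expanding() uses all rows up to and including the current one, so this is an expanding z-score rather than a fixed-window rolling one:

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame shown in the missing image.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 12.0, 11.0, 15.0],
})

# For each column: subtract the expanding mean, divide by the expanding std.
# expanding().std() uses ddof=1 by default, so the first row has no std.
data_zscore = data.apply(lambda x: (x - x.expanding().mean()) / x.expanding().std())

print(data_zscore)
```

The first row comes out as NaN in every column, as noted above; from the second row on, each value is scored against everything seen so far.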

Answer 3

Answered by deltascience

You should use native functions of pandas:

 # Compute rolling zscore for column ="COL" and window=window
 col_mean = df["COL"].rolling(window=window).mean()
 col_std = df["COL"].rolling(window=window).std()

 df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std
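
The snippet above leaves df and window undefined, so here is a self-contained sketch with made-up values for both (the data and window=3 are assumptions). Unlike Answer 1, this version includes the current point in its own window and relies on the default ddof=1 of rolling std:

```python
import pandas as pd

# Hypothetical data and window size, for illustration only.
df = pd.DataFrame({"COL": [1.0, 2.0, 3.0, 4.0, 10.0]})
window = 3

col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()   # sample std (ddof=1) by default

df["COL_ZSCORE"] = (df["COL"] - col_mean) / col_std

print(df)
```

Here the first window - 1 rows are NaN. Because the current value contributes to the mean and std it is compared against, a sudden jump such as the final 10.0 gets a smaller score than it would under the shifted definition in Answer 1, which may or may not be what you want for anomaly detection.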
