Python NumPy 版本的“指数加权移动平均线”,相当于 pandas.ewm().mean()

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/42869495/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 22:13:34  来源:igfitidea点击:

NumPy version of "Exponential weighted moving average", equivalent to pandas.ewm().mean()

pythonperformancepandasnumpyvectorization

提问by RaduS

How do I get the exponential weighted moving average in NumPy just like the following in pandas?

我如何在 NumPy 中获得指数加权移动平均线,就像熊猫中的以下内容一样?

import pandas as pd
import pandas_datareader as pdr
from datetime import datetime

# Declare variables
ibm = pdr.get_data_yahoo(symbols='IBM', start=datetime(2000, 1, 1), end=datetime(2012, 1, 1)).reset_index(drop=True)['Adj Close']
windowSize = 20

# Get PANDAS exponential weighted moving average
ewm_pd = pd.DataFrame(ibm).ewm(span=windowSize, min_periods=windowSize).mean().as_matrix()

print(ewm_pd)

I tried the following with NumPy

我用 NumPy 尝试了以下操作

import numpy as np
import pandas_datareader as pdr
from datetime import datetime

# From this post: http://stackoverflow.com/a/40085052/3293881 by @Divakar
def strided_app(a, L, S): # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

def numpyEWMA(price, windowSize):
    weights = np.exp(np.linspace(-1., 0., windowSize))
    weights /= weights.sum()

    a2D = strided_app(price, windowSize, 1)

    returnArray = np.empty((price.shape[0]))
    returnArray.fill(np.nan)
    for index in (range(a2D.shape[0])):
        returnArray[index + windowSize-1] = np.convolve(weights, a2D[index])[windowSize - 1:-windowSize + 1]
    return np.reshape(returnArray, (-1, 1))

# Declare variables
ibm = pdr.get_data_yahoo(symbols='IBM', start=datetime(2000, 1, 1), end=datetime(2012, 1, 1)).reset_index(drop=True)['Adj Close']
windowSize = 20

# Get NumPy exponential weighted moving average
ewma_np = numpyEWMA(ibm, windowSize)

print(ewma_np)

But the results are not similar as the ones in pandas.

但结果与熊猫中的结果并不相似。

Is there maybe a better approach to calculate the exponential weighted moving average directly in NumPy and get the exact same result as the pandas.ewm().mean()?

是否有更好的方法可以直接在 NumPy 中计算指数加权移动平均值并获得与 完全相同的结果pandas.ewm().mean()

At 60,000 requests on pandas solution, I get about 230 seconds. I am sure that with a pure NumPy, this can be decreased significantly.

在熊猫解决方案上有 60,000 个请求时,我得到大约 230 秒。我确信使用纯 NumPy 可以显着减少这种情况。

采纳答案by Jake Walden

Updated 08/06/2019

2019 年 8 月 6 日更新

PURE NUMPY, FAST & VECTORIZED SOLUTION FOR LARGE INPUTS

用于大输入的纯 NUMPY、快速和矢量化解决方案

outparameter for in-place computation, dtypeparameter, index orderparameter

out就地计算 dtype参数、参数、索引order参数

This function is equivalent to pandas' ewm(adjust=False).mean(), but much faster. ewm(adjust=True).mean()(the default for pandas) can produce different values at the start of the result. I am working to add the adjustfunctionality to this solution.

此函数等效于 pandas' ewm(adjust=False).mean(),但速度要快得多。ewm(adjust=True).mean()(pandas 的默认值)可以在结果的开头产生不同的值。我正在努力adjust向此解决方案添加功能。

@Divakar's answerleads to floating point precision problems when the input is too large. This is because (1-alpha)**(n+1) -> 0when n -> infand alpha -> 1, leading to divide-by-zero's and NaNvalues popping up in the calculation.

当输入太大时,@Divakar 的回答会导致浮点精度问题。这是因为(1-alpha)**(n+1) -> 0whenn -> infalpha -> 1,导致NaN计算中出现被零除和值。

Here is my fastest solution with no precision problems, nearly fully vectorized. It's gotten a little complicated but the performance is great, especially for really huge inputs. Without using in-place calculations (which is possible using the outparameter, saving memory allocation time): 3.62 seconds for 100M element input vector, 3.2ms for a 100K element input vector, and 293μs for a 5000 element input vector on a pretty old PC (results will vary with different alpha/row_sizevalues).

这是我最快的解决方案,没有精度问题,几乎完全矢量化。它变得有点复杂,但性能很棒,尤其是对于非常大的输入。不使用就地计算(可以使用out参数,节省内存分配时间):100M 元素输入向量为 3.62 秒,100K 元素输入向量为 3.2ms,5000 个元素输入向量在相当旧的 PC 上为 293μs (结果会因alpha/row_size值不同而有所不同)。

# tested with python3 & numpy 1.15.2
import numpy as np

def ewma_vectorized_safe(data, alpha, row_size=None, dtype=None, order='C', out=None):
    """
    Reshapes data before calculating EWMA, then iterates once over the rows
    to calculate the offset without precision issues
    :param data: Input data, will be flattened.
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param row_size: int, optional
        The row size to use in the computation. High row sizes need higher precision,
        low values will impact performance. The optimal value depends on the
        platform and the alpha being used. Higher alpha values require lower
        row size. Default depends on dtype.
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Defaults to 'C'.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the desired output. If not provided or `None`,
        a freshly-allocated array is returned.
    :return: The flattened result.
    """
    data = np.array(data, copy=False)

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float
    else:
        dtype = np.dtype(dtype)

    row_size = int(row_size) if row_size is not None 
               else get_max_row_size(alpha, dtype)

    if data.size <= row_size:
        # The normal function can handle this input, use that
        return ewma_vectorized(data, alpha, dtype=dtype, order=order, out=out)

    if data.ndim > 1:
        # flatten input
        data = np.reshape(data, -1, order=order)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    row_n = int(data.size // row_size)  # the number of rows to use
    trailing_n = int(data.size % row_size)  # the amount of data leftover
    first_offset = data[0]

    if trailing_n > 0:
        # set temporary results to slice view of out parameter
        out_main_view = np.reshape(out[:-trailing_n], (row_n, row_size))
        data_main_view = np.reshape(data[:-trailing_n], (row_n, row_size))
    else:
        out_main_view = out
        data_main_view = data

    # get all the scaled cumulative sums with 0 offset
    ewma_vectorized_2d(data_main_view, alpha, axis=1, offset=0, dtype=dtype,
                       order='C', out=out_main_view)

    scaling_factors = (1 - alpha) ** np.arange(1, row_size + 1)
    last_scaling_factor = scaling_factors[-1]

    # create offset array
    offsets = np.empty(out_main_view.shape[0], dtype=dtype)
    offsets[0] = first_offset
    # iteratively calculate offset for each row
    for i in range(1, out_main_view.shape[0]):
        offsets[i] = offsets[i - 1] * last_scaling_factor + out_main_view[i - 1, -1]

    # add the offsets to the result
    out_main_view += offsets[:, np.newaxis] * scaling_factors[np.newaxis, :]

    if trailing_n > 0:
        # process trailing data in the 2nd slice of the out parameter
        ewma_vectorized(data[-trailing_n:], alpha, offset=out_main_view[-1, -1],
                        dtype=dtype, order='C', out=out[-trailing_n:])
    return out

def get_max_row_size(alpha, dtype=float):
    assert 0. <= alpha < 1.
    # This will return the maximum row size possible on 
    # your platform for the given dtype. I can find no impact on accuracy
    # at this value on my machine.
    # Might not be the optimal value for speed, which is hard to predict
    # due to numpy's optimizations
    # Use np.finfo(dtype).eps if you  are worried about accuracy
    # and want to be extra safe.
    epsilon = np.finfo(dtype).tiny
    # If this produces an OverflowError, make epsilon larger
    return int(np.log(epsilon)/np.log(1-alpha)) + 1

The 1D ewma function:

一维 ewma 函数:

def ewma_vectorized(data, alpha, offset=None, dtype=None, order='C', out=None):
    """
    Calculates the exponential moving average over a vector.
    Will fail for large inputs.
    :param data: Input data
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param offset: optional
        The offset for the moving average, scalar. Defaults to data[0].
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Defaults to 'C'.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the input. If not provided or `None`,
        a freshly-allocated array is returned.
    """
    data = np.array(data, copy=False)

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float64
    else:
        dtype = np.dtype(dtype)

    if data.ndim > 1:
        # flatten input
        data = data.reshape(-1, order)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    if data.size < 1:
        # empty input, return empty array
        return out

    if offset is None:
        offset = data[0]

    alpha = np.array(alpha, copy=False).astype(dtype, copy=False)

    # scaling_factors -> 0 as len(data) gets large
    # this leads to divide-by-zeros below
    scaling_factors = np.power(1. - alpha, np.arange(data.size + 1, dtype=dtype),
                               dtype=dtype)
    # create cumulative sum array
    np.multiply(data, (alpha * scaling_factors[-2]) / scaling_factors[:-1],
                dtype=dtype, out=out)
    np.cumsum(out, dtype=dtype, out=out)

    # cumsums / scaling
    out /= scaling_factors[-2::-1]

    if offset != 0:
        offset = np.array(offset, copy=False).astype(dtype, copy=False)
        # add offsets
        out += offset * scaling_factors[1:]

    return out

The 2D ewma function:

二维 ewma 函数:

def ewma_vectorized_2d(data, alpha, axis=None, offset=None, dtype=None, order='C', out=None):
    """
    Calculates the exponential moving average over a given axis.
    :param data: Input data, must be 1D or 2D array.
    :param alpha: scalar float in range (0,1)
        The alpha parameter for the moving average.
    :param axis: The axis to apply the moving average on.
        If axis==None, the data is flattened.
    :param offset: optional
        The offset for the moving average. Must be scalar or a
        vector with one element for each row of data. If set to None,
        defaults to the first value of each row.
    :param dtype: optional
        Data type used for calculations. Defaults to float64 unless
        data.dtype is float32, then it will use float32.
    :param order: {'C', 'F', 'A'}, optional
        Order to use when flattening the data. Ignored if axis is not None.
    :param out: ndarray, or None, optional
        A location into which the result is stored. If provided, it must have
        the same shape as the desired output. If not provided or `None`,
        a freshly-allocated array is returned.
    """
    data = np.array(data, copy=False)

    assert data.ndim <= 2

    if dtype is None:
        if data.dtype == np.float32:
            dtype = np.float32
        else:
            dtype = np.float64
    else:
        dtype = np.dtype(dtype)

    if out is None:
        out = np.empty_like(data, dtype=dtype)
    else:
        assert out.shape == data.shape
        assert out.dtype == dtype

    if data.size < 1:
        # empty input, return empty array
        return out

    if axis is None or data.ndim < 2:
        # use 1D version
        if isinstance(offset, np.ndarray):
            offset = offset[0]
        return ewma_vectorized(data, alpha, offset, dtype=dtype, order=order,
                               out=out)

    assert -data.ndim <= axis < data.ndim

    # create reshaped data views
    out_view = out
    if axis < 0:
        axis = data.ndim - int(axis)

    if axis == 0:
        # transpose data views so columns are treated as rows
        data = data.T
        out_view = out_view.T

    if offset is None:
        # use the first element of each row as the offset
        offset = np.copy(data[:, 0])
    elif np.size(offset) == 1:
        offset = np.reshape(offset, (1,))

    alpha = np.array(alpha, copy=False).astype(dtype, copy=False)

    # calculate the moving average
    row_size = data.shape[1]
    row_n = data.shape[0]
    scaling_factors = np.power(1. - alpha, np.arange(row_size + 1, dtype=dtype),
                               dtype=dtype)
    # create a scaled cumulative sum array
    np.multiply(
        data,
        np.multiply(alpha * scaling_factors[-2], np.ones((row_n, 1), dtype=dtype),
                    dtype=dtype)
        / scaling_factors[np.newaxis, :-1],
        dtype=dtype, out=out_view
    )
    np.cumsum(out_view, axis=1, dtype=dtype, out=out_view)
    out_view /= scaling_factors[np.newaxis, -2::-1]

    if not (np.size(offset) == 1 and offset == 0):
        offset = offset.astype(dtype, copy=False)
        # add the offsets to the scaled cumulative sums
        out_view += offset[:, np.newaxis] * scaling_factors[np.newaxis, 1:]

    return out

usage:

用法:

data_n = 100000000
data = ((0.5*np.random.randn(data_n)+0.5) % 1) * 100

span = 5000  # span >= 1
alpha = 2/(span+1)  # for pandas` span parameter

# com = 1000  # com >= 0
# alpha = 1/(1+com)  # for pandas` center-of-mass parameter

# halflife = 100  # halflife > 0
# alpha = 1 - np.exp(np.log(0.5)/halflife)  # for pandas` half-life parameter

result = ewma_vectorized_safe(data, alpha)


Just a tip

只是一个提示

It is easy to calculate a 'window size' (technically exponential averages have infinite 'windows') for a given alpha, dependent on the contribution of the data in that window to the average. This is useful for example to chose how much of the start of the result to treat as unreliable due to border effects.

很容易计算给定的“窗口大小”(技术上指数平均值具有无限的“窗口”),这alpha取决于该窗口中的数据对平均值的贡献。例如,这对于选择将多少结果的起点视为由于边界效应而变得不可靠非常有用。

def window_size(alpha, sum_proportion):
    # Increases with increased sum_proportion and decreased alpha
    # solve (1-alpha)**window_size = (1-sum_proportion) for window_size        
    return int(np.log(1-sum_proportion) / np.log(1-alpha))

alpha = 0.02
sum_proportion = .99  # window covers 99% of contribution to the moving average
window = window_size(alpha, sum_proportion)  # = 227
sum_proportion = .75  # window covers 75% of contribution to the moving average
window = window_size(alpha, sum_proportion)  # = 68

The alpha = 2 / (window_size + 1.0)relation used in this thread (the 'span' option from pandas) is a very rough approximation of the inverse of the above function (with sum_proportion~=0.87). alpha = 1 - np.exp(np.log(1-sum_proportion)/window_size)is more accurate (the 'half-life' option from pandas equals this formula with sum_proportion=0.5).

alpha = 2 / (window_size + 1.0)此线程中使用的关系(来自pandas的 'span' 选项)是上述函数(使用sum_proportion~=0.87)的逆函数的非常粗略的近似。alpha = 1 - np.exp(np.log(1-sum_proportion)/window_size)更准确(熊猫的“半衰期”选项等于这个公式sum_proportion=0.5)。

In the following example, datarepresents a continuous noisy signal. cutoff_idxis the first position in resultwhere at least 99% of the value is dependent on separate values in data(i.e. less than 1% depends on data[0]). The data up to cutoff_idxis excluded from the final results because it is too dependent on the first value in data, therefore possibly skewing the average.

在下面的例子中,data代表一个连续的噪声信号。cutoff_idx是第一个位置,result其中至少 99% 的值依赖于单独的值data(即小于 1% 取决于数据 [0])。直到 的数据cutoff_idx被排除在最终结果之外,因为它太依赖于 中的第一个值data,因此可能会扭曲平均值。

result = ewma_vectorized_safe(data, alpha, chunk_size)
sum_proportion = .99
cutoff_idx = window_size(alpha, sum_proportion)
result = result[cutoff_idx:]

To illustrate the problem the above solve you can run this a few times, notice the often-appearing false start of the red line, which is skipped after cutoff_idx:

为了说明上面解决的问题,你可以运行几次,注意红线经常出现的错误开始,在之后被跳过cutoff_idx

data_n = 100000
data = np.random.rand(data_n) * 100
window = 1000
sum_proportion = .99
alpha = 1 - np.exp(np.log(1-sum_proportion)/window)

result = ewma_vectorized_safe(data, alpha)

cutoff_idx = window_size(alpha, sum_proportion)
x = np.arange(start=0, stop=result.size)

import matplotlib.pyplot as plt
plt.plot(x[:cutoff_idx+1], result[:cutoff_idx+1], '-r',
         x[cutoff_idx:], result[cutoff_idx:], '-b')
plt.show()

note that cutoff_idx==windowbecause alpha was set with the inverse of the window_size()function, with the same sum_proportion. This is similar to how pandas applies ewm(span=window, min_periods=window).

请注意,cutoff_idx==window因为 alpha 是用window_size()函数的倒数设置的,具有相同的sum_proportion. 这类似于 pandas 的应用方式ewm(span=window, min_periods=window)

回答by Divakar

I think I have finally cracked it!

我想我终于破解了它!

Here's a vectorized version of numpy_ewmafunction that's claimed to be producing the correct results from @RaduS's post-

这是一个矢量化版本的numpy_ewma函数,声称可以从以下方面产生正确的结果@RaduS's post-

def numpy_ewma_vectorized(data, window):

    alpha = 2 /(window + 1.0)
    alpha_rev = 1-alpha

    scale = 1/alpha_rev
    n = data.shape[0]

    r = np.arange(n)
    scale_arr = scale**r
    offset = data[0]*alpha_rev**(r+1)
    pw0 = alpha*alpha_rev**(n-1)

    mult = data*pw0*scale_arr
    cumsums = mult.cumsum()
    out = offset + cumsums*scale_arr[::-1]
    return out

Further boost

进一步提升

We can boost it further with some code re-use, like so -

我们可以通过重用一些代码来进一步提升它,就像这样 -

def numpy_ewma_vectorized_v2(data, window):

    alpha = 2 /(window + 1.0)
    alpha_rev = 1-alpha
    n = data.shape[0]

    pows = alpha_rev**(np.arange(n+1))

    scale_arr = 1/pows[:-1]
    offset = data[0]*pows[1:]
    pw0 = alpha*alpha_rev**(n-1)

    mult = data*pw0*scale_arr
    cumsums = mult.cumsum()
    out = offset + cumsums*scale_arr[::-1]
    return out

Runtime test

运行时测试

Let's time these two against the same loopy function for a big dataset.

让我们针对大型数据集的相同循环函数对这两个函数进行计时。

In [97]: data = np.random.randint(2,9,(5000))
    ...: window = 20
    ...:

In [98]: np.allclose(numpy_ewma(data, window), numpy_ewma_vectorized(data, window))
Out[98]: True

In [99]: np.allclose(numpy_ewma(data, window), numpy_ewma_vectorized_v2(data, window))
Out[99]: True

In [100]: %timeit numpy_ewma(data, window)
100 loops, best of 3: 6.03 ms per loop

In [101]: %timeit numpy_ewma_vectorized(data, window)
1000 loops, best of 3: 665 μs per loop

In [102]: %timeit numpy_ewma_vectorized_v2(data, window)
1000 loops, best of 3: 357 μs per loop

In [103]: 6030/357.0
Out[103]: 16.89075630252101

There is around a 17times speedup!

大约有17倍的加速!

回答by Alexander McFarlane

Fastest EWMA 23x pandas

最快的 EWMA 23x pandas

The question is strictly asking for a numpysolution, however, it seems that the OP was actually just after a pure numpysolution to speed up runtime.

问题是严格要求numpy解决方案,但是,似乎 OP 实际上只是在纯粹的numpy解决方案之后以加快运​​行时。

I solved a similar problem but instead looked towards numba.jitwhich massively speeds the compute time

我解决了一个类似的问题,但转而寻找numba.jit可以大大加快计算时间的方法

In [24]: a = np.random.random(10**7)
    ...: df = pd.Series(a)
In [25]: %timeit numpy_ewma(a, 10)               # /a/42915307/4013571
    ...: %timeit df.ewm(span=10).mean()          # pandas
    ...: %timeit numpy_ewma_vectorized_v2(a, 10) # best w/o numba: /a/42926270/4013571
    ...: %timeit _ewma(a, 10)                    # fastest accurate (below)
    ...: %timeit _ewma_infinite_hist(a, 10)      # fastest overall (below)
4.14 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
991 ms ± 52.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
396 ms ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
181 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   
39.6 ms ± 979 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Scaling down to smaller arrays of a = np.random.random(100)(results in the same order)

缩小到更小的数组a = np.random.random(100)(结果相同)

41.6 μs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
945 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
16 μs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.66 μs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.14 μs ± 5.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It is also worth pointing out that my functions below are identically aligned to the pandas(see the examples in docstr), whereas a few of the answers here take various different approximations. For example,

还值得指出的是,我下面的函数与pandas(参见 docstr 中的示例)完全一致,而这里的一些答案采用了各种不同的近似值。例如,

In [57]: print(pd.DataFrame([1,2,3]).ewm(span=2).mean().values.ravel())
    ...: print(numpy_ewma_vectorized_v2(np.array([1,2,3]), 2))
    ...: print(numpy_ewma(np.array([1,2,3]), 2))
[1.         1.75       2.61538462]
[1.         1.66666667 2.55555556]
[1.         1.18181818 1.51239669]

The source code which I have documented for my own library

我为自己的库记录的源代码

import numpy as np
from numba import jit
from numba import float64
from numba import int64


@jit((float64[:], int64), nopython=True, nogil=True)
def _ewma(arr_in, window):
    r"""Exponentialy weighted moving average specified by a decay ``window``
    to provide better adjustments for small windows via:

        y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
               (1 + (1-a) + (1-a)^2 + ... + (1-a)^n).

    Parameters
    ----------
    arr_in : np.ndarray, float64
        A single dimenisional numpy array
    window : int64
        The decay window, or 'span'

    Returns
    -------
    np.ndarray
        The EWMA vector, same length / shape as ``arr_in``

    Examples
    --------
    >>> import pandas as pd
    >>> a = np.arange(5, dtype=float)
    >>> exp = pd.DataFrame(a).ewm(span=10, adjust=True).mean()
    >>> np.array_equal(_ewma_infinite_hist(a, 10), exp.values.ravel())
    True
    """
    n = arr_in.shape[0]
    ewma = np.empty(n, dtype=float64)
    alpha = 2 / float(window + 1)
    w = 1
    ewma_old = arr_in[0]
    ewma[0] = ewma_old
    for i in range(1, n):
        w += (1-alpha)**i
        ewma_old = ewma_old*(1-alpha) + arr_in[i]
        ewma[i] = ewma_old / w
    return ewma


@jit((float64[:], int64), nopython=True, nogil=True)
def _ewma_infinite_hist(arr_in, window):
    r"""Exponentialy weighted moving average specified by a decay ``window``
    assuming infinite history via the recursive form:

        (2) (i)  y[0] = x[0]; and
            (ii) y[t] = a*x[t] + (1-a)*y[t-1] for t>0.

    This method is less accurate that ``_ewma`` but
    much faster:

        In [1]: import numpy as np, bars
           ...: arr = np.random.random(100000)
           ...: %timeit bars._ewma(arr, 10)
           ...: %timeit bars._ewma_infinite_hist(arr, 10)
        3.74 ms ± 60.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
        262 μs ± 1.54 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    Parameters
    ----------
    arr_in : np.ndarray, float64
        A single dimenisional numpy array
    window : int64
        The decay window, or 'span'

    Returns
    -------
    np.ndarray
        The EWMA vector, same length / shape as ``arr_in``

    Examples
    --------
    >>> import pandas as pd
    >>> a = np.arange(5, dtype=float)
    >>> exp = pd.DataFrame(a).ewm(span=10, adjust=False).mean()
    >>> np.array_equal(_ewma_infinite_hist(a, 10), exp.values.ravel())
    True
    """
    n = arr_in.shape[0]
    ewma = np.empty(n, dtype=float64)
    alpha = 2 / float(window + 1)
    ewma[0] = arr_in[0]
    for i in range(1, n):
        ewma[i] = arr_in[i] * alpha + ewma[i-1] * (1 - alpha)
    return ewma

回答by James

Here is an implementation using NumPy that is equivalent to using df.ewm(alpha=alpha).mean(). After reading the documentation, it is just a few matrix operations. The trick is constructing the right matrices.

这是一个使用 NumPy 的实现,相当于使用df.ewm(alpha=alpha).mean(). 阅读文档后,只是一些矩阵运算。诀窍是构建正确的矩阵。

It is worth noting that because we are creating float matrices, you can quickly eat through your memory if the input array is too large.

值得注意的是,因为我们正在创建浮点矩阵,如果输入数组太大,您可能会很快吃掉您的内存。

import pandas as pd
import numpy as np

def ewma(x, alpha):
    '''
    Returns the exponentially weighted moving average of x.

    Parameters:
    -----------
    x : array-like
    alpha : float {0 <= alpha <= 1}

    Returns:
    --------
    ewma: numpy array
          the exponentially weighted moving average
    '''
    # Coerce x to an array
    x = np.array(x)
    n = x.size

    # Create an initial weight matrix of (1-alpha), and a matrix of powers
    # to raise the weights by
    w0 = np.ones(shape=(n,n)) * (1-alpha)
    p = np.vstack([np.arange(i,i-n,-1) for i in range(n)])

    # Create the weight matrix
    w = np.tril(w0**p,0)

    # Calculate the ewma
    return np.dot(w, x[::np.newaxis]) / w.sum(axis=1)

Let's test its:

让我们测试一下:

alpha = 0.55
x = np.random.randint(0,30,15)
df = pd.DataFrame(x, columns=['A'])
df.ewm(alpha=alpha).mean()

# returns:
#             A
# 0   13.000000
# 1   22.655172
# 2   20.443268
# 3   12.159796
# 4   14.871955
# 5   15.497575
# 6   20.743511
# 7   20.884818
# 8   24.250715
# 9   18.610901
# 10  17.174686
# 11  16.528564
# 12  17.337879
# 13   7.801912
# 14  12.310889

ewma(x=x, alpha=alpha)

# returns:
# array([ 13.        ,  22.65517241,  20.44326778,  12.1597964 ,
#        14.87195534,  15.4975749 ,  20.74351117,  20.88481763,
#        24.25071484,  18.61090129,  17.17468551,  16.52856393,
#        17.33787888,   7.80191235,  12.31088889])

回答by Divakar

Given alphaand windowSize, here's an approach to simulate the corresponding behavior on NumPy -

鉴于alphawindowSize,这是一种在 NumPy 上模拟相应行为的方法 -

def numpy_ewm_alpha(a, alpha, windowSize):
    wghts = (1-alpha)**np.arange(windowSize)
    wghts /= wghts.sum()
    out = np.full(df.shape[0],np.nan)
    out[windowSize-1:] = np.convolve(a,wghts,'valid')
    return out

Sample runs for verification -

用于验证的样品运行 -

In [54]: alpha = 0.55
    ...: windowSize = 20
    ...: 

In [55]: df = pd.DataFrame(np.random.randint(2,9,(100)))

In [56]: out0 = df.ewm(alpha = alpha, min_periods=windowSize).mean().as_matrix().ravel()
    ...: out1 = numpy_ewm_alpha(df.values.ravel(), alpha = alpha, windowSize = windowSize)
    ...: print "Max. error : " + str(np.nanmax(np.abs(out0 - out1)))
    ...: 
Max. error : 5.10531254605e-07

In [57]: alpha = 0.75
    ...: windowSize = 30
    ...: 

In [58]: out0 = df.ewm(alpha = alpha, min_periods=windowSize).mean().as_matrix().ravel()
    ...: out1 = numpy_ewm_alpha(df.values.ravel(), alpha = alpha, windowSize = windowSize)
    ...: print "Max. error : " + str(np.nanmax(np.abs(out0 - out1)))

Max. error : 8.881784197e-16

Runtime test on bigger dataset -

更大数据集上的运行时测试 -

In [61]: alpha = 0.55
    ...: windowSize = 20
    ...: 

In [62]: df = pd.DataFrame(np.random.randint(2,9,(10000)))

In [63]: %timeit df.ewm(alpha = alpha, min_periods=windowSize).mean()
1000 loops, best of 3: 851 μs per loop

In [64]: %timeit numpy_ewm_alpha(df.values.ravel(), alpha = alpha, windowSize = windowSize)
1000 loops, best of 3: 204 μs per loop


Further boost

进一步提升

For further performance boost we could avoid the initialization with NaNs and instead use the array outputted from np.convolve, like so -

为了进一步提升性能,我们可以避免使用 NaN 进行初始化,而是使用从 输出的数组np.convolve,如下所示 -

def numpy_ewm_alpha_v2(a, alpha, windowSize):
    wghts = (1-alpha)**np.arange(windowSize)
    wghts /= wghts.sum()
    out = np.convolve(a,wghts)
    out[:windowSize-1] = np.nan
    return out[:a.size]  

Timings -

时间——

In [117]: alpha = 0.55
     ...: windowSize = 20
     ...: 

In [118]: df = pd.DataFrame(np.random.randint(2,9,(10000)))

In [119]: %timeit numpy_ewm_alpha(df.values.ravel(), alpha = alpha, windowSize = windowSize)
1000 loops, best of 3: 204 μs per loop

In [120]: %timeit numpy_ewm_alpha_v2(df.values.ravel(), alpha = alpha, windowSize = windowSize)
10000 loops, best of 3: 195 μs per loop

回答by RaduS

Here is another solution O came up with in the meantime. It is about four times faster than the pandas solution.

这是 O 在此期间提出的另一个解决方案。它比熊猫解决方案快四倍。

def numpy_ewma(data, window):
    returnArray = np.empty((data.shape[0]))
    returnArray.fill(np.nan)
    e = data[0]
    alpha = 2 / float(window + 1)
    for s in range(data.shape[0]):
        e =  ((data[s]-e) *alpha ) + e
        returnArray[s] = e
    return returnArray

I used this formulaas a starting point. I am sure that this can be improved even more, but it is at least a starting point.

我用这个公式作为起点。我相信这可以进一步改进,但这至少是一个起点。

回答by Danny

@Divakar's answer seems to cause overflow when dealing with

@Divakar 的回答似乎在处理时导致溢出

numpy_ewma_vectorized(np.random.random(500000), 10)

What I have been using is:

我一直在使用的是:

def EMA(input, time_period=10): # For time period = 10
    t_ = time_period - 1
    ema = np.zeros_like(input,dtype=float)
    multiplier = 2.0 / (time_period + 1)
    #multiplier = 1 - multiplier
    for i in range(len(input)):
        # Special Case
        if i > t_:
            ema[i] = (input[i] - ema[i-1]) * multiplier + ema[i-1]
        else:
            ema[i] = np.mean(input[:i+1])
    return ema

However, this is way slower than the panda solution:

但是,这比熊猫解决方案慢得多:

from pandas import ewma as pd_ema
def EMA_fast(X, time_period = 10):
    out = pd_ema(X, span=time_period, min_periods=time_period)
    out[:time_period-1] = np.cumsum(X[:time_period-1]) / np.asarray(range(1,time_period))
    return out

回答by snooper77

A very simple solution that avoids numba but that is almost as fast as Alexander McFarlane's solution, especially for large arrays and large windowsizes, is to use scipy's lfilterfunction (because an EWMA is a linear filter):

避免 numba 但几乎与Alexander McFarlane's solution一样快的一个非常简单的解决方案,特别是对于大型数组和大window尺寸,是使用 scipy 的lfilter函数(因为 EWMA 是线性过滤器):

from scipy.signal import lfiltic, lfilter
# careful not to mix between scipy.signal and standard python signal 
# (https://docs.python.org/3/library/signal.html) if your code handles some processes

def ewma_linear_filter(array, window):
    alpha = 2 /(window + 1)
    b = [alpha]
    a = [1, alpha-1]
    zi = lfiltic(b, a, array[0:1], [0])
    return lfilter(b, a, array, zi=zi)[0]

Timings are as follows:

时间安排如下:

n = 10_000_000
window = 100_000
data = np.random.normal(0, 1, n)

%timeit _ewma_infinite_hist(data, window)
%timeit linear_filter(data, window)

86 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
92.6 ms ± 751 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

回答by Gabriel_F

Thanks to @Divakar's solution and that is really fast. However, it does cause overflow problem which was pointed out by @Danny. The function doesn't return correct answers when the length is greater than 13835 or so at my end.

感谢@Divakar 的解决方案,这真的很快。但是,它确实会导致@Danny 指出的溢出问题。当长度大于 13835 左右时,该函数不会返回正确答案。

The following is my solution based on Divakar's solution and pandas.ewm().mean()

以下是我基于 Divakar 的解决方案和 pandas.ewm().mean() 的解决方案

def numpy_ema(data, com=None, span=None, halflife=None, alpha=None):
"""Summary
Calculate ema with automatically-generated alpha. Weight of past effect
decreases as the length of window increasing.

# these functions reproduce the pandas result when the flag adjust=False is set.
References:
https://stackoverflow.com/questions/42869495/numpy-version-of-exponential-weighted-moving-average-equivalent-to-pandas-ewm

Args:
    data (TYPE): Description
    com (float, optional): Specify decay in terms of center of mass, alpha=1/(1+com), for com>=0
    span (float, optional): Specify decay in terms of span, alpha=2/(span+1), for span>=1
    halflife (float, optional): Specify decay in terms of half-life, alpha=1-exp(log(0.5)/halflife), for halflife>0
    alpha (float, optional): Specify smoothing factor alpha directly, 0<alpha<=1

Returns:
    TYPE: Description

Raises:
    ValueError: Description
"""
n_input = sum(map(bool, [com, span, halflife, alpha]))
if n_input != 1:
    raise ValueError(
        'com, span, halflife, and alpha are mutually exclusive')

nrow = data.shape[0]
if np.isnan(data).any() or (nrow > 13835) or (data.ndim == 2):
    df = pd.DataFrame(data)
    df_ewm = df.ewm(com=com, span=span, halflife=halflife,
                    alpha=alpha, adjust=False)
    out = df_ewm.mean().values.squeeze()
else:
    if com:
        alpha = 1 / (1 + com)
    elif span:
        alpha = 2 / (span + 1.0)
    elif halflife:
        alpha = 1 - np.exp(np.log(0.5) / halflife)

    alpha_rev = 1 - alpha
    pows = alpha_rev**(np.arange(nrow + 1))

    scale_arr = 1 / pows[:-1]
    offset = data[0] * pows[1:]
    pw0 = alpha * alpha_rev**(nrow - 1)

    mult = data * pw0 * scale_arr

    cumsums = np.cumsum(mult)
    out = offset + cumsums * scale_arr[::-1]
return out

回答by Samuel Utomo

This answer may seem irrelevant. But, for those who also need to calculate the exponentially weighted variance (and also standard deviation) with NumPy, the following solution will be useful:

这个答案似乎无关紧要。但是,对于那些还需要使用 NumPy 计算指数加权方差(以及标准差)的人,以下解决方案将很有用:

import numpy as np

def ew(a, alpha, winSize):
    _alpha = 1 - alpha
    ws = _alpha ** np.arange(winSize)
    w_sum = ws.sum()
    ew_mean = np.convolve(a, ws)[winSize - 1] / w_sum
    bias = (w_sum ** 2) / ((w_sum ** 2) - (ws ** 2).sum())
    ew_var = (np.convolve((a - ew_mean) ** 2, ws)[winSize - 1] / w_sum) * bias
    ew_std = np.sqrt(ew_var)
    return (ew_mean, ew_var, ew_std)