
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/36038927/

Date: 2020-08-19 17:19:37  Source: igfitidea

What's the difference between pandas ACF and statsmodel ACF?

Tags: python, pandas, statsmodels

Asked by BML91

I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions: the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:


import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf

ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)

ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)

ticker_data_acf_1 = acf(ticker_data, nlags=31)[1:]  # statsmodels ACF, lags 1..31
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1, 32)]  # Pandas, lags 1..31

test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']  # labels match the row order above
test_df.index += 1
test_df.plot(kind='bar')

What I noticed was the values they predicted weren't identical:


[Image: bar chart comparing the Pandas and statsmodels autocorrelation values by lag]

What accounts for this difference, and which values should be used?


Accepted answer by nikhase

The difference between the Pandas and statsmodels versions lies in the mean subtraction and the normalization / variance division:


  • autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside this method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
  • acf, in contrast, uses the sample mean and sample variance of the whole series to determine the correlation coefficient.

The differences may get smaller for longer time series but are quite big for short ones.

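To make the shrinking gap concrete, here is a small sketch (mine, not from the original post; the trend series and lag range are arbitrary choices) that compares an acf-style estimate, which normalizes by the overall mean and variance, against pandas' per-subseries autocorr on a short and a long series:

```python
import numpy as np
import pandas as pd

def acf_style(x, lag):
    # acf-style: covariance against the overall mean, normalized by the
    # overall variance (unbiased flavour, dividing by len(x) - lag)
    y1, y2 = x[:len(x) - lag], x[lag:]
    return np.sum((y1 - x.mean()) * (y2 - x.mean())) / ((len(x) - lag) * x.var())

gaps = {}
for n in (30, 3000):
    x = np.arange(n, dtype=float)  # a pure trend: lagged subseries correlate perfectly
    s = pd.Series(x)
    # pandas autocorr is essentially 1 at every lag here; acf_style falls short,
    # and the shortfall shrinks as the series grows
    gaps[n] = max(abs(acf_style(x, lag) - s.autocorr(lag)) for lag in range(1, 6))
    print(n, gaps[n])
```

On the short series the two definitions visibly disagree; on the long one the gap is an order of magnitude or more smaller.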

Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-correlation) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).


See this MWE for clarification:


import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")  # on matplotlib >= 3.6 this style is named "seaborn-v0_8-colorblind"

def autocorr_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the subseries means
    sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
    # Normalize with the subseries stds
    return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))

def acf_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x)-lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
    # Normalize with var of whole series
    return sum_product / ((len(x) - lag) * np.var(x))

x = np.linspace(0,100,101)

results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)  # unbiased= was renamed to adjusted= in newer statsmodels

pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()

[Image: bar plot comparing acf_by_hand, autocorr_by_hand, pandas autocorr, and statsmodels acf across lags]

Statsmodels uses np.correlate to optimize this, but this is basically how it works.

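As a rough illustration of that idea (an assumption about the approach, not statsmodels' actual implementation), the biased ACF for all lags can be computed in one np.correlate call:

```python
import numpy as np

def acf_via_correlate(x, nlags):
    # Demean once with the overall mean, then let np.correlate produce the
    # lagged products for every lag at once
    xd = x - x.mean()
    full = np.correlate(xd, xd, mode="full")   # length 2*len(x) - 1
    acov = full[len(x) - 1:] / len(x)          # non-negative lags, biased estimate
    return acov[:nlags + 1] / acov[0]          # normalize by the lag-0 autocovariance

x = np.random.default_rng(1).standard_normal(200)
print(acf_via_correlate(x, 5))
```

The result matches a per-lag loop that subtracts the overall mean and divides by the overall variance, i.e. the acf_by_hand logic above without the unbiased correction.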

Answered by Marein

As suggested in the comments, the problem can be decreased, but not completely resolved, by supplying unbiased=True to the statsmodels function. Using a random input:


import statistics

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32

def test(unbiased):
  data = pd.Series(np.random.random(DATA_LEN))
  data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)  # unbiased= is adjusted= in newer statsmodels
  data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
  # return difference between results
  return sum(abs(data_acf_1 - data_acf_2))

for value in (False, True):
  diffs = [test(value) for _ in range(N_TESTS)]
  print(value, statistics.mean(diffs))

Output:


False 0.464562410987
True 0.0820847168593
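The remaining gap comes from the normalization that unbiased=True does not change: statsmodels still uses the overall mean and variance, while pandas uses per-subseries statistics. A numpy-only sketch (the helper name acf_hand is mine, not from the answer) showing that the adjusted flag only swaps the denominator from n to n - lag:

```python
import numpy as np
import pandas as pd

def acf_hand(x, lag, adjusted):
    # Overall mean and variance either way; adjusted only changes the
    # denominator from len(x) to len(x) - lag
    y1, y2 = x[:len(x) - lag], x[lag:]
    denom = (len(x) - lag) if adjusted else len(x)
    return np.sum((y1 - x.mean()) * (y2 - x.mean())) / (denom * x.var())

x = np.random.default_rng(2).standard_normal(100)
s = pd.Series(x)
for lag in (1, 5, 20):
    # adjusted rescales the estimate, but the subseries means and stds that
    # pandas uses still differ, so the values rarely coincide exactly
    print(lag, acf_hand(x, lag, False), acf_hand(x, lag, True), s.autocorr(lag))
```

At small lags the rescaling factor n / (n - lag) is close to 1, so the adjustment matters most, and the residual disagreement with pandas is largest, at long lags relative to the series length.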