What's the difference between pandas ACF and statsmodels ACF?
Original question: http://stackoverflow.com/questions/36038927/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by BML91
I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]                            # statsmodels ACF, lags 1-31
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1, 32)]   # pandas autocorr, lags 1-31
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was that the values they produced weren't identical:
What accounts for this difference, and which values should be used?
Accepted answer by nikhase
The difference between the Pandas and Statsmodels versions lies in the mean subtraction and the normalization / variance division:
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside that method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
acf, in contrast, uses the overall series' sample mean and sample variance to determine the correlation coefficient.
The differences may get smaller for longer time series but are quite big for short ones.
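As a rough illustration of that point (a minimal sketch of my own, assuming random white-noise data and the same two functions from the question; it is not part of the original answer), the average gap between the two estimators can be measured for increasing series lengths:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

def mean_abs_gap(n, nlags=10, seed=0):
    # Average absolute difference between the statsmodels ACF and the
    # pandas autocorr values over the first `nlags` lags of a random series.
    rng = np.random.RandomState(seed)
    s = pd.Series(rng.randn(n))
    sm_vals = acf(s, nlags=nlags)[1:]                        # overall mean / variance
    pd_vals = [s.autocorr(k) for k in range(1, nlags + 1)]   # subseries means / variances
    return float(np.mean(np.abs(sm_vals - np.array(pd_vals))))

for n in (30, 300, 3000):
    print(n, mean_abs_gap(n))  # the gap should shrink as n grows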
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-corr) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the subseries means
sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
# Normalize with the subseries stds
return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))
def acf_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the mean of the whole series x to calculate Cov
sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
# Normalize with var of whole series
return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
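For what it's worth, here is a hedged sketch (my own, not statsmodels' actual source) of how the same unbiased estimator as acf_by_hand can be expressed with np.correlate:

import numpy as np

def acf_via_correlate(x, nlags):
    # Demean with the overall series mean, as acf does.
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    n = len(x)
    # Full correlation of the demeaned series with itself; the entries
    # from index n-1 onward are the autocovariance sums for lags 0, 1, 2, ...
    raw = np.correlate(xd, xd, mode="full")[n - 1:]
    # Dividing each lag-k sum by (n - k) mirrors acf(..., unbiased=True).
    acov = raw / (n - np.arange(n))
    return acov[:nlags + 1] / acov[0]

# For the x defined above, acf_via_correlate(x, 9) should closely match
# acf(x, unbiased=True, nlags=9).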
Answered by Marein
As suggested in the comments, the discrepancy can be reduced, but not completely eliminated, by supplying unbiased=True to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
data = pd.Series(np.random.random(DATA_LEN))
data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)
data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
# return difference between results
return sum(abs(data_acf_1 - data_acf_2))
for value in (False, True):
diffs = [test(value) for _ in range(N_TESTS)]
print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593
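Note that in recent statsmodels releases the unbiased keyword of acf has been renamed to adjusted (as far as I know the old name is deprecated and later removed), so on a current install the call above would look something like:

# Hedged equivalent for newer statsmodels versions (assumption: the adjusted keyword is available):
data_acf_1 = acf(data, adjusted=True, nlags=N_LAGS)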