What's the difference between pandas ACF and statsmodels ACF?
Original question: http://stackoverflow.com/questions/36038927/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by BML91
I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]                            # statsmodels ACF, lags 1-31
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1, 32)]   # pandas autocorr, lags 1-31
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was that the values they produced weren't identical:
What accounts for this difference, and which values should be used?
Accepted answer by nikhase
The difference between the Pandas and Statsmodels versions lies in the mean subtraction and the normalization / variance division:
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside that method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.
acf, in contrast, uses the overall series' sample mean and sample variance to determine the correlation coefficient.
The differences may get smaller for longer time series but are quite big for short ones.
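As a rough illustration of that point (a minimal sketch of my own, assuming random white-noise data and the same two functions from the question; it is not part of the original answer), the average gap between the two estimators can be measured for increasing series lengths:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

def mean_abs_gap(n, nlags=10, seed=0):
    # Average absolute difference between the statsmodels ACF and the
    # pandas autocorr values over the first `nlags` lags of a random series.
    rng = np.random.RandomState(seed)
    s = pd.Series(rng.randn(n))
    sm_vals = acf(s, nlags=nlags)[1:]                        # overall mean / variance
    pd_vals = [s.autocorr(k) for k in range(1, nlags + 1)]   # subseries means / variances
    return float(np.mean(np.abs(sm_vals - np.array(pd_vals))))

for n in (30, 300, 3000):
    print(n, mean_abs_gap(n))  # the gap should shrink as n grows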
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-corr) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the subseries means
sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
# Normalize with the subseries stds
return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))
def acf_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the mean of the whole series x to calculate Cov
sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
# Normalize with var of whole series
return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
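For what it's worth, here is a hedged sketch (my own, not statsmodels' actual source) of how the same unbiased estimator as acf_by_hand can be expressed with np.correlate:

import numpy as np

def acf_via_correlate(x, nlags):
    # Demean with the overall series mean, as acf does.
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    n = len(x)
    # Full correlation of the demeaned series with itself; the entries
    # from index n-1 onward are the autocovariance sums for lags 0, 1, 2, ...
    raw = np.correlate(xd, xd, mode="full")[n - 1:]
    # Dividing each lag-k sum by (n - k) mirrors acf(..., unbiased=True).
    acov = raw / (n - np.arange(n))
    return acov[:nlags + 1] / acov[0]

# For the x defined above, acf_via_correlate(x, 9) should closely match
# acf(x, unbiased=True, nlags=9).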
Answered by Marein
As suggested in the comments, the discrepancy can be reduced, but not completely eliminated, by supplying unbiased=True to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
data = pd.Series(np.random.random(DATA_LEN))
data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)
data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
# return difference between results
return sum(abs(data_acf_1 - data_acf_2))
for value in (False, True):
diffs = [test(value) for _ in range(N_TESTS)]
print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593
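Note that in recent statsmodels releases the unbiased keyword of acf has been renamed to adjusted (as far as I know the old name is deprecated and later removed), so on a current install the call above would look something like:

# Hedged equivalent for newer statsmodels versions (assumption: the adjusted keyword is available):
data_acf_1 = acf(data, adjusted=True, nlags=N_LAGS)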