Python 与 pandas 的互相关(时间滞后相关)?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/33171413/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Cross-correlation (time-lag-correlation) with pandas?
提问 by JC_CL
I have various time series that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.
我有多个时间序列,想将它们相互做相关 - 或者更确切地说,互相关 - 以找出在哪个时间滞后处相关系数最大。
I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.
我发现了各种问题和答案/链接,讨论了如何使用 numpy 进行操作,但这意味着我必须将数据帧转换为 numpy 数组。而且由于我的时间序列经常涵盖不同的时期,我怕我会陷入混乱。
Edit
编辑
The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry (length of the longer series) array full of nan.
我在所有 numpy/scipy 方法中遇到的问题是,它们似乎缺乏对我数据的时间序列性质的认识。当我把一个从 1940 年开始的时间序列与一个从 1970 年开始的时间序列做相关时,pandas 的 corr 知道这一点,而 np.correlate 只会生成一个 1020 个条目(较长序列的长度)、全是 nan 的数组。
The various Q's on this subject indicate that there should be a way to solve the different-length issue, but so far I have seen no indication of how to use it for specific time periods. I just need to shift by 12 months in increments of 1, to see the time of maximum correlation within one year.
关于这个主题的各种问题表明,应该有办法解决长度不同的问题,但到目前为止,我还没有看到关于如何将其用于特定时间段的说明。我只需要以 1 为增量移动 12 个月,以查看一年之内相关性最大的时间。
Edit2
编辑2
Some minimal sample data:
一些最小的样本数据:
import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.random_integers(-30,30,(len(dfdates1)))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.random_integers(-30,30,(len(dfdates2)))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)
Due to various processing steps, those dfs end up changed into dfs that are indexed from 1940 to 2015. This should reproduce that:
由于各种处理步骤,这些 df 最终变成了索引从 1940 年到 2015 年的 df。下面的代码应该可以重现这一点:
bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)
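For reference, the same frames could also be built with reindex instead of concat; a minimal sketch, assuming the df1 / df2 and bigdates defined above:
作为参考,同样的结果也可以用 reindex 而不是 concat 得到;下面是一个最小示意,假设使用上面定义的 df1 / df2 和 bigdates:
# Align each frame onto the common 1940-2015 monthly index;
# months without data become NaN, just like with the concat above.
big1 = df1.reindex(bigdates)
big2 = df2.reindex(bigdates)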
This is what I get when I correlate with pandas and shift one dataset:
这就是我用 pandas 做相关并移动其中一个数据集时得到的结果:
In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523
And trying scipy:
并尝试 scipy:
In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]:
array([[ nan],
[ nan],
[ nan],
...,
[ nan],
[ nan],
[ nan]])
which according to whos is
根据 whos 的输出,它是
scicorr ndarray 1801x1: 1801 elems, type `float64`, 14408 bytes
But I'd just like to have 12 entries. /Edit2
但我只想有 12 个条目。 /编辑2
The idea I have come up with, is to implement a time-lag-correlation myself, like so:
我提出的想法是自己实现时间滞后相关性,如下所示:
corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on
But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I still would prefer a built-in method.
但这可能很慢,而且我可能是在重新发明轮子。编辑:上述方法似乎有效,我已将其放入循环中,遍历一年中的全部 12 个月,但我仍然更希望有内置方法。
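A minimal sketch of what such a loop could look like, using the big1 / big2 frames from Edit2; the exact loop from the question is not shown, so this is only an illustration:
下面是这种循环大致可能的写法,使用 Edit2 中的 big1 / big2;问题中没有给出原始循环,因此这只是示意:
# Correlate big1 against big2 shifted by 0..11 months, keeping one
# coefficient per lag; idxmax then gives the lag with the highest correlation.
lags = range(12)
corrs = pd.Series([big1[0].corr(big2[0].shift(lag)) for lag in lags], index=lags)
print(corrs)
print(corrs.idxmax())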
回答 by Daniel Watkins
As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:
据我所知,没有一个内置方法可以完全满足您的要求。但是,如果您查看 pandas Series 的 autocorr 方法的源代码,就会发现您的想法是正确的:
def autocorr(self, lag=1):
"""
Lag-N autocorrelation
Parameters
----------
lag : int, default 1
Number of lags to apply before performing autocorrelation.
Returns
-------
autocorr : float
"""
return self.corr(self.shift(lag))
So a simple time-lagged cross-correlation function would be
所以,一个简单的时间滞后互相关函数可以写成
def crosscorr(datax, datay, lag=0):
""" Lag-N cross correlation.
Parameters
----------
lag : int, default 0
datax, datay : pandas.Series objects of equal length
Returns
----------
crosscorr : float
"""
return datax.corr(datay.shift(lag))
Then if you wanted to look at the cross correlations at each month, you could do
然后如果你想查看每个月的互相关,你可以这样做
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
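If you then want the lag with the strongest relationship, one possibility is to wrap that list in a Series and take idxmax; datax / datay are the same placeholders as above:
如果接下来想找出相关性最强的滞后,一种做法是把这个列表包装成 Series 再取 idxmax;datax / datay 与上面一样只是占位符:
import pandas as pd

# one correlation per lag (0-11 months); idxmax returns the lag with the largest value
xcov_monthly = pd.Series([crosscorr(datax, datay, lag=i) for i in range(12)])
best_lag = xcov_monthly.idxmax()
print(best_lag, xcov_monthly[best_lag])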
回答 by Andre Araujo
There is a better approach: you can create a function that shifts your dataframe first, before calling corr().
有一个更好的方法:您可以创建一个函数,在调用 corr() 之前先移动数据帧。
Take this dataframe as an example:
以下面这个数据框为例:
d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)
>>> df
prcp stp
0 0.1 0.0
1 0.2 0.1
2 0.3 0.2
3 0.0 0.3
Your function to shift the other columns (except the target):
用于移动其他列(目标列除外)的函数:
def df_shifted(df, target=None, lag=0):
    # Shift every column except the target by `lag` periods, so that a
    # subsequent corr() compares the target with lagged versions of the others.
    if not lag and not target:
        return df
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return pd.DataFrame(data=new)
Supposing that your target is comparing the prcp (precipitation variable) with stp (atmospheric pressure).
假设您的目标是将 prcp(降水变量)与 stp(大气压力)进行比较。
If you correlate them as they are (without shifting), it will be:
如果不做移动、直接求相关,结果将是:
>>> df.corr()
prcp stp
prcp 1.0 -0.2
stp -0.2 1.0
But if you shift all the other columns by 1 (one) period and keep the target (prcp):
但是,如果将除目标列 (prcp) 之外的所有其他列移动 1 个周期:
df_new = df_shifted(df, 'prcp', lag=-1)
>>> print(df_new)
prcp stp
0 0.1 0.1
1 0.2 0.2
2 0.3 0.3
3 0.0 NaN
Note that the stp column is now shifted up by one position, so if you call corr(), the result will be:
请注意,stp 列现在向上移动了一个位置,因此如果调用 corr(),结果将是:
>>> df_new.corr()
prcp stp
prcp 1.0 1.0
stp 1.0 1.0
So, you can do this with lag -1, -2, ..., -n!
所以,你可以用滞后 -1、-2、……、-n 来做!
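For example, to compare several lags with this helper, one could loop over them; a sketch, assuming the df and df_shifted defined above (on this 4-row toy frame, the larger lags are of course not very meaningful):
例如,要用这个辅助函数比较多个滞后,可以对它们做循环;下面是一个示意,假设使用上面定义的 df 和 df_shifted(在这个只有 4 行的玩具数据上,较大的滞后当然意义不大):
# Collect the prcp/stp correlation for a few lags of the non-target columns.
results = {}
for lag in [0, -1, -2]:
    shifted = df_shifted(df, 'prcp', lag=lag)
    results[lag] = shifted.corr().loc['prcp', 'stp']
print(results)   # dict mapping lag -> correlation between prcp and the shifted stp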
回答 by Itamar Mushkin
To build on Andre's answer: if you only care about the (lagged) correlation to the target, but want to test various lags (e.g. to see which lag gives the highest correlation), you can do something like this:
在 Andre 的答案基础上:如果您只关心与目标的(滞后)相关性,但想测试多个不同的滞后(例如,看哪个滞后的相关性最高),可以这样做:
lagged_correlation = pd.DataFrame.from_dict(
{x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})
This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).
这样,每一行对应一个不同的滞后值,每一列对应一个不同的变量(其中一个是目标本身,给出自相关)。
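A small self-contained example of how this one-liner might be used; the frame, target and max_lag below are made-up placeholders, where column x is simply the target delayed by 2 steps plus noise:
下面是这个一行式用法的一个小的自包含示例;其中的数据框、target 和 max_lag 都是虚构的占位值,列 x 只是把目标延迟 2 步再加上噪声:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=100))
df = pd.DataFrame({'y': y, 'x': y.shift(2) + 0.1 * rng.normal(size=100)})
target = 'y'
max_lag = 6

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})
print(lagged_correlation)
# idxmax per column: the lag with the highest correlation
# (0 for y itself, 2 for the delayed column x)
print(lagged_correlation.idxmax())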