Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/33171413/

Date: 2020-08-19 12:55:50  Source: igfitidea

Cross-correlation (time-lag-correlation) with pandas?

python | numpy | pandas | correlation | cross-correlation

Asked by JC_CL

I have various time series that I want to correlate, or rather cross-correlate, with each other, to find out at which time lag the correlation factor is greatest.

I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.

Edit

The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time-series nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry array (the length of the longer series) full of nan.

The various questions on this subject indicate that there should be a way to solve the different-length issue, but so far I have seen no indication of how to use it for specific time periods. I just need to shift by up to 12 months in increments of 1, to see the time of maximum correlation within one year.

Edit2

Some minimal sample data:

import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq='MS')
dfdata1 = np.random.randint(-30, 31, len(dfdates1)) / 10.0  # My real data is from measurements, but random values between -3 and 3 are fitting
df1 = pd.DataFrame(dfdata1, index=dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq='MS')
dfdata2 = np.random.randint(-30, 31, len(dfdates2)) / 10.0
df2 = pd.DataFrame(dfdata2, index=dfdates2)

Due to various processing steps, those dfs end up changed into dfs that are indexed from 1940 to 2015. This should reproduce that:

bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)

This is what I get when I correlate with pandas and shift one dataset:

In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523

And trying scipy (after import scipy.signal):

In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]: 
array([[ nan],
       [ nan],
       [ nan],
       ..., 
       [ nan],
       [ nan],
       [ nan]])

which, according to whos, is

scicorr               ndarray                       1801x1: 1801 elems, type `float64`, 14408 bytes

But I'd just like to have 12 entries. /Edit2

The idea I have come up with, is to implement a time-lag-correlation myself, like so:

corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on

But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.

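The loop described above can be sketched as follows; the two monthly series here are hypothetical stand-ins for the real measured data:

```python
import numpy as np
import pandas as pd

def lagged_corrs(s1, s2, max_lag=12):
    """Correlate s1 with s2 shifted by 0..max_lag-1 periods."""
    return [s1.corr(s2.shift(lag)) for lag in range(max_lag)]

# Hypothetical monthly series standing in for the measured data
idx = pd.date_range('1990-01-01', periods=120, freq='MS')
rng = np.random.default_rng(42)
s1 = pd.Series(rng.normal(size=120), index=idx)
s2 = pd.Series(rng.normal(size=120), index=idx)

corrs = lagged_corrs(s1, s2)
best_lag = int(np.nanargmax(corrs))  # lag (in months) with the highest correlation
```
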
Answered by Daniel Watkins

As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:

def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))

So a simple time-lagged cross-correlation function would be

def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation. 
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    -------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))

Then if you wanted to look at the cross correlations at each month, you could do

 xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
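Because Series.corr aligns on the index and ignores non-overlapping values, this also copes with series covering different periods, which was the original concern. A quick sketch with two hypothetical, partially overlapping monthly series:

```python
import numpy as np
import pandas as pd

def crosscorr(datax, datay, lag=0):
    """Lag-N cross correlation of two pandas Series."""
    return datax.corr(datay.shift(lag))

rng = np.random.default_rng(0)
datax = pd.Series(rng.normal(size=240),
                  index=pd.date_range('1980-01-01', periods=240, freq='MS'))  # 1980-1999
datay = pd.Series(rng.normal(size=240),
                  index=pd.date_range('1990-01-01', periods=240, freq='MS'))  # 1990-2009

# corr() aligns the two series, so only the overlapping 1990-1999 window contributes
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
```
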

Answered by Andre Araujo

There is a better approach: you can create a function that shifts your dataframe first, before calling corr().

Take this dataframe as an example:

d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)

>>> df
   prcp  stp
0   0.1  0.0
1   0.2  0.1
2   0.3  0.2
3   0.0  0.3

A function to shift the other columns (except the target):

def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return  pd.DataFrame(data=new)

Suppose your target is to compare prcp (the precipitation variable) with stp (atmospheric pressure).

If you correlate them as-is, you get:

>>> df.corr()
      prcp  stp
prcp   1.0 -0.2
stp   -0.2  1.0

But if you shift all other columns by one period and keep the target (prcp):

df_new = df_shifted(df, 'prcp', lag=-1)

>>> print(df_new)
   prcp  stp
0   0.1  0.1
1   0.2  0.2
2   0.3  0.3
3   0.0  NaN

Note that the stp column is now shifted up by one period, so if you call corr(), you get:

>>> df_new.corr()
      prcp  stp
prcp   1.0  1.0
stp    1.0  1.0

So you can do the same with lags -1, -2, ..., -n.

Answered by Itamar Mushkin

To build on Andre's answer: if you only care about (lagged) correlation to the target, but want to test various lags (e.g. to see which lag gives the highest correlation), you can do something like this:

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})

This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).

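Since the snippet above leaves df, target, and max_lag undefined, here is a self-contained sketch with hypothetical data; idxmax then reads off the best lag per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df / target / max_lag from the snippet above
rng = np.random.default_rng(1)
df = pd.DataFrame({'prcp': rng.random(100), 'stp': rng.random(100)})
target, max_lag = 'prcp', 12

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})

best_lags = lagged_correlation.idxmax()  # row index (= lag) of the strongest correlation per column
```

Note that the target column's own entry is its autocorrelation, which is exactly 1.0 at lag 0, so its best lag is always 0.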