How to get the correlation between two time series using Pandas

Question

Asked by user814005

I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data.

I've been playing with Pandas to try to do this. I've created two time series, and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they're generally off by seconds), I get Null as an answer. I could get a decent answer if I could:

a) Interpolate/fill missing times in each TimeSeries (I know this is possible in Pandas, I just don't know how to do it)

b) Strip the seconds out of Python datetime objects (set seconds to 00, without changing minutes). I'd lose a degree of accuracy, but not a huge amount

c) Use something else in Pandas to get the correlation between two timeSeries

d) Use something in Python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking into account the time.

Anyone have any suggestions?

Answer 1

Accepted answer by Wes McKinney

You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don't occur at the same instants.

Use the values "as of" the times in one of the time series. Here's an example:

    In [15]: ts
    Out[15]: 
    2000-01-03 00:00:00    -0.722808451504
    2000-01-04 00:00:00    0.0125041039477
    2000-01-05 00:00:00    0.777515530539
    2000-01-06 00:00:00    -0.35714026263
    2000-01-07 00:00:00    -1.55213541118
    2000-01-10 00:00:00    -0.508166334892
    2000-01-11 00:00:00    0.58016097981
    2000-01-12 00:00:00    1.50766289013
    2000-01-13 00:00:00    -1.11114968643
    2000-01-14 00:00:00    0.259320239297



    In [16]: ts2
    Out[16]: 
    2000-01-03 00:00:30    1.05595278907
    2000-01-04 00:00:30    -0.568961755792
    2000-01-05 00:00:30    0.660511172645
    2000-01-06 00:00:30    -0.0327384421979
    2000-01-07 00:00:30    0.158094407533
    2000-01-10 00:00:30    -0.321679671377
    2000-01-11 00:00:30    0.977286027619
    2000-01-12 00:00:30    -0.603541295894
    2000-01-13 00:00:30    1.15993249209
    2000-01-14 00:00:30    -0.229379534767

You can see these are off by 30 seconds. The reindex function enables you to align data while filling values forward (getting the "as of" value):

    In [17]: ts.reindex(ts2.index, method='pad')
    Out[17]: 
    2000-01-03 00:00:30    -0.722808451504
    2000-01-04 00:00:30    0.0125041039477
    2000-01-05 00:00:30    0.777515530539
    2000-01-06 00:00:30    -0.35714026263
    2000-01-07 00:00:30    -1.55213541118
    2000-01-10 00:00:30    -0.508166334892
    2000-01-11 00:00:30    0.58016097981
    2000-01-12 00:00:30    1.50766289013
    2000-01-13 00:00:30    -1.11114968643
    2000-01-14 00:00:30    0.259320239297

    In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
    Out[18]: -0.31004148593302283

Note that 'pad' is also aliased by 'ffill' (but only in the very latest version of pandas on GitHub as of this time!).
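For readers who want to reproduce the "as of" alignment end to end, here is a minimal, self-contained sketch. The random data, the seed, and the variable names unaligned, ts_asof and aligned are assumptions made for illustration; only reindex with a fill method and Series.corr come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two business-day series whose timestamps differ by 30 seconds.
rng = np.random.default_rng(0)
idx_a = pd.date_range("2000-01-03", periods=10, freq="B")
idx_b = idx_a + pd.Timedelta(seconds=30)
ts = pd.Series(rng.standard_normal(10), index=idx_a)
ts2 = pd.Series(rng.standard_normal(10), index=idx_b)

# corr() aligns on the index first; with no shared timestamps nothing is left to correlate.
unaligned = ts.corr(ts2)

# Take the most recent ts value "as of" each ts2 timestamp, then correlate.
ts_asof = ts.reindex(ts2.index, method="ffill")
aligned = ts2.corr(ts_asof)

print(unaligned, aligned)
```

In this sketch unaligned comes out as NaN, which matches the symptom described in the question, while aligned is an ordinary correlation coefficient.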

Strip seconds out of all your datetimes. The best way to do this is to use rename:

    In [25]: ts2.rename(lambda date: date.replace(second=0))
    Out[25]: 
    2000-01-03 00:00:00    1.05595278907
    2000-01-04 00:00:00    -0.568961755792
    2000-01-05 00:00:00    0.660511172645
    2000-01-06 00:00:00    -0.0327384421979
    2000-01-07 00:00:00    0.158094407533
    2000-01-10 00:00:00    -0.321679671377
    2000-01-11 00:00:00    0.977286027619
    2000-01-12 00:00:00    -0.603541295894
    2000-01-13 00:00:00    1.15993249209
    2000-01-14 00:00:00    -0.229379534767

Note that if rename causes there to be duplicate dates, an Exception will be thrown.
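As a rough sketch of the same idea, the seconds can also be removed by flooring the index to the minute, which is a swapped-in technique rather than the rename call shown above (floor also clears sub-second parts). The data and the name ts2_floored are hypothetical. The explicit uniqueness check is an assumption on my part: depending on the pandas version, duplicate labels may be accepted silently rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ts2 is stamped 30 seconds after ts on each business day.
idx = pd.date_range("2000-01-03", periods=10, freq="B")
ts = pd.Series(np.arange(10.0), index=idx)
ts2 = pd.Series(np.arange(10.0) ** 2, index=idx + pd.Timedelta(seconds=30))

# Round every timestamp down to the start of its minute (seconds become 00).
ts2_floored = ts2.copy()
ts2_floored.index = ts2.index.floor("min")

# Guard against two readings collapsing onto the same minute.
if not ts2_floored.index.is_unique:
    raise ValueError("stripping seconds produced duplicate timestamps")

# The indexes now match, so corr() has overlapping points to work with.
corr = ts.corr(ts2_floored)
print(corr)
```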

For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second):

    In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()

    In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()

    In [33]: ts_mean.corr(ts2_mean)
    Out[33]: -0.31004148593302283

These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object as above, try .agg(np.mean).
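The per-minute averaging can be sketched with several readings per minute, which is the case where grouping actually changes the data. Everything below is illustrative: the sampling intervals, the random values, and the names a, b, a_mean and b_mean are assumptions, and the group key uses index.floor("min") instead of the lambda from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: series a is sampled every 20 s, series b every 15 s,
# both covering the same 30 minutes but never sharing a timestamp.
rng = np.random.default_rng(1)
idx_a = pd.date_range("2000-01-03 00:00:05", periods=90, freq=pd.Timedelta(seconds=20))
idx_b = pd.date_range("2000-01-03 00:00:00", periods=120, freq=pd.Timedelta(seconds=15))
a = pd.Series(rng.standard_normal(90), index=idx_a)
b = pd.Series(rng.standard_normal(120), index=idx_b)

# Group readings by the minute they fall in and average each group.
a_mean = a.groupby(a.index.floor("min")).mean()
b_mean = b.groupby(b.index.floor("min")).mean()

# Both results are indexed by minute, so they align for the correlation.
corr = a_mean.corr(b_mean)
print(len(a_mean), len(b_mean), corr)
```

Grouping by the floored index avoids creating empty bins for minutes with no readings, which matters when the data has long gaps such as nights or weekends.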

Hope this helps!
