Time series correlation with pandas
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/48727450/
Asked by Davide
I have some particulate matter sensors and CSVs with time series like:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
2017-11-30 00:07:00 30/11/17 0.07 53
2017-11-30 00:08:00 30/11/17 0.08 55
2017-11-30 00:10:00 30/11/17 0.10 55
2017-11-30 00:12:00 30/11/17 0.12 58
2017-11-30 00:13:00 30/11/17 0.13 57
2017-11-30 00:15:00 30/11/17 0.15 58
....
2018-02-06 09:30:00 6/2/18 9.30 33
2018-02-06 09:32:00 6/2/18 9.32 31
2018-02-06 09:33:00 6/2/18 9.33 34
2018-02-06 09:35:00 6/2/18 9.35 32
2018-02-06 09:37:00 6/2/18 9.37 33
2018-02-06 09:38:00 6/2/18 9.38 30
I set date as index with:
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')
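For completeness, the loading step looks roughly like this (the file name sensor_a.csv is just illustrative; the CSV is assumed to have 'date' and 'value' columns):
import pandas as pd

df = pd.read_csv('sensor_a.csv')  # illustrative file name
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')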
I would like to correlate different time windows: data from the same sensor on different days, AND data from different sensors over similar time windows. I want to know whether I have the same increase/decrease behaviour in some parts of the day, across days. After setting the date index I'm able to get "all PM values from 9am to 10am, every day, from sensor A":
df.between_time('9:00','10:00')
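And to look at that window day by day, a rough sketch (assuming df as above):
# Split the 9-10am window into one DataFrame per calendar day
morning = df.between_time('9:00', '10:00')
per_day = {day: grp for day, grp in morning.groupby(morning.index.date)}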
1) Problem 1: how to check correlation for the same sensor on different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:
01-01-2018 (df01 - I removed the date column)
2018-01-01 09:05:00 11
2018-01-01 09:07:00 11
2018-01-01 09:09:00 10
....
02-01-2018 (df02)
2018-02-01 09:05:00 67
2018-02-01 09:07:00 68
2018-02-01 09:08:00 67
....
Should I rename the data column? What actually matters to me is that the third value from 01/01/2018 correlates with the third value from the second window.
df01.corr(df02)
returns
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
2) Problem 2: correlate between different sensors. In this case I have 2 CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from them. Even in this case I expect some "casual lag" between the data, but offsets of a few minutes are fine; I just want to check values 'at the right position'. Example:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
Sensor B:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 1
2017-11-30 00:02:00 30/11/17 0.02 40
2017-11-30 00:04:00 30/11/17 0.03 11
2017-11-30 00:05:00 30/11/17 0.05 57
AxB
date valueA valueB
date
2017-11-30 00:00:00 30/11/17 0.00 49 1
2017-11-30 00:02:00 30/11/17 0.02 51 40
2017-11-30 00:03:00 30/11/17 0.03 54 11
2017-11-30 00:05:00 30/11/17 0.05 57 57
Thank you in advance
Answered by Peter Leimbigler
I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys, rather than only on exact keys.
Example data
df1
date value
30/11/17 0.00 51
30/11/17 0.02 53
30/11/17 0.05 65
30/11/17 0.08 58
df2
date value
30/11/17 0.01 61
30/11/17 0.02 63
30/11/17 0.04 65
30/11/17 0.07 68
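(To make the steps below reproducible, these two frames can be constructed directly; a minimal sketch:)
import pandas as pd

df1 = pd.DataFrame({'date': ['30/11/17 0.00', '30/11/17 0.02', '30/11/17 0.05', '30/11/17 0.08'],
                    'value': [51, 53, 65, 58]})
df2 = pd.DataFrame({'date': ['30/11/17 0.01', '30/11/17 0.02', '30/11/17 0.04', '30/11/17 0.07'],
                    'value': [61, 63, 65, 68]})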
Preprocessing
df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)
df1
value
date
2017-11-30 00:00:00 51
2017-11-30 00:02:00 53
2017-11-30 00:05:00 65
2017-11-30 00:08:00 58
df2
value
date
2017-11-30 00:01:00 61
2017-11-30 00:02:00 63
2017-11-30 00:04:00 65
2017-11-30 00:07:00 68
Merge DataFrames on nearest index match
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
value_x value_y
date
2017-11-30 00:00:00 51 61
2017-11-30 00:02:00 53 63
2017-11-30 00:05:00 65 65
2017-11-30 00:08:00 58 68
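If you want to avoid pairing readings that are too far apart in time, merge_asof also accepts a tolerance argument; a sketch (the 2-minute cutoff is just an example):
# Rows in df1 with no df2 timestamp within 2 minutes get NaN
# instead of a spurious nearest match
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                       direction='nearest', tolerance=pd.Timedelta('2min'))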
Correlations
Note that df.corr() doesn't accept data as an argument, so df1.corr(df2) doesn't work: the DataFrame you pass in is interpreted as the method argument, and evaluating it in a boolean check raises the ambiguous-truth-value error you saw. The corr method computes pairwise correlation of the columns in the DataFrame you call it on (docs).
merged.corr()
value_x value_y
value_x 1.000000 0.612873
value_y 0.612873 1.000000
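If you just want the single coefficient, note that Series.corr (unlike DataFrame.corr) does accept another Series:
merged['value_x'].corr(merged['value_y'])  # 0.612873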
Notes
The above usage of pd.merge_asof keeps the index of df1; each row in df1 receives its closest match in df2, with replacement, so if df2 ever has fewer rows than df1, the result of merge_asof will contain duplicate values from df2. And the result will have the same number of rows as df1.
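If a nearest match is not essential, an alternative sketch is to resample both frames onto a regular grid and then join them (the 2-minute bin is an assumption):
# Average readings into 2-minute buckets, align the buckets, correlate
a = df1.resample('2min').mean()
b = df2.resample('2min').mean()
a.join(b, lsuffix='_A', rsuffix='_B').corr()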
You mentioned that you really only care to compare rows by relative position, e.g., compare the 3rd value of df1 to the 3rd value of df2. Instead of using merge_asof, you could simply ignore the time index once you've used it to obtain the time periods of interest, and access the underlying numpy arrays with df.values:
# Get a 2D array of shape (4, 1)
df1.values
array([[51],
[53],
[65],
[58]])
# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])
# numpy correlation matrix (pd.np is deprecated; import numpy directly)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1. , 0.61287265],
[0.61287265, 1. ]])
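Finally, a possible sketch for Problem 1, assuming df is the single-sensor frame from the question (DatetimeIndex plus a 'value' column): take the 9-10am window from two different days and correlate by position, truncating to the shorter window:
import numpy as np

w1 = df.loc['2018-01-01'].between_time('9:00', '10:00')['value'].to_numpy()
w2 = df.loc['2018-01-02'].between_time('9:00', '10:00')['value'].to_numpy()
n = min(len(w1), len(w2))  # the two windows may differ in length
np.corrcoef(w1[:n], w2[:n])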