Time series correlation with pandas
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/48727450/
Asked by Davide
I have some particulate matter sensors and CSVs with time series like:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
2017-11-30 00:07:00 30/11/17 0.07 53
2017-11-30 00:08:00 30/11/17 0.08 55
2017-11-30 00:10:00 30/11/17 0.10 55
2017-11-30 00:12:00 30/11/17 0.12 58
2017-11-30 00:13:00 30/11/17 0.13 57
2017-11-30 00:15:00 30/11/17 0.15 58
....
2018-02-06 09:30:00 6/2/18 9.30 33
2018-02-06 09:32:00 6/2/18 9.32 31
2018-02-06 09:33:00 6/2/18 9.33 34
2018-02-06 09:35:00 6/2/18 9.35 32
2018-02-06 09:37:00 6/2/18 9.37 33
2018-02-06 09:38:00 6/2/18 9.38 30
I set date as index with:
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')
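For completeness, the loading step looks roughly like this (the file name sensor_a.csv is just illustrative; the CSV is assumed to have 'date' and 'value' columns):
import pandas as pd

df = pd.read_csv('sensor_a.csv')  # illustrative file name
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')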
I would like to correlate different time windows: data from the same sensor on different days, AND data from different sensors over similar time windows. I want to know whether I have the same increase/decrease behaviour in some parts of the day, across days. After setting the date index I'm able to get "all PM values from 9am to 10am, every day, from sensor A":
df.between_time('9:00','10:00')
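And to look at that window day by day, a rough sketch (assuming df as above):
# Split the 9-10am window into one DataFrame per calendar day
morning = df.between_time('9:00', '10:00')
per_day = {day: grp for day, grp in morning.groupby(morning.index.date)}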
1) Problem 1: how to check correlation for the same sensor on different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:
01-01-2018 (df01 - I removed the date column)
2018-01-01 09:05:00 11
2018-01-01 09:07:00 11
2018-01-01 09:09:00 10
....
02-01-2018 (df02)
2018-02-01 09:05:00 67
2018-02-01 09:07:00 68
2018-02-01 09:08:00 67
....
Should I rename the data column? What actually matters to me is that the third value from 01/01/2018 correlates with the third value from the second window.
df01.corr(df02)
returns
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
2) Problem 2: correlate between different sensors. In this case I have 2 CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from them. Even in this case I expect some "casual lag" between the data, but offsets of a few minutes are fine; I just want to check values 'at the right position'. Example:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
Sensor B:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 1
2017-11-30 00:02:00 30/11/17 0.02 40
2017-11-30 00:04:00 30/11/17 0.03 11
2017-11-30 00:05:00 30/11/17 0.05 57
AxB
date valueA valueB
date
2017-11-30 00:00:00 30/11/17 0.00 49 1
2017-11-30 00:02:00 30/11/17 0.02 51 40
2017-11-30 00:03:00 30/11/17 0.03 54 11
2017-11-30 00:05:00 30/11/17 0.05 57 57
Thank you in advance
Answered by Peter Leimbigler
I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys, rather than only on exact keys.
Example data
df1
date value
30/11/17 0.00 51
30/11/17 0.02 53
30/11/17 0.05 65
30/11/17 0.08 58
df2
date value
30/11/17 0.01 61
30/11/17 0.02 63
30/11/17 0.04 65
30/11/17 0.07 68
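(To make the steps below reproducible, these two frames can be constructed directly; a minimal sketch:)
import pandas as pd

df1 = pd.DataFrame({'date': ['30/11/17 0.00', '30/11/17 0.02', '30/11/17 0.05', '30/11/17 0.08'],
                    'value': [51, 53, 65, 58]})
df2 = pd.DataFrame({'date': ['30/11/17 0.01', '30/11/17 0.02', '30/11/17 0.04', '30/11/17 0.07'],
                    'value': [61, 63, 65, 68]})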
Preprocessing
df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)
df1
value
date
2017-11-30 00:00:00 51
2017-11-30 00:02:00 53
2017-11-30 00:05:00 65
2017-11-30 00:08:00 58
df2
value
date
2017-11-30 00:01:00 61
2017-11-30 00:02:00 63
2017-11-30 00:04:00 65
2017-11-30 00:07:00 68
Merge DataFrames on nearest index match
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
value_x value_y
date
2017-11-30 00:00:00 51 61
2017-11-30 00:02:00 53 63
2017-11-30 00:05:00 65 65
2017-11-30 00:08:00 58 68
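If you want to avoid pairing readings that are too far apart in time, merge_asof also accepts a tolerance argument; a sketch (the 2-minute cutoff is just an example):
# Rows in df1 with no df2 timestamp within 2 minutes get NaN
# instead of a spurious nearest match
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                       direction='nearest', tolerance=pd.Timedelta('2min'))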
Correlations
Note that df.corr() doesn't accept data as an argument, so df1.corr(df2) doesn't work: the DataFrame you pass in is interpreted as the method argument, and evaluating it in a boolean check raises the ambiguous-truth-value error you saw. The corr method computes pairwise correlation of the columns in the DataFrame you call it on (docs).
merged.corr()
value_x value_y
value_x 1.000000 0.612873
value_y 0.612873 1.000000
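If you just want the single coefficient, note that Series.corr (unlike DataFrame.corr) does accept another Series:
merged['value_x'].corr(merged['value_y'])  # 0.612873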
Notes
The above usage of pd.merge_asof keeps the index of df1; each row in df1 receives its closest match in df2, with replacement, so if df2 ever has fewer rows than df1, the result of merge_asof will contain duplicate values from df2. And the result will have the same number of rows as df1.
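If a nearest match is not essential, an alternative sketch is to resample both frames onto a regular grid and then join them (the 2-minute bin is an assumption):
# Average readings into 2-minute buckets, align the buckets, correlate
a = df1.resample('2min').mean()
b = df2.resample('2min').mean()
a.join(b, lsuffix='_A', rsuffix='_B').corr()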
You mentioned that you really only care to compare rows by relative position, e.g., compare the 3rd value of df1 to the 3rd value of df2. Instead of using merge_asof, you could simply ignore the time index once you've used it to obtain the time periods of interest, and access the underlying numpy arrays with df.values:
# Get a 2D array of shape (4, 1)
df1.values
array([[51],
[53],
[65],
[58]])
# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])
# numpy correlation matrix (pd.np is deprecated; import numpy directly)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1. , 0.61287265],
[0.61287265, 1. ]])
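Finally, a possible sketch for Problem 1, assuming df is the single-sensor frame from the question (DatetimeIndex plus a 'value' column): take the 9-10am window from two different days and correlate by position, truncating to the shorter window:
import numpy as np

w1 = df.loc['2018-01-01'].between_time('9:00', '10:00')['value'].to_numpy()
w2 = df.loc['2018-01-02'].between_time('9:00', '10:00')['value'].to_numpy()
n = min(len(w1), len(w2))  # the two windows may differ in length
np.corrcoef(w1[:n], w2[:n])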