pandas.merge: match the nearest time stamp >= the series of timestamps

Question

Asked by Tom Bennett

I have two dataframes, both of which contain an irregularly spaced, millisecond-resolution timestamp column. My goal is to match up the rows so that, for each matched row, 1) the first timestamp is always less than or equal to the second timestamp, and 2) the matched timestamps are the closest among all pairs of timestamps satisfying 1).

Is there any way to do this with pandas.merge?

Answer 1

Answered by HYRY

merge() can't do this kind of join, but you can use searchsorted():

Create some random timestamps, t1 and t2, both sorted in ascending order:

import pandas as pd
import numpy as np
np.random.seed(0)

base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")

a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
t1 = base + a
t1.sort()

b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
t2 = base + b
t2.sort()

Call searchsorted() to find, for every value in t2, the index of the preceding value in t1:

idx = np.searchsorted(t1, t2) - 1
mask = idx >= 0

df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})

Here is the output:

                         t1                         t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000

To view this result as a plot:

import pylab as pl
pl.figure(figsize=(18, 4))
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
pl.margins(0.02)

In the resulting plot, the green lines are t1, the blue lines are t2, and the red lines are the values selected from t1 for each t2.
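
One boundary detail is worth checking against the question, which asks for the first timestamp to be less than or equal to the second. With the default side="left", searchsorted() followed by - 1 selects the last t1 value strictly less than each t2 value. The sketch below, using small hand-made timestamps that are not from the answer, suggests that passing side="right" would also admit exact matches:

```python
import numpy as np
import pandas as pd

# Hand-made illustrative timestamps (assumed values, not from the answer above)
t1 = np.array(["2013-01-01 00:00:00.100",
               "2013-01-01 00:00:00.250",
               "2013-01-01 00:00:00.400"], dtype="datetime64[ns]")
t2 = np.array(["2013-01-01 00:00:00.050",
               "2013-01-01 00:00:00.250",
               "2013-01-01 00:00:00.300"], dtype="datetime64[ns]")

# side="left" (the default): index of the last t1 value strictly below each t2 value
idx_left = np.searchsorted(t1, t2, side="left") - 1

# side="right": index of the last t1 value less than or equal to each t2 value
idx_right = np.searchsorted(t1, t2, side="right") - 1

# An index of -1 means no t1 value precedes that t2 value, so drop those rows
mask = idx_right >= 0
matched = pd.DataFrame({"t1": t1[idx_right[mask]], "t2": t2[mask]})
print(matched)
```

Here the t2 value at .250 pairs with the equal t1 value under side="right", whereas the default would pair it with the earlier t1 value at .100.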

Answer 2

Answered by cdarlint

Pandas now has the function merge_asof, doing exactly what was described in the accepted answer.
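
A minimal sketch of how merge_asof could be applied to this problem is shown below. The frame names, the column names t1, t2, a and b, and the timestamp values are all assumptions made for illustration. Both frames must already be sorted by their key columns.

```python
import pandas as pd

# Hypothetical frames; names and values are illustrative only
left = pd.DataFrame({
    "t2": pd.to_datetime(["2013-01-01 00:00:00.050",
                          "2013-01-01 00:00:00.250",
                          "2013-01-01 00:00:00.300"]),
    "b": [1, 2, 3],
})
right = pd.DataFrame({
    "t1": pd.to_datetime(["2013-01-01 00:00:00.100",
                          "2013-01-01 00:00:00.250",
                          "2013-01-01 00:00:00.400"]),
    "a": [10, 20, 30],
})

# direction="backward" (the default) picks, for each row of `left`,
# the last row of `right` whose key is <= the left key
result = pd.merge_asof(left, right, left_on="t2", right_on="t1",
                       direction="backward")

# Rows of `left` with no earlier t1 get missing values; drop them if unwanted
matched = result.dropna(subset=["t1"])
print(matched)
```

Exact matches are included by default; the allow_exact_matches=False argument would exclude them, and direction="forward" or direction="nearest" changes which neighbour is chosen.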

Answer 3

Answered by Yaron

I took a different approach from HYRY:

do a regular merge with an outer join (how='outer');
sort it by date;
use fillna(method='pad') on just the columns you need, so that each missing value takes the value from the previous filled row;
drop all the rows from the outer join that you don't need.

All this can be written in a few lines:

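# Outer join keeps every 'Date' value from both frames
df = pd.merge(df0, df1, on='Date', how='outer')
# df.sort() was removed from pandas; sort_values() replaces it
df = df.sort_values('Date')
# Forward-fill only the columns that came from df1
headertofill = list(df1.columns)
df[headertofill] = df[headertofill].ffill()
# Keep only rows that originated in df0 (var_from_df0_only is a column found only in df0)
df = df[df[var_from_df0_only].notna()]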
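
To make the steps concrete, here is a small runnable sketch of the same idea on invented data. The frames df0 and df1, the columns x and y, and the dates are assumptions for illustration, and the current pandas spellings sort_values() and ffill() are used.

```python
import pandas as pd

# Invented sample data: x exists only in df0, y only in df1
df0 = pd.DataFrame({"Date": pd.to_datetime(["2013-01-02", "2013-01-05"]),
                    "x": [1.0, 2.0]})
df1 = pd.DataFrame({"Date": pd.to_datetime(["2013-01-01", "2013-01-04",
                                            "2013-01-06"]),
                    "y": [10.0, 20.0, 30.0]})

# 1) outer merge, 2) sort by date
df = pd.merge(df0, df1, on="Date", how="outer").sort_values("Date")

# 3) forward-fill only the columns contributed by df1
cols_to_fill = [c for c in df1.columns if c != "Date"]
df[cols_to_fill] = df[cols_to_fill].ffill()

# 4) drop the rows that came only from df1 (x is missing there)
df = df[df["x"].notna()].reset_index(drop=True)
print(df)
```

Each remaining row carries the y value from the most recent df1 date at or before its own date.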

Answer 4

Answered by George Sovetov

Here is a simpler and more general method.

# data and signal are what we want to merge;
# signal is assumed to be indexed by the same keys, in sorted order
keys = ['channel', 'timestamp']  # Could be simply ['timestamp']
index = data[keys].set_index(keys).index  # Make index from columns to merge on
padded = signal.reindex(index, method='pad')  # Key step -- reindex with filling
joined = data.join(padded, on=keys)  # Join to data if needed
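
As a rough illustration of the single-key case only, the sketch below applies the same reindex-then-join pattern to invented data. The column names value and level and all timestamps are assumptions, and signal is assumed to have a sorted, unique timestamp index, which reindex(method='pad') requires.

```python
import pandas as pd

# Invented data: `data` holds the rows to enrich, `signal` is indexed by timestamp
data = pd.DataFrame({
    "timestamp": pd.to_datetime(["2013-01-01 00:00:00.050",
                                 "2013-01-01 00:00:00.250",
                                 "2013-01-01 00:00:00.300"]),
    "value": [1, 2, 3],
})
signal = pd.DataFrame(
    {"level": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(["2013-01-01 00:00:00.100",
                          "2013-01-01 00:00:00.250",
                          "2013-01-01 00:00:00.400"]).rename("timestamp"),
)

# Build the target index from the column we merge on
index = data.set_index("timestamp").index

# Key step: for each target timestamp take the last signal row at or before it
padded = signal.reindex(index, method="pad")

# Attach the padded signal values to the original rows
joined = data.join(padded, on="timestamp")
print(joined)
```

The first row has no earlier signal value, so its level stays missing; the other two rows both pick up the signal value at .250.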
