pandas merging based on a timestamp which do not match exactly

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/34880539/
Asked by trench
What methods are available to merge columns which have timestamps that do not exactly match?
DF1:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:13 7261824 871631182
DF2:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:37 7261824 871631182
I could join on the ['date', 'employee_id', 'session_id'], but sometimes the same employee will have multiple identical sessions on the same date which causes duplicates. I could drop the rows where this takes place, but I would lose valid sessions if I did.
Is there an efficient way to join if the timestamp of DF1 is <5 minutes from the timestamp of DF2, and the session_id and employee_id also match? If there is a matching record, then the timestamp will always be slightly later than DF1 because an event is triggered at some future point.
['employee_id', 'session_id', 'timestamp<5minutes']
Edit: I assumed someone would have run into this issue before.
I was thinking of doing this:
- Take my timestamp on each dataframe
- Create a column which is the timestamp + 5 minutes (rounded)
- Create a column which is the timestamp - 5 minutes (rounded)
- Create a 10-minute interval string to join the files on
df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)
Does someone know how to round those 5 minute intervals to the nearest 5 minute mark?
02:59:37 - 5 min = 02:55:00
02:59:37 + 5 min = 03:05:00
interval_string = '02:55:00-03:05:00'
pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])
Does anyone know how to round the time like that? This seems like it could work. You still match based on the date, employee, and session, and then you look for times which fall within the same 10-minute interval or range.
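For the rounding step itself, pandas can round datetimes directly. A minimal sketch (using `Series.dt.round` and `pd.Timedelta`, which are not part of the original question) that reproduces the 02:55:00-03:05:00 example above:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2016-01-01 02:59:37"]))
# shift by 5 minutes, then round to the nearest 5-minute mark
low = (s - pd.Timedelta(minutes=5)).dt.round("5min")
high = (s + pd.Timedelta(minutes=5)).dt.round("5min")
interval_string = low.dt.strftime("%H:%M:%S") + "-" + high.dt.strftime("%H:%M:%S")
print(interval_string.iloc[0])  # 02:55:00-03:05:00
```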
Accepted answer by Igor Raush
Consider the following mini-version of your problem:
from io import StringIO
from pandas import read_csv, to_datetime
# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
which gives
>>> df1
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:03:00 7261824 871631183
2 2016-01-01 11:01:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
>>> df2
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:10:00 7261824 871631185
You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.
Solution 1: Interval Matching
This is essentially what you are suggesting in your edit. You want to map timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.
Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:
import numpy as np
# threshold (half the 10-minute interval width) in nanoseconds
threshold_ns = threshold * 60 * 1e9
# compute the "interval" to which each session belongs by rounding
# each timestamp to the nearest multiple of the threshold
df1['interval'] = to_datetime((np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns).astype(np.int64))
df2['interval'] = to_datetime((np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns).astype(np.int64))
# join on the rounded interval plus the exact-match keys
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
interval employee_id session_id
0 2016-01-01 02:05:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:00:00 7261824 871631184
3 2016-01-01 14:00:00 7261824 871631185
4 2016-01-01 11:05:00 7261824 871631184
5 2016-01-01 14:10:00 7261824 871631185
Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they were on different sides of the interval boundary.
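To see the boundary effect concretely (a sketch using `Timestamp.round`, not part of the original answer): 11:01:00 rounds down to 11:00:00 while 11:04:00 rounds up to 11:05:00, so the two sessions get different interval keys even though they are only 3 minutes apart.

```python
import pandas as pd

# the two timestamps land on different sides of the 11:02:30 midpoint
t1 = pd.Timestamp("2016-01-01 11:01:00").round("5min")
t2 = pd.Timestamp("2016-01-01 11:04:00").round("5min")
print(t1, t2)  # 2016-01-01 11:00:00  2016-01-01 11:05:00
```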
Solution 2: One-to-one matching
Here is another approach which depends on the condition that sessions in df1 have either zero or one duplicates in df2.
We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.
from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]
    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items within the threshold
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)
# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
4 2016-01-01 14:10:00 7261824 871631185
This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but it will still take a long time on large datasets.
Answer by osonuyi
I would try using pandas.merge_asof for this:
The parameters of interest for you would be direction, tolerance, left_on, and right_on.
Building off @Igor's answer:
import pandas as pd
from pandas import read_csv
from io import StringIO
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1, right=df2, right_index=True, left_index=True, direction='nearest', tolerance=tol)
Output:
date_start_time date_start_time_x employee_id_x session_id_x date_start_time_y employee_id_y session_id_y
2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182 2016-01-01 02:03:00 7261824.0 871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183 2016-01-01 06:05:00 7261824.0 871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184 2016-01-01 11:04:00 7261824.0 871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185 NaT NaN NaN
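As a further sketch (not in the original answer), merge_asof also accepts a by= argument, which forces an exact match on the employee and session keys before the nearest-timestamp match is applied; this addresses the original requirement directly. The df2_time marker column below is hypothetical, added only to make visible which df2 row (if any) was matched:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:03:00",
         "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": [7261824] * 4,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:05:00",
         "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": [7261824] * 4,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})
# hypothetical marker column so matched rows are visible in the result
df2["df2_time"] = df2["date_start_time"]

# both frames must be sorted on the "on" key for merge_asof
merged = pd.merge_asof(
    df1.sort_values("date_start_time"),
    df2.sort_values("date_start_time"),
    on="date_start_time",
    by=["employee_id", "session_id"],      # these keys must match exactly
    direction="nearest",
    tolerance=pd.Timedelta("5 minutes"),
)
print(merged)  # the 14:01 row gets NaT: its only candidate is 9 minutes away
```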