pandas merging based on a timestamp which do not match exactly

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/34880539/
Asked by trench
What methods are available to merge columns which have timestamps that do not exactly match?
DF1:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:13 7261824 871631182
DF2:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:37 7261824 871631182
I could join on the ['date', 'employee_id', 'session_id'], but sometimes the same employee will have multiple identical sessions on the same date which causes duplicates. I could drop the rows where this takes place, but I would lose valid sessions if I did.
Is there an efficient way to join if the timestamp of DF1 is <5 minutes from the timestamp of DF2, and the session_id and employee_id also match? If there is a matching record, then the timestamp will always be slightly later than DF1 because an event is triggered at some future point.
['employee_id', 'session_id', 'timestamp<5minutes']
Edit: I assumed someone would have run into this issue before.
I was thinking of doing this:
- Take my timestamp on each dataframe
- Create a column which is the timestamp + 5 minutes (rounded)
- Create a column which is the timestamp - 5 minutes (rounded)
- Create a 10-minute interval string to join the files on
df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)
Does someone know how to round those 5 minute intervals to the nearest 5 minute mark?
02:59:37 - 5 min = 02:55:00
02:59:37 + 5 min = 03:05:00
interval_string = '02:55:00-03:05:00'
pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])
Does anyone know how to round the time like that? This seems like it could work. You still match based on the date, employee, and session, and then you look for times which fall within the same 10-minute interval or range.
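For the rounding step itself, pandas can round datetimes directly. A minimal sketch (using `Series.dt.round` and `pd.Timedelta`, which are not part of the original question) that reproduces the 02:55:00-03:05:00 example above:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2016-01-01 02:59:37"]))
# shift by 5 minutes, then round to the nearest 5-minute mark
low = (s - pd.Timedelta(minutes=5)).dt.round("5min")
high = (s + pd.Timedelta(minutes=5)).dt.round("5min")
interval_string = low.dt.strftime("%H:%M:%S") + "-" + high.dt.strftime("%H:%M:%S")
print(interval_string.iloc[0])  # 02:55:00-03:05:00
```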
Accepted answer by Igor Raush
Consider the following mini-version of your problem:
from io import StringIO
from pandas import read_csv, to_datetime
# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
which gives
>>> df1
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:03:00 7261824 871631183
2 2016-01-01 11:01:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
>>> df2
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:10:00 7261824 871631185
You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.
Solution 1: Interval Matching
This is essentially what you are suggesting in your edit. You want to map timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.
Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:
import numpy as np
# threshold (half the 10-minute interval width) in nanoseconds
threshold_ns = threshold * 60 * 1e9
# compute the "interval" to which each session belongs by rounding
# each timestamp to the nearest multiple of the threshold
df1['interval'] = to_datetime((np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns).astype(np.int64))
df2['interval'] = to_datetime((np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns).astype(np.int64))
# join on the rounded interval plus the exact-match keys
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
interval employee_id session_id
0 2016-01-01 02:05:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:00:00 7261824 871631184
3 2016-01-01 14:00:00 7261824 871631185
4 2016-01-01 11:05:00 7261824 871631184
5 2016-01-01 14:10:00 7261824 871631185
Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they were on different sides of the interval boundary.
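To see the boundary effect concretely (a sketch using `Timestamp.round`, not part of the original answer): 11:01:00 rounds down to 11:00:00 while 11:04:00 rounds up to 11:05:00, so the two sessions get different interval keys even though they are only 3 minutes apart.

```python
import pandas as pd

# the two timestamps land on different sides of the 11:02:30 midpoint
t1 = pd.Timestamp("2016-01-01 11:01:00").round("5min")
t2 = pd.Timestamp("2016-01-01 11:04:00").round("5min")
print(t1, t2)  # 2016-01-01 11:00:00  2016-01-01 11:05:00
```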
Solution 2: One-to-one matching
Here is another approach which depends on the condition that sessions in df1 have either zero or one duplicates in df2.
We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.
from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]
    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items within the threshold
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)
# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
4 2016-01-01 14:10:00 7261824 871631185
This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but it will still take a long time on large datasets.
Answer by osonuyi
I would try using pandas.merge_asof for this:
The parameters of interest for you would be direction, tolerance, left_on, and right_on.
Building off @Igor's answer:
import pandas as pd
from pandas import read_csv
from io import StringIO
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1, right=df2, right_index=True, left_index=True, direction='nearest', tolerance=tol)
Output:
date_start_time date_start_time_x employee_id_x session_id_x date_start_time_y employee_id_y session_id_y
2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182 2016-01-01 02:03:00 7261824.0 871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183 2016-01-01 06:05:00 7261824.0 871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184 2016-01-01 11:04:00 7261824.0 871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185 NaT NaN NaN
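As a further sketch (not in the original answer), merge_asof also accepts a by= argument, which forces an exact match on the employee and session keys before the nearest-timestamp match is applied; this addresses the original requirement directly. The df2_time marker column below is hypothetical, added only to make visible which df2 row (if any) was matched:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:03:00",
         "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": [7261824] * 4,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:05:00",
         "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": [7261824] * 4,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})
# hypothetical marker column so matched rows are visible in the result
df2["df2_time"] = df2["date_start_time"]

# both frames must be sorted on the "on" key for merge_asof
merged = pd.merge_asof(
    df1.sort_values("date_start_time"),
    df2.sort_values("date_start_time"),
    on="date_start_time",
    by=["employee_id", "session_id"],      # these keys must match exactly
    direction="nearest",
    tolerance=pd.Timedelta("5 minutes"),
)
print(merged)  # the 14:01 row gets NaT: its only candidate is 9 minutes away
```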