pandas: find date range overlap in Python
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/42462218/
Find date range overlap in Python
Asked by Edouard
I am trying to find a more efficient way of finding overlapping date ranges (start/end dates provided per row) in a dataframe, grouped by a specific column (id).
The dataframe is sorted on the 'from' column.
I think there must be a way to avoid the "double" apply that I used...
import pandas as pd
from datetime import datetime
df = pd.DataFrame(columns=['id','from','to'], index=range(5),
                  data=[[878,'2006-01-01','2007-10-01'],
                        [878,'2007-10-02','2008-12-01'],
                        [878,'2008-12-02','2010-04-03'],
                        [879,'2010-04-04','2199-05-11'],
                        [879,'2016-05-12','2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
id from to
0 878 2006-01-01 2007-10-01
1 878 2007-10-02 2008-12-01
2 878 2008-12-02 2010-04-03
3 879 2010-04-04 2199-05-11
4 879 2016-05-12 2199-12-31
I used "apply" to loop over all groups and, within each group, "apply" again per row:
def check_date_by_id(df):
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()
    def check_date_by_row(x):
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x
    return df.apply(check_date_by_row, axis=1).drop(['prevFrom','prevTo'], axis=1)

df.groupby('id').apply(check_date_by_id)
id from to overlap
0 878 2006-01-01 2007-10-01 False
1 878 2007-10-02 2008-12-01 False
2 878 2008-12-02 2010-04-03 False
3 879 2010-04-04 2199-05-11 False
4 879 2016-05-12 2199-12-31 True
My code was inspired by the following links:
Answered by miradulo
You could just shift the 'to' column and perform a direct subtraction of the datetimes.
from datetime import timedelta
df['overlap'] = (df['to'].shift() - df['from']) > timedelta(0)
Applying this while grouping by 'id' may look like this:
df['overlap'] = (df.groupby('id')
                   .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
                   .reset_index(level=0, drop=True))
Demo
>>> df
id from to
0 878 2006-01-01 2007-10-01
1 878 2007-10-02 2008-12-01
2 878 2008-12-02 2010-04-03
3 879 2010-04-04 2199-05-11
4 879 2016-05-12 2199-12-31
>>> df['overlap'] = (df.groupby('id')
...                    .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
...                    .reset_index(level=0, drop=True))
>>> df
id from to overlap
0 878 2006-01-01 2007-10-01 False
1 878 2007-10-02 2008-12-01 False
2 878 2008-12-02 2010-04-03 False
3 879 2010-04-04 2199-05-11 False
4 879 2016-05-12 2199-12-31 True
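The per-group shift can also be written without apply. The following is a minimal sketch, not the answerer's code: GroupBy.shift moves 'to' down one row within each id, so the comparison is a single vectorized expression. The inclusive >= is an assumption chosen to match the question's day-count test (the answer above uses a strict >). Like the answer, it only compares each row with the row directly before it in the same id.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['id', 'from', 'to'],
    data=[[878, '2006-01-01', '2007-10-01'],
          [878, '2007-10-02', '2008-12-01'],
          [878, '2008-12-02', '2010-04-03'],
          [879, '2010-04-04', '2199-05-11'],
          [879, '2016-05-12', '2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])

# Previous row's 'to' within the same id; NaT for the first row of each id.
prev_to = df.groupby('id')['to'].shift()

# Comparisons against NaT evaluate to False, so first rows are never flagged.
df['overlap'] = prev_to >= df['from']
print(df)
```

On the sample data this flags only the last row, the same result as the demo above.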
Answered by Adam Zeldin
Another solution. This could be rewritten to leverage Interval.overlaps in pandas 0.24 and later.
# Note: this answer assumes the date columns are named 'start_date' and 'end_date'
def overlapping_groups(group):
    if len(group) > 1:
        for index, row in group.iterrows():
            for index2, row2 in group.drop(index).iterrows():
                int1 = pd.Interval(row2['start_date'], row2['end_date'], closed='both')
                if row['start_date'] in int1:
                    return row['id']
                if row['end_date'] in int1:
                    return row['id']

gcols = ['id']
group_output = df.groupby(gcols, group_keys=False).apply(overlapping_groups)
ids_with_overlap = set(group_output[~group_output.isnull()].reset_index(drop=True))
df[df['id'].isin(ids_with_overlap)]
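As a rough sketch of the Interval-based rewrite mentioned above (an illustration, not the answerer's code), IntervalIndex.overlaps can test each range against all ranges of the same id. The helper name flag_overlaps is hypothetical, and the question's 'from'/'to' column names are used instead of 'start_date'/'end_date'. Every interval overlaps itself, so a row is flagged only when more than one match is found.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['id', 'from', 'to'],
    data=[[878, '2006-01-01', '2007-10-01'],
          [878, '2007-10-02', '2008-12-01'],
          [878, '2008-12-02', '2010-04-03'],
          [879, '2010-04-04', '2199-05-11'],
          [879, '2016-05-12', '2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])

def flag_overlaps(group):
    """Return a boolean Series: True where a row overlaps another row of the group."""
    idx = pd.IntervalIndex.from_arrays(group['from'], group['to'], closed='both')
    # idx.overlaps(iv) is an elementwise boolean array; each interval matches itself once.
    return pd.Series([idx.overlaps(iv).sum() > 1 for iv in idx], index=group.index)

# Process each id separately, then restore the original row order.
flags = pd.concat([flag_overlaps(g) for _, g in df.groupby('id')]).sort_index()
df['overlap'] = flags
print(df)
```

Unlike the shift-based approach, this marks both rows of an overlapping pair, and it is quadratic in the group size, so it suits small groups best.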
Answered by farghal
You can sort by the 'from' column and then simply check whether each 'from' overlaps the previous 'to' using a rolling apply, which is very efficient.
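import numpy as np
# Convert the datetimes to integer nanoseconds so that rolling can operate on them
df['from'] = pd.DatetimeIndex(df['from']).astype(np.int64)
df['to'] = pd.DatetimeIndex(df['to']).astype(np.int64)
sdf = df.sort_values(by='from')
# raw=True passes each window as a NumPy array, so r[0] and r[1] are positional
sdf[["from", "to"]].stack().rolling(window=2).apply(lambda r: 1 if r[1] >= r[0] else 0, raw=True).unstack()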
The overlapping periods are now the rows where 'from' is 0.0:
from to
0 NaN 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 0.0 1.0
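Comparing each 'from' only with the immediately preceding 'to' can miss a range that overlaps an earlier, longer one. A possible variation, sketched here on hypothetical data and not taken from any of the answers, compares each 'from' with the running maximum of the earlier 'to' values within the same id:

```python
import pandas as pd

# Hypothetical data: the first range spans both later ranges.
df = pd.DataFrame({
    'id': [1, 1, 1],
    'from': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'to': pd.to_datetime(['2020-12-31', '2020-02-10', '2020-03-10']),
})
df = df.sort_values(['id', 'from'])

# Latest 'to' among the earlier rows of the same id (NaT for the first row).
prev_max_to = df.groupby('id')['to'].transform(lambda s: s.cummax().shift())

# A row overlaps some earlier range if it starts on or before that latest end.
df['overlap'] = prev_max_to >= df['from']
print(df)
```

Here the third row is flagged even though it does not overlap the second one, because it still falls inside the first range.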