pandas: finding overlapping date ranges in Python

Question

Asked by Edouard

I am trying to find a more efficient way to detect overlapping date ranges (start/end dates provided per row) in a dataframe, grouped by a specific column (id).

The dataframe is sorted on the 'from' column.

I think there is a way to avoid the "double" apply that I used...

import pandas as pd
from datetime import datetime

df = pd.DataFrame(columns=['id','from','to'], index=range(5), \
                  data=[[878,'2006-01-01','2007-10-01'],
                        [878,'2007-10-02','2008-12-01'],
                        [878,'2008-12-02','2010-04-03'],
                        [879,'2010-04-04','2199-05-11'],
                        [879,'2016-05-12','2199-12-31']])

df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])


    id  from        to
0   878 2006-01-01  2007-10-01
1   878 2007-10-02  2008-12-01
2   878 2008-12-02  2010-04-03
3   879 2010-04-04  2199-05-11
4   879 2016-05-12  2199-12-31

I used the "apply" function to loop over all groups and, within each group, I used "apply" again per row:

def check_date_by_id(df):

    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()

    def check_date_by_row(x):

        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x

        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x

    return df.apply(check_date_by_row, axis=1).drop(['prevFrom','prevTo'], axis=1)

df.groupby('id').apply(check_date_by_id)

    id  from        to          overlap
0   878 2006-01-01  2007-10-01  False
1   878 2007-10-02  2008-12-01  False
2   878 2008-12-02  2010-04-03  False
3   879 2010-04-04  2199-05-11  False
4   879 2016-05-12  2199-12-31  True

My code was inspired by the following links:

Answer 1

Answered by miradulo

You could just shift the 'to' column and subtract the datetimes directly. A positive difference means the previous row ends after the current row starts.

df['overlap'] = (df['to'].shift() - df['from']) > pd.Timedelta(0)

Applying this while grouping by id may look like

df['overlap'] = (df.groupby('id')
                   .apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(0))
                   .reset_index(level=0, drop=True))

Demo

>>> df
    id       from         to
0  878 2006-01-01 2007-10-01
1  878 2007-10-02 2008-12-01
2  878 2008-12-02 2010-04-03
3  879 2010-04-04 2199-05-11
4  879 2016-05-12 2199-12-31

>>> df['overlap'] = (df.groupby('id')
                       .apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(0))
                       .reset_index(level=0, drop=True))

>>> df
    id       from         to overlap
0  878 2006-01-01 2007-10-01   False
1  878 2007-10-02 2008-12-01   False
2  878 2008-12-02 2010-04-03   False
3  879 2010-04-04 2199-05-11   False
4  879 2016-05-12 2199-12-31    True
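
The same comparison can also be written without `apply`, by shifting 'to' within each group through `groupby(...).shift()`. This is a minimal sketch, assuming the frame is sorted by 'from' within each id as in the question; the name `prev_to` is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to':   pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                            '2199-05-11', '2199-12-31']),
})

# End date of the previous row within the same id
# (NaT for the first row of each id).
prev_to = df.groupby('id')['to'].shift()

# A comparison against NaT evaluates to False, so the first row
# of each group is never flagged.
df['overlap'] = prev_to > df['from']
```

If ranges that share an endpoint day should count as overlapping, as they do in the question's own code, `>=` could be used instead of `>`.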

Answer 2

Answered by Adam Zeldin

Another solution. This could be rewritten to leverage Interval.overlaps in pandas 0.24 and later.

# Note: this code expects the date columns to be named 'start_date' and 'end_date'
# (they are 'from' and 'to' in the question's dataframe).
def overlapping_groups(group):
    # Return the group's id if any row starts or ends inside another row's range
    if len(group) > 1:
        for index, row in group.iterrows():
            for index2, row2 in group.drop(index).iterrows():
                int1 = pd.Interval(row2['start_date'], row2['end_date'], closed='both')
                if row['start_date'] in int1:
                    return row['id']
                if row['end_date'] in int1:
                    return row['id']

gcols = ['id']
group_output = df.groupby(gcols, group_keys=False).apply(overlapping_groups)
ids_with_overlap = set(group_output[~group_output.isnull()].reset_index(drop=True))
df[df['id'].isin(ids_with_overlap)]
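
The rewrite with Interval.overlaps mentioned above might look roughly like the following. It is only a sketch, not the answer author's code: it uses the question's 'from' and 'to' column names, and the helper name `has_overlap` is hypothetical.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'id':   [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to':   pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                            '2199-05-11', '2199-12-31']),
})

def has_overlap(group):
    # Build one closed interval per row, then test every pair once.
    intervals = [pd.Interval(start, end, closed='both')
                 for start, end in zip(group['from'], group['to'])]
    return any(a.overlaps(b) for a, b in combinations(intervals, 2))

# Collect the ids whose group contains at least one overlapping pair.
ids_with_overlap = {key for key, group in df.groupby('id') if has_overlap(group)}
overlapping_rows = df[df['id'].isin(ids_with_overlap)]
```

Like the loop above, this compares every pair of rows within a group, so the cost still grows quadratically with group size.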

Answer 3

Answered by farghal

You can sort by the 'from' column and then simply check whether each 'from' overlaps with the previous 'to', using a rolling apply function, which is very efficient.

import numpy as np

# Convert the timestamps to integers so that rolling.apply can work on them
df['from'] = pd.DatetimeIndex(df['from']).astype(np.int64)
df['to'] = pd.DatetimeIndex(df['to']).astype(np.int64)

sdf = df.sort_values(by='from')
# raw=True passes each window as a NumPy array, so r[0] and r[1] index by position
sdf[["from", "to"]].stack().rolling(window=2).apply(lambda r: 1 if r[1] >= r[0] else 0, raw=True).unstack()

Now the overlapping periods are the rows with from=0.0, meaning the row starts before the previous row ends:

   from   to
0   NaN  1.0
1   1.0  1.0
2   1.0  1.0
3   1.0  1.0
4   0.0  1.0
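
The rolling check above compares each 'from' only with the immediately preceding 'to' and does not take id into account. The sketch below shows one possible variant that works per id and compares against the latest end date seen so far. It swaps in a different technique (a running maximum) and is not the answer author's method; the sample data and the names `running_max_to` and `prev_max_to` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: within id 1, the third range overlaps the first one
# but not the second, so a check against only the previous row misses it.
df = pd.DataFrame({
    'id':   [1, 1, 1, 2],
    'from': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-06-01', '2020-03-01']),
    'to':   pd.to_datetime(['2020-12-31', '2020-03-01', '2020-07-01', '2020-04-01']),
})

sdf = df.sort_values(['id', 'from'])

# Latest end date among the current and all earlier rows of the same id.
running_max_to = sdf.groupby('id')['to'].cummax()

# Shift within each id so that every row sees only earlier rows
# (NaT for the first row of each id).
prev_max_to = running_max_to.groupby(sdf['id']).shift()

# A row overlaps an earlier range if it starts on or before that end date.
sdf['overlap'] = sdf['from'] <= prev_max_to
```

Here a shared endpoint day counts as an overlap; `<` could be used instead if it should not.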
