
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator): StackOverflow. Original: http://stackoverflow.com/questions/46839032/

Date: 2020-09-14 04:40:04  Source: igfitidea

Grouping by date range with pandas

python, pandas, datetime, group-by, pandas-groupby

Asked by eljusticiero67

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to consider the two entries part of the same group and group accordingly. Dates are in m-d-y format.


user_id     date       val
1           1-1-17     1
2           1-1-17     1
3           1-1-17     1
1           1-1-17     1
1           1-2-17     1
2           1-2-17     1
2           1-10-17    1
3           2-1-17     1

The grouping would be by user_id and by dates within +/- 3 days of each other, so grouping and summing val would look like:


user_id     date       sum(val)
1           1-2-17     3
2           1-2-17     2
2           1-10-17    1
3           1-1-17     1
3           2-1-17     1

Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects to this: for example, what to do if the dates chain together endlessly, each three days apart. But the exact data I'm using only has 2 values per person.


Thanks!


Answered by cs95

I'd convert this to a datetime column and then use pd.TimeGrouper:


dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0   2017-01-01
1   2017-01-01
2   2017-01-01
3   2017-01-01
4   2017-01-02
5   2017-01-02
6   2017-01-10
7   2017-02-01
Name: date, dtype: datetime64[ns]

df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())    
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1


Similar solution using pd.Grouper:


df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1

Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).

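On pandas ≥ 1.0, TimeGrouper has been removed entirely, so only the pd.Grouper form runs. A self-contained sketch of that variant (sample data reconstructed from the question):

```python
import pandas as pd

# Sample data from the question, with dates parsed up front
df = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 1, 2, 2, 3],
    'date': pd.to_datetime(
        ['1-1-17', '1-1-17', '1-1-17', '1-1-17',
         '1-2-17', '1-2-17', '1-10-17', '2-1-17'],
        format='%m-%d-%y'),
    'val': [1] * 8,
})

# Fixed 3-day bins anchored at the earliest date, per user
out = (df.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
         .sum()
         .reset_index())
```

Note that `freq='3D'` cuts fixed 3-day bins starting from the first timestamp, which is close to, but not exactly the same as, "dates within +/- 3 days of each other".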

Answered by YOBEN_S

I've come up with a very ugly solution, but it still works...


# assumes df['date'] is already datetime (see the conversion above)
df = df.sort_values(['user_id', 'date'])
# start a new group whenever the gap to the previous row (per user) is >= 3 days
df['Key'] = df.groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})

Out[586]: 
   user_id  Key  val       date
0        1    1    3 2017-01-01
1        2    2    2 2017-01-01
2        2    3    1 2017-01-10
3        3    4    1 2017-01-01
4        3    5    1 2017-02-01
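The one-liner above packs several steps together; unpacked on a tiny hypothetical series (not the question's data), the diff/cumsum trick works like this:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-10']))
gap = s.diff().dt.days          # days since the previous row: NaN, 1.0, 8.0
new_group = gap.lt(3).ne(True)  # True where the gap is >= 3 days (or NaN, i.e. the first row)
key = new_group.cumsum()        # running count of group starts
```

Each True marks the start of a new group, and the cumulative sum turns those markers into a group key.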