Original URL: http://stackoverflow.com/questions/46839032/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Grouping by date range with pandas
Asked by eljusticiero67
I am looking to group by two columns: user_id and date. However, if two dates are close enough, I want to treat the two entries as part of the same group and aggregate them together. The date format is m-d-y.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the result of summing val per group would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects to this, for example what to do if the dates chain together endlessly, each within three days of the next. But the exact data I'm using only has 2 values per person.
Thanks!
Answered by cs95
I'd convert this to a datetime column and then use pd.TimeGrouper:
import pandas as pd
# Parse the m-d-y strings (e.g. '1-10-17') into datetime64 values
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
# Note: pd.TimeGrouper was removed in pandas 1.0, so this form only runs on older versions
result = (df.assign(date=dates).set_index('date')
            .groupby(['user_id', pd.TimeGrouper('3D')])
            .sum()
            .reset_index())
print(result)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
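A note on why the last row above is labelled 2017-01-31 even though the original date is 2017-02-01: with a '3D' frequency, each row is labelled with the left edge of the fixed 3-day bin it falls into. The sketch below reproduces that arithmetic by hand. It assumes the bins start at the earliest date in the data (2017-01-01 here), and `bin_label` is a hypothetical helper for illustration, not a pandas function.

```python
import pandas as pd

# Left edges of consecutive 3-day bins, assuming they start at the earliest date
edges = pd.date_range("2017-01-01", periods=12, freq="3D")

def bin_label(ts):
    """Return the left edge of the 3-day bin that contains ts (hypothetical helper)."""
    ts = pd.Timestamp(ts)
    return edges[edges <= ts].max()

# 2017-02-01 lands in the bin that starts on 2017-01-31
print(bin_label("2017-02-01"))

# Fixed bins are not the same as "within 3 days of each other":
# these two dates are one day apart but fall into different bins
print(bin_label("2017-01-03"), bin_label("2017-01-04"))
```

So fixed bins match the expected output for this particular data, but two dates only one day apart can still be split if they straddle a bin edge.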
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper was deprecated in pandas 0.21 and removed in pandas 1.0, so Grouper is the preferred choice in this scenario (thanks for the heads up, Vaishali!).
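For anyone who wants to run the Grouper version end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question, and the name `result` is just an illustrative choice so that the original `df` is not overwritten.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1, 1, 1, 1, 1, 1, 1, 1],
})

# Parse the m-d-y strings into real datetimes
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Group by user and by fixed 3-day bins on the date column, then sum val
result = (
    df.groupby(["user_id", pd.Grouper(key="date", freq="3D")])["val"]
      .sum()
      .reset_index()
)
print(result)
```

Selecting `["val"]` before `.sum()` keeps the aggregation limited to the one numeric column, which avoids surprises if the frame later gains non-numeric columns.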
Answered by YOBEN_S
I came up with a very ugly solution, but it still works...
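# Assumes df['date'] is already datetime64, e.g. via pd.to_datetime(df.date, format='%m-%d-%y')
df = df.sort_values(['user_id', 'date'])
# Start a new Key whenever the gap to the user's previous row is not under 3 days (or there is no previous row)
df['Key'] = df.groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})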
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
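One thing to keep in mind with this gap-based approach is the chaining concern raised in the question: because each row is compared only to the previous one, a run of dates that are each close to their neighbour ends up in a single group, however long the run gets. A small sketch of that effect follows, using the same `diff().dt.days.lt(3)` test as above. The `chain_ids` helper and the sample dates are made up for illustration.

```python
import pandas as pd

def chain_ids(dates: pd.Series, threshold_days: int = 3) -> pd.Series:
    """Label sorted dates; a new label starts unless the gap to the
    previous date is under threshold_days (hypothetical helper)."""
    gaps = dates.diff().dt.days  # NaN for the first row
    return gaps.lt(threshold_days).ne(True).cumsum()

# Made-up dates: four dates two days apart, then one isolated date
dates = pd.Series(pd.to_datetime(
    ["2017-01-01", "2017-01-03", "2017-01-05", "2017-01-07", "2017-01-20"]
))

ids = chain_ids(dates)

# Span in days covered by each group
first = dates.groupby(ids).min()
last = dates.groupby(ids).max()
span_days = (last - first).dt.days
print(ids.tolist())
print(span_days.tolist())
```

Here the first four dates share one label even though the first and last are six days apart. That is harmless when each user has only two rows, as in the question, but worth checking on denser data.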