Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/40804265/

Time: 2020-09-14 02:30:23  Source: igfitidea

How to calculate time difference by group using pandas?

Tags: python, sorting, pandas, difference, timedelta

Asked by Hyman

Problem

I want to calculate diff by group, but I don't know how to sort the time column so that each group's results are sorted and positive.

The original data:

In [37]: df 
Out[37]:
  id                time
0  A 2016-11-25 16:32:17
1  A 2016-11-25 16:36:04
2  A 2016-11-25 16:35:29
3  B 2016-11-25 16:35:24
4  B 2016-11-25 16:35:46

The result I want:

Out[40]:
   id   time
0  A   00:35
1  A   03:12
2  B   00:22

Note: the type of the time column is timedelta64[ns].

Trying

In [38]: df['time'].diff(1)
Out[38]:
0                 NaT
1            00:03:47
2   -1 days +23:59:25
3   -1 days +23:59:55
4            00:00:22
Name: time, dtype: timedelta64[ns]

This doesn't give the desired result.

Hope

The solution should not only solve the problem but also run fast, because there are 50 million rows.

Answered by jezrael

You can use sort_values with groupby and aggregate with diff:

df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id                time     diff
0  A 2016-11-25 16:32:17      NaT
1  A 2016-11-25 16:36:04 00:00:35
2  A 2016-11-25 16:35:29 00:03:12
3  B 2016-11-25 16:35:24      NaT
4  B 2016-11-25 16:35:46 00:00:22
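
For anyone who wants to run the snippet above, here is a self-contained sketch. The DataFrame construction is an assumption (the question only shows the printed frame), with the time column parsed via pd.to_datetime.

```python
import pandas as pd

# Sample data copied from the question; how it was originally built is assumed.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2016-11-25 16:32:17",
        "2016-11-25 16:36:04",
        "2016-11-25 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-25 16:35:46",
    ]),
})

# Sort by id and time so each diff is taken between consecutive timestamps
# of the same id; the assignment realigns the result on the original index.
df["diff"] = df.sort_values(["id", "time"]).groupby("id")["time"].diff()
print(df)
```

Because the assignment aligns on the index, the rows of df stay in their original order even though the differences were computed on the sorted copy.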

If you need to remove rows with NaT in column diff, use dropna:

df = df.dropna(subset=['diff'])
print (df)
  id                time     diff
1  A 2016-11-25 16:36:04 00:00:35
2  A 2016-11-25 16:35:29 00:03:12
4  B 2016-11-25 16:35:46 00:00:22
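
To get closer to the 0..2 index shown in the question's desired result, the dropna step can be followed by reset_index. This is a sketch only: the sample data construction and the name result are assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; how it was originally built is assumed.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2016-11-25 16:32:17",
        "2016-11-25 16:36:04",
        "2016-11-25 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-25 16:35:46",
    ]),
})
df["diff"] = df.sort_values(["id", "time"]).groupby("id")["time"].diff()

# dropna keeps the remaining rows in their existing order;
# reset_index(drop=True) renumbers them from 0.
result = df.dropna(subset=["diff"]).reset_index(drop=True)
print(result[["id", "diff"]])
```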

You can also overwrite the column:

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id     time
0  A      NaT
1  A 00:00:35
2  A 00:03:12
3  B      NaT
4  B 00:00:22


df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
  id     time
1  A 00:00:35
2  A 00:03:12
4  B 00:00:22