Faster way to groupby time of day in pandas

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17288636/

Tags: python, datetime, time, group-by, pandas

Asked by joeb1415

I have a time series of several days of 1-minute data, and would like to average it across all days by time of day.

This is very slow:

from datetime import datetime
from numpy.random import randn
from pandas import date_range, Series

time_ind = date_range(datetime(2013, 1, 1), datetime(2013, 1, 10), freq='1min')
all_data = Series(randn(len(time_ind)), time_ind)
time_mean = all_data.groupby(lambda x: x.time()).mean()

Takes almost a minute to run!

While something like:

time_mean = all_data.groupby(lambda x: x.minute).mean()

takes only a fraction of a second.

Is there a faster way to group by time of day?

Any idea why this is so slow?

Accepted answer by bmu

Both your "lambda-version" and the time property introduced in version 0.11 seem to be slow in version 0.11.0:

In [4]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 11.8 s per loop

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
1 loops, best of 3: 11.8 s per loop

With the current master both methods are considerably faster:

In [1]: pd.version.version
Out[1]: '0.11.1.dev-06cd915'

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
1 loops, best of 3: 215 ms per loop

In [6]: %timeit all_data.groupby(all_data.index.time).mean()
10 loops, best of 3: 113 ms per loop

So you can either update to the master branch or wait for 0.11.1, which should be released this month.

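For reference, here is a minimal sketch of the same approach on a modern pandas (assuming pandas >= 1.0, where the version string lives in pd.__version__ rather than pd.version.version):

import numpy as np
import pandas as pd

# Reproduce the question's setup: nine days of 1-minute data.
time_ind = pd.date_range('2013-01-01', '2013-01-10', freq='1min')
all_data = pd.Series(np.random.randn(len(time_ind)), time_ind)

# Group by the datetime.time objects exposed by DatetimeIndex.time.
time_mean = all_data.groupby(all_data.index.time).mean()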

Answered by Andy Hayden

It's faster to group by the hour/minute/... attributes rather than by .time. Here's Jeff's baseline:

In [11]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 202 ms per loop

and without time objects it's much faster (the fewer attributes, the faster it is):

In [12]: %timeit all_data.groupby(all_data.index.hour).mean()
100 loops, best of 3: 5.53 ms per loop

In [13]: %timeit all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()
10 loops, best of 3: 20.8 ms per loop

Note: time objects don't accept nanoseconds (but nanoseconds are DatetimeIndex's resolution).

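A quick illustration of that resolution mismatch, using only the standard library and pandas (the timestamp below is an arbitrary example):

import datetime
import pandas as pd

# datetime.time stops at microseconds: hour, minute, second, microsecond.
t = datetime.time(9, 30, 15, 500000)

# A DatetimeIndex, by contrast, resolves down to nanoseconds.
idx = pd.DatetimeIndex(['2013-01-01 09:30:15.000000500'])
idx.nanosecond  # contains 500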

We should probably convert the index to time objects to make this comparison fair:

In [21]: res = all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()

In [22]: %timeit res.index.map(lambda t: datetime.time(*t))
1000 loops, best of 3: 1.39 ms per loop

In [23]: res.index = res.index.map(lambda t: datetime.time(*t))

So it's around 10 times faster at maximum resolution, and you can easily make it coarser (and faster), e.g. by grouping on just the hour and minute.

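Putting both steps together, here is a minimal sketch of that coarser hour-and-minute variant (assuming the setup from the question; MultiIndex.map passes each (hour, minute) label to the lambda as a tuple):

import datetime

import numpy as np
import pandas as pd

time_ind = pd.date_range('2013-01-01', '2013-01-10', freq='1min')
all_data = pd.Series(np.random.randn(len(time_ind)), time_ind)

# Fast path: group by the integer hour and minute attributes...
res = all_data.groupby([all_data.index.hour, all_data.index.minute]).mean()

# ...then rebuild readable time-of-day labels on the result.
res.index = res.index.map(lambda hm: datetime.time(*hm))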