Faster way to groupby time of day in pandas

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17288636/

Tags: python, datetime, time, group-by, pandas

Asked by joeb1415

I have a time series of several days of 1-minute data, and would like to average it across all days by time of day.

This is very slow:

from datetime import datetime
from numpy.random import randn
from pandas import date_range, Series

time_ind = date_range(datetime(2013, 1, 1), datetime(2013, 1, 10), freq='1min')
all_data = Series(randn(len(time_ind)), time_ind)
time_mean = all_data.groupby(lambda x: x.time()).mean()

Takes almost a minute to run!

While something like:

time_mean = all_data.groupby(lambda x: x.minute).mean()

takes only a fraction of a second.

Is there a faster way to group by time of day?

Any idea why this is so slow?

Accepted answer by bmu

Both your "lambda-version" and the time property introduced in version 0.11 seem to be slow in version 0.11.0:

In [4]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 11.8 s per loop

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
1 loops, best of 3: 11.8 s per loop

With the current master both methods are considerably faster:

In [1]: pd.version.version
Out[1]: '0.11.1.dev-06cd915'

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
1 loops, best of 3: 215 ms per loop

In [6]: %timeit all_data.groupby(all_data.index.time).mean()
10 loops, best of 3: 113 ms per loop

So you can either update to the master branch or wait for 0.11.1, which should be released this month.

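For reference, here is a minimal sketch of the same approach on a modern pandas (assuming pandas >= 1.0, where the version string lives in pd.__version__ rather than pd.version.version):

import numpy as np
import pandas as pd

# Reproduce the question's setup: nine days of 1-minute data.
time_ind = pd.date_range('2013-01-01', '2013-01-10', freq='1min')
all_data = pd.Series(np.random.randn(len(time_ind)), time_ind)

# Group by the datetime.time objects exposed by DatetimeIndex.time.
time_mean = all_data.groupby(all_data.index.time).mean()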

Answered by Andy Hayden

It's faster to group by the hour/minute/... attributes rather than by .time. Here's Jeff's baseline:

In [11]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 202 ms per loop

and without time objects it's much faster (the fewer attributes, the faster it is):

In [12]: %timeit all_data.groupby(all_data.index.hour).mean()
100 loops, best of 3: 5.53 ms per loop

In [13]: %timeit all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()
10 loops, best of 3: 20.8 ms per loop

Note: time objects don't accept nanoseconds (but nanoseconds are DatetimeIndex's resolution).

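A quick illustration of that resolution mismatch, using only the standard library and pandas (the timestamp below is an arbitrary example):

import datetime
import pandas as pd

# datetime.time stops at microseconds: hour, minute, second, microsecond.
t = datetime.time(9, 30, 15, 500000)

# A DatetimeIndex, by contrast, resolves down to nanoseconds.
idx = pd.DatetimeIndex(['2013-01-01 09:30:15.000000500'])
idx.nanosecond  # contains 500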

We should probably convert the index to time objects to make this comparison fair:

In [21]: res = all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()

In [22]: %timeit res.index.map(lambda t: datetime.time(*t))
1000 loops, best of 3: 1.39 ms per loop

In [23]: res.index = res.index.map(lambda t: datetime.time(*t))

So it's around 10 times faster at maximum resolution, and you can easily make it coarser (and faster), e.g. by grouping on just the hour and minute.

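Putting both steps together, here is a minimal sketch of that coarser hour-and-minute variant (assuming the setup from the question; MultiIndex.map passes each (hour, minute) label to the lambda as a tuple):

import datetime

import numpy as np
import pandas as pd

time_ind = pd.date_range('2013-01-01', '2013-01-10', freq='1min')
all_data = pd.Series(np.random.randn(len(time_ind)), time_ind)

# Fast path: group by the integer hour and minute attributes...
res = all_data.groupby([all_data.index.hour, all_data.index.minute]).mean()

# ...then rebuild readable time-of-day labels on the result.
res.index = res.index.map(lambda hm: datetime.time(*hm))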