原文地址: http://stackoverflow.com/questions/23966152/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow.
How to create a group ID based on a 5-minute interval in a pandas timeseries?
Asked by user3576212
I have a timeseries dataframe df that looks like this (the time series happens within the same day, but across different hours):
id val
time
2014-04-03 16:01:53 23 14389
2014-04-03 16:01:54 28 14391
2014-04-03 16:05:55 24 14393
2014-04-03 16:06:25 23 14395
2014-04-03 16:07:01 23 14395
2014-04-03 16:10:09 23 14395
2014-04-03 16:10:23 26 14397
2014-04-03 16:10:57 26 14397
2014-04-03 16:11:10 26 14397
I need to create a group every 5 minutes, starting from 16:00:00. That is, all the rows in the range 16:00:00 to 16:05:00 get the value 1 in the new column period. (The number of rows within each group is irregular, so I can't simply cut the groups.)
Eventually, the data should look like this:
id val period
time
2014-04-03 16:01:53 23 14389 1
2014-04-03 16:01:54 28 14391 1
2014-04-03 16:05:55 24 14393 2
2014-04-03 16:06:25 23 14395 2
2014-04-03 16:07:01 23 14395 2
2014-04-03 16:10:09 23 14395 3
2014-04-03 16:10:23 26 14397 3
2014-04-03 16:10:57 26 14397 3
2014-04-03 16:11:10 26 14397 3
The purpose is to perform some groupby operation, but the operation I need to do is not available through the pd.resample(how=' ') method. So I have to create a period column to identify each group, then do df.groupby('period').apply(myfunc).
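One direct way to build such a period column (a sketch not shown in the original post; it assumes a pandas version with `DatetimeIndex.floor`) is to floor each timestamp to its 5-minute bucket and then number the buckets. Note that `factorize` numbers only non-empty buckets consecutively, which matches the desired output here:

```python
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame(
    {"id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
     "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=pd.to_datetime([
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10"]))
df.index.name = "time"

# Floor timestamps to 5-minute buckets, then number the buckets 1, 2, 3, ...
df["period"] = pd.factorize(df.index.floor("5min"))[0] + 1
# df["period"] is now [1, 1, 2, 2, 2, 3, 3, 3, 3]
```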
Any help or comments are highly appreciated. Thanks!
Accepted answer by Karl D.
You can use the TimeGrouper function in a groupby/apply. With a TimeGrouper you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:
>>> df.groupby(pd.TimeGrouper('5Min'))['val'].mean()
time
2014-04-03 16:00:00 14390.000000
2014-04-03 16:05:00 14394.333333
2014-04-03 16:10:00 14396.500000
Or an example with an explicit apply:
>>> df.groupby(pd.TimeGrouper('5Min'))['val'].apply(lambda x: len(x) > 3)
time
2014-04-03 16:00:00 False
2014-04-03 16:05:00 False
2014-04-03 16:10:00 True
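Note that `pd.TimeGrouper` was deprecated and later removed from pandas; `pd.Grouper(freq=...)` is the drop-in replacement. A sketch of the same two calls on the question's sample data, assuming a current pandas version:

```python
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame(
    {"id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
     "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=pd.to_datetime([
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10"]))
df.index.name = "time"

# pd.Grouper replaces pd.TimeGrouper in modern pandas
means = df.groupby(pd.Grouper(freq="5min"))["val"].mean()
big = df.groupby(pd.Grouper(freq="5min"))["val"].apply(lambda x: len(x) > 3)
# means: 14390.0, 14394.33..., 14396.5 -- same as the TimeGrouper output above
# big:   False, False, True
```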
Docstring for TimeGrouper:
TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)
Custom groupby class for time-interval grouping
Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
If axis is PeriodIndex
Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object
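The closed and label parameters from the docstring control which bin edge is inclusive and which edge names the bin. A small sketch of the label parameter, using modern `pd.Grouper` (TimeGrouper accepts the same arguments):

```python
import pandas as pd

s = pd.Series([1, 2],
              index=pd.to_datetime(["2014-04-03 16:01:00",
                                    "2014-04-03 16:06:00"]))

# Default for minute frequencies: closed="left", label="left",
# so the bin [16:00, 16:05) is labeled 16:00.
left = s.groupby(pd.Grouper(freq="5min")).sum()

# label="right" names the same bin by its right edge, 16:05.
right = s.groupby(pd.Grouper(freq="5min", label="right")).sum()
```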
Edit
I don't know of an elegant way to create the period column, but the following will work:
>>> new = df.groupby(pd.TimeGrouper('5Min'),as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df
id val period
time
2014-04-03 16:01:53 23 14389 0
2014-04-03 16:01:54 28 14391 0
2014-04-03 16:05:55 24 14393 1
2014-04-03 16:06:25 23 14395 1
2014-04-03 16:07:01 23 14395 1
2014-04-03 16:10:09 23 14395 2
2014-04-03 16:10:23 26 14397 2
2014-04-03 16:10:57 26 14397 2
2014-04-03 16:11:10 26 14397 2
It works because the groupby here with as_index=False actually returns the period column you want as part of the multi-index, and I just grab that part of the multi-index and assign it to a new column in the original dataframe. You could do anything in the apply; I just want the index:
>>> new
time
0 2014-04-03 16:01:53 14389
2014-04-03 16:01:54 14391
1 2014-04-03 16:05:55 14393
2014-04-03 16:06:25 14395
2014-04-03 16:07:01 14395
2 2014-04-03 16:10:09 14395
2014-04-03 16:10:23 14397
2014-04-03 16:10:57 14397
2014-04-03 16:11:10 14397
>>> new.index.get_level_values(0)
Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
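In later pandas versions, `GroupBy.ngroup()` (added in 0.20.2) produces the same zero-based period column directly, without the apply/multi-index detour. A sketch on the question's sample data:

```python
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame(
    {"id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
     "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=pd.to_datetime([
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10"]))
df.index.name = "time"

# ngroup() numbers each 5-minute group 0, 1, 2, ... in group order
df["period"] = df.groupby(pd.Grouper(freq="5min")).ngroup()
# df["period"] is now [0, 0, 1, 1, 1, 2, 2, 2, 2] -- same as the Edit's output
```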
Answered by pbreach
Depending on what you're doing, if I understand the question right, this can be done a lot more easily just using the resample method:
#Get some data
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
a = np.random.randint(20, high=30, size=(len(index),1))
b = np.random.randint(14440, high=14449, size=(len(index),1))
df = pd.DataFrame(np.concatenate((a,b), axis=1), index=index, columns=['id','val'])
df.head()
Out[34]:
id val
2013-01-01 00:00:00 20 14446
2013-01-01 00:01:00 25 14443
2013-01-01 00:02:00 25 14448
2013-01-01 00:03:00 20 14445
2013-01-01 00:04:00 28 14442
#Define a function for the sample variance
import numpy as np

def pyfun(X):
    if X.shape[0] <= 1:
        result = np.nan
    else:
        total = 0
        for x in X:
            total = total + x
        mean = float(total) / X.shape[0]
        total = 0
        for x in X:
            total = total + (mean - x)**2
        result = float(total) / (X.shape[0] - 1)
    return result
#Try it out
df.resample('5min', how=pyfun)
Out[53]:
id val
2013-01-01 00:00:00 12.3 5.7
2013-01-01 00:05:00 9.3 7.3
2013-01-01 00:10:00 4.7 0.8
2013-01-01 00:15:00 10.8 10.3
2013-01-01 00:20:00 11.5 1.5
Well, that was easy. This is for your own functions, but if you want to use a function from a library, then all you need to do is specify the function in the how keyword:
df.resample('5min', how=np.var).head()
Out[54]:
id val
2013-01-01 00:00:00 12.3 5.7
2013-01-01 00:05:00 9.3 7.3
2013-01-01 00:10:00 4.7 0.8
2013-01-01 00:15:00 10.8 10.3
2013-01-01 00:20:00 11.5 1.5
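The `how=` keyword shown above was removed in pandas 0.25; in current pandas you chain a method or pass a function to `.apply()`/`.agg()` on the resampler instead. A sketch of both spellings on synthetic data (the seed, date range, and shape here are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

# Synthetic minute-frequency data, one value per minute for half an hour
index = pd.date_range("2013-01-01 00:00", "2013-01-01 00:30", freq="min")
rng = np.random.default_rng(42)
df = pd.DataFrame({"id": rng.integers(20, 30, size=len(index)),
                   "val": rng.integers(14440, 14449, size=len(index))},
                  index=index)

# Modern replacements for df.resample('5min', how=...):
built_in = df.resample("5min").var()                   # built-in sample variance (ddof=1)
custom = df.resample("5min").apply(lambda x: x.var())  # arbitrary custom function
```

Both produce one row per 5-minute bin; the last bin holds a single observation, so its sample variance is NaN.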

