pandas: resampling with custom periods
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address and authors, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/15408156/
Resampling with custom periods
Asked by Rutger Kassies
Is there a 'cookbook' way of resampling a DataFrame with (semi)irregular periods?
I have a dataset at a daily interval and want to resample it to what is sometimes (in scientific literature) called dekads. I don't think there is a proper English term for it, but it basically means chopping a month into three ~ten-day parts, where the third is a remainder of anything between 8 and 11 days.
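As a rough illustration of that definition (a sketch only; the helper names `dekad_index` and `third_dekad_length` are made up here and are not part of pandas), the dekad of a date depends only on its day of the month:

```python
import calendar

def dekad_index(day):
    """Map a day of the month (1-31) to its dekad: 0, 1 or 2."""
    # Days 1-10 -> 0, days 11-20 -> 1, everything from day 21 on -> 2.
    return min((day - 1) // 10, 2)

def third_dekad_length(year, month):
    """Number of days in the third dekad (day 21 to the end of the month)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month - 20

print([dekad_index(d) for d in (1, 10, 11, 20, 21, 31)])
print(third_dekad_length(2013, 1), third_dekad_length(2013, 2))
```

The third dekad is the only one whose length varies, which is what makes a fixed-frequency resample awkward here.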
I came up with two solutions myself, a specific one for this case and a more general one for any irregular periods. But neither is really good, so I'm curious how others handle these types of situations.
Let's start with creating some sample data:
import datetime
import numpy as np
import pandas as pd
begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each month manually, like this:
def to_dec1(data, func):
    # create the indexes, the start of each ~10-day period
    idx1 = datetime.datetime(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + datetime.timedelta(days=10)
    idx3 = idx2 + datetime.timedelta(days=10)
    # slice each period (label slices include both ends) and apply the function
    oneday = datetime.timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)
    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3], columns=data.columns)

dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)
Which results in:
print(dfmean)
                  p1         p2
2013-01-01  5.436778  10.409845
2013-01-11  5.534509  10.482231
2013-01-21  5.449058  10.454777
2013-02-01  5.685700  10.422697
2013-02-11  5.578137  10.532180
2013-02-21       NaN        NaN
Note that you always get a full month of dekads in return, even when the data ends partway through the month (hence the NaN row). That is not a problem, and the extra rows are easy to remove if needed.
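One way to drop such empty trailing dekads is `DataFrame.dropna(how="all")`. The sketch below uses a small hand-built stand-in for `dfmean` (the numbers are illustrative placeholders, not real results):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tail of dfmean: one filled dekad, one empty one.
dfmean_tail = pd.DataFrame(
    {"p1": [5.58, np.nan], "p2": [10.53, np.nan]},
    index=pd.to_datetime(["2013-02-11", "2013-02-21"]),
)

# Keep only rows that have at least one non-NaN value.
trimmed = dfmean_tail.dropna(how="all")
print(trimmed)
```

Using `how="all"` keeps a dekad that has data in at least one column, which may or may not be what you want if columns can be partially missing.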
The other solution works by providing a range of dates at which you chop up the DataFrame, then performing a function on each segment. It's more flexible in terms of the periods you can use.
def to_dec2(data, dts, func):
    chunks = []
    # each period runs from one date in dts up to the day before the next one
    for n, start in enumerate(dts[:-1]):
        end = dts[n+1] - datetime.timedelta(days=1)
        chunks.append(func(data.loc[start:end].values, axis=0))
    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)

dfmean2 = to_dec2(df, dfmean.index, np.mean)
Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.
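The same idea can be sketched without an explicit loop by labelling every row with the period it falls in and letting `groupby` do the aggregation. This is only an illustration under stated assumptions: `aggregate_by_edges` is a made-up helper name, the edges must be sorted, and unlike `to_dec2` the last edge starts an open-ended final period rather than only closing the previous one.

```python
import numpy as np
import pandas as pd

def aggregate_by_edges(data, edges, func="mean"):
    """Group rows of `data` by the latest date in `edges` that is <= the row's date."""
    edge_values = pd.DatetimeIndex(edges).values
    # Position of the last edge at or before each timestamp (-1 if none).
    pos = np.searchsorted(edge_values, data.index.values, side="right") - 1
    inside = pos >= 0  # rows that fall before the first edge are dropped
    labels = edge_values[pos[inside]]
    return data[inside].groupby(labels).agg(func)

# Small deterministic example: 25 daily values 0..24.
idx = pd.date_range("2013-01-01", periods=25)
demo = pd.DataFrame({"p1": np.arange(25, dtype=float)}, index=idx)
result = aggregate_by_edges(demo, ["2013-01-01", "2013-01-11", "2013-01-21"])
print(result)
```

Because the labels are just an array aligned with the rows, any set of irregular period starts can be passed in, not only dekads.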
What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?
Accepted answer by HYRY
If you use numpy 1.7, you can use datetime64 & timedelta64 arrays to do the calculation:
Create the sample data:
import datetime
import pandas as pd
import numpy as np
begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
Calculate the start date of each row's dekad and group by it:
d = df.index.day - np.clip((df.index.day-1) // 10, 0, 2)*10 - 1
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()
The output is:
                  p1         p2
2013-01-01  5.413795  10.445640
2013-01-11  5.516063  10.491339
2013-01-21  5.539676  10.528745
2013-02-01  5.783467  10.478001
2013-02-11  5.358787  10.579149
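To make the arithmetic behind d easier to follow, here is a step-by-step sketch of the same calculation. It is an illustration rather than the answer's exact code: the intermediate names (`day`, `dekad`, `offset`, `start`) are introduced here, and `pd.to_timedelta` is used in place of the raw timedelta64 array.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", "2013-02-20")

day = idx.day.to_numpy().astype("int64")   # day of month, 1..31
dekad = np.minimum((day - 1) // 10, 2)     # 0, 1 or 2; day 31 stays in dekad 2
offset = day - dekad * 10 - 1              # days elapsed since the dekad started
start = idx - pd.to_timedelta(offset, unit="D")  # first day of each row's dekad

# 2013-01-31: day 31, dekad 2, offset 10, so the dekad starts on 2013-01-21.
print(start[30])
print(start.unique())
```

Grouping by `start` (for example `df.groupby(start).mean()` on a frame indexed by `idx`) then gives one row per dekad.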
Answer by Jeff
Using HYRY's data and solution up to the computation of the variable d, we can also do the following in pandas 0.11-dev or later (regardless of numpy version):
In [18]: from datetime import timedelta
In [23]: pd.Series([ timedelta(int(i)) for i in d ])
Out[23]:
0             00:00:00
1     1 days, 00:00:00
2     2 days, 00:00:00
3     3 days, 00:00:00
4     4 days, 00:00:00
5     5 days, 00:00:00
6     6 days, 00:00:00
7     7 days, 00:00:00
8     8 days, 00:00:00
9     9 days, 00:00:00
10            00:00:00
...
47    6 days, 00:00:00
48    7 days, 00:00:00
49    8 days, 00:00:00
50    9 days, 00:00:00
Length: 51, dtype: timedelta64[ns]
The date is constructed similarly to the above:
date = pd.Series(df.index) - pd.Series([ timedelta(int(i)) for i in d ])
df.groupby(date.values).mean()

