pandas: resampling with custom periods

Question

Asked by Rutger Kassies

Is there a 'cookbook' way of resampling a DataFrame with (semi)irregular periods?

I have a dataset at a daily interval and want to resample it to what is sometimes (in scientific literature) called dekads. I don't think there is a proper English term for it, but it basically means chopping a month into three ~ten-day parts, where the third is a remainder of anything between 8 and 11 days.

I came up with two solutions myself, a specific one for this case and a more general one for any irregular periods. But neither is really good, so I'm curious how others handle these types of situations.

Let's start by creating some sample data:

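import datetime

import numpy as np
import pandas as pd

# pd.datetime has been removed from recent pandas releases; use the stdlib datetime
begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)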

dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)

The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each month manually, like this:

def to_dec1(data, func):

    # create the indexes, start of each ~10-day period
    idx1 = datetime.datetime(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + datetime.timedelta(days=10)
    idx3 = idx2 + datetime.timedelta(days=10)

    # slice each period (label slices include both ends) and apply the function
    oneday = datetime.timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)

    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3], columns=data.columns)

dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)

Which results in:

print(dfmean)

                  p1         p2
2013-01-01  5.436778  10.409845
2013-01-11  5.534509  10.482231
2013-01-21  5.449058  10.454777
2013-02-01  5.685700  10.422697
2013-02-11  5.578137  10.532180
2013-02-21       NaN        NaN

Note that you always get a full month of 'dekads' in return, even when the data ends partway through the month (hence the NaN row above). It's not a problem and is easy to remove if needed.
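
As a minimal sketch of that removal, assuming the empty dekads show up as rows in which every column is NaN (as in the output above), `dropna(how="all")` would discard them. The small frame `dfmean_demo` below is a hypothetical stand-in for `dfmean`, not the actual result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmean: the last dekad has no data.
idx = pd.to_datetime(["2013-02-01", "2013-02-11", "2013-02-21"])
dfmean_demo = pd.DataFrame(
    {"p1": [5.6, 5.5, np.nan], "p2": [10.4, 10.5, np.nan]}, index=idx
)

# Drop only the rows in which every column is NaN.
trimmed = dfmean_demo.dropna(how="all")
print(trimmed)
```

Using `how="all"` keeps a dekad that is missing in only one column, which may or may not be what you want.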

The other solution works by providing a range of dates at which you chop up the DataFrame and then applying a function to each segment. It's more flexible in terms of the periods you can use.

def to_dec2(data, dts, func):

    chunks = []
    for n, start in enumerate(dts[:-1]):

        # each segment ends the day before the next start date
        end = dts[n+1] - datetime.timedelta(days=1)
        chunks.append(func(data.loc[start:end].values, axis=0))

    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)

dfmean2 = to_dec2(df, dfmean.index, np.mean)

Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.
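
The same idea (chopping at an arbitrary list of start dates) can probably be expressed without an explicit loop. The sketch below is one possible way, not an established pandas recipe: it uses `np.searchsorted` to assign every row to the most recent start date and then groups on that label. The function name `resample_by_boundaries` and its parameters are made up for illustration, and it assumes the start dates are sorted and that the last period is open-ended.

```python
import numpy as np
import pandas as pd

def resample_by_boundaries(data, starts, func="mean"):
    """Aggregate rows of `data` into periods that begin at each date in `starts`.

    Assumes `starts` is sorted. Each period runs up to (but not including)
    the next start; the last period is open-ended. Rows dated before
    starts[0] are dropped.
    """
    starts = pd.DatetimeIndex(starts)
    # Index of the last start date that is <= each row's timestamp (-1 if none).
    pos = np.searchsorted(starts.values, data.index.values, side="right") - 1
    keep = pos >= 0
    labels = starts.values[pos[keep]]
    return data.loc[keep].groupby(labels).agg(func)

# Deterministic sample data so the result is easy to check by hand.
dtrange = pd.date_range("2013-01-01", "2013-02-20")
df = pd.DataFrame({"p1": np.arange(len(dtrange), dtype=float)}, index=dtrange)

starts = pd.to_datetime(
    ["2013-01-01", "2013-01-11", "2013-01-21", "2013-02-01", "2013-02-11"]
)
result = resample_by_boundaries(df, starts)
print(result)
```

Unlike `to_dec2`, this variant does not need a trailing end date, and periods that contain no rows simply do not appear in the result.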

What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?

Answer 1

Accepted answer by HYRY

If you use numpy 1.7 or later, you can use datetime64 and timedelta64 arrays to do the calculation.

Create the sample data:

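import datetime

import numpy as np
import pandas as pd

# pd.datetime has been removed from recent pandas releases; use the stdlib datetime
begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)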

dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)

Calculate the start date of each dekad:

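# days elapsed since the start of the dekad (0-9, or up to 10 in the third dekad)
d = df.index.day - np.clip((df.index.day-1) // 10, 0, 2)*10 - 1
# subtract that offset so every date maps to the first day of its dekad
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()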

The output is:

                 p1         p2
2013-01-01  5.413795  10.445640
2013-01-11  5.516063  10.491339
2013-01-21  5.539676  10.528745
2013-02-01  5.783467  10.478001
2013-02-11  5.358787  10.579149
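
To make the arithmetic behind `d` easier to follow, here is a small sketch that wraps the same calculation in a helper and checks it on a few boundary days. The name `dekad_start` is hypothetical, and the sketch assumes a DatetimeIndex of midnight timestamps (any time-of-day component would be carried through unchanged).

```python
import numpy as np
import pandas as pd

def dekad_start(index):
    """Return the first day of the dekad for each timestamp in a DatetimeIndex."""
    day = np.asarray(index.day)
    # Dekad number within the month: 0 (days 1-10), 1 (days 11-20), 2 (day 21 to month end).
    dekad = np.minimum((day - 1) // 10, 2)
    # Days elapsed since the start of that dekad.
    offset = day - dekad * 10 - 1
    return index - pd.to_timedelta(offset, unit="D")

idx = pd.to_datetime(
    ["2013-01-01", "2013-01-10", "2013-01-11", "2013-01-31", "2013-02-28"]
)
starts = dekad_start(idx)
print(list(starts.strftime("%Y-%m-%d")))
# ['2013-01-01', '2013-01-01', '2013-01-11', '2013-01-21', '2013-02-21']
```

Capping the dekad number at 2 is what folds day 31 into the third dekad instead of starting a fourth one.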

Answer 2

Answer by Jeff

Using HYRY's data and solution up to the computation of the variable d, we can also do the following in pandas 0.11-dev or later (regardless of numpy version):

In [18]: from datetime import timedelta

In [23]: pd.Series([ timedelta(int(i)) for i in d ])
Out[23]: 
0             00:00:00
1     1 days, 00:00:00
2     2 days, 00:00:00
3     3 days, 00:00:00
4     4 days, 00:00:00
5     5 days, 00:00:00
6     6 days, 00:00:00
7     7 days, 00:00:00
8     8 days, 00:00:00
9     9 days, 00:00:00
10            00:00:00

47    6 days, 00:00:00
48    7 days, 00:00:00
49    8 days, 00:00:00
50    9 days, 00:00:00
Length: 51, dtype: timedelta64[ns]

The date is constructed similarly to the above:

date = pd.Series(df.index) - pd.Series([ timedelta(int(i)) for i in d ])
df.groupby(date.values).mean()
