Pandas equivalent of resample for an integer index

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/37396264/

Time: 2020-09-14 01:16:40  Source: igfitidea

Pandas' equivalent of resample for integer index

Tags: python, pandas, resampling

Asked by TomCho

I'm looking for a pandas equivalent of the resample method for a dataframe whose index isn't a DatetimeIndex but an array of integers, or maybe even floats.

I know that for some cases (this one, for example) the resample method can be substituted easily by a reindex and interpolation, but for some cases (I think) it can't.

For example, if I have

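import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10,2))
withdates = df.set_index(pd.date_range('2012-01-01', periods=10))
# recent pandas versions no longer accept the aggregation as a second argument
withdates.resample('5D').std()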

this gives me

                   0         1
2012-01-01  1.184582  0.492113
2012-01-06  0.533134  0.982562

but I can't produce the same result with df and resample. So I'm looking for something that would work like

 df.resample(5, np.std)

and that would give me

          0         1
0  1.184582  0.492113
5  0.533134  0.982562

Does such a method exist? The only way I was able to get this behavior was by manually splitting df into smaller dataframes, applying np.std to each and then concatenating everything back together, which I find pretty slow and not smart at all.

Cheers

Accepted answer by piRSquared

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the group labels yourself. I'd use:

(df.index.to_series() / 5).astype(int)

This gives you a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...]. Then use this in a groupby.
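
As a small illustration of the labelling step, here is a sketch on a deterministic 10-row frame (the frame and the names `labels` and `grouped` are made up for this example). It uses floor division, which gives the same labels as dividing and casting to int for a non-negative integer index:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the default integer index 0..9.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

# Floor division assigns rows 0-4 to group 0 and rows 5-9 to group 1.
labels = df.index.to_series() // 5
print(labels.tolist())

# Grouping by the labels aggregates each block of five rows.
grouped = df.groupby(labels).std()
print(grouped)
```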

You'll also need to specify the index for the new dataframe. I'd use:

df.index[4::5]

This takes the current index starting at the 5th position (hence the 4) and every 5th position after that. It will look like [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I went with the ending positions.
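
The two slicing choices can be compared directly. This is only a sketch on a stand-in index of 20 integers (the names `idx`, `end_labels` and `start_labels` are illustrative):

```python
import pandas as pd

# Stand-in for df.index when df has 20 rows and the default index.
idx = pd.RangeIndex(20)

end_labels = idx[4::5]    # last position of each block of five
start_labels = idx[::5]   # first position of each block of five

print(list(end_labels))
print(list(start_labels))
```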

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

Looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390
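
The same idea can be wrapped in a small helper. The sketch below is not part of the answer: the function name `downsample` and its parameters are hypothetical, and it labels groups by row position rather than by index value. Looking up the last index value per group, instead of slicing with [4::5], also copes with a final group that has fewer than n rows:

```python
import numpy as np
import pandas as pd

def downsample(df, n, func="std"):
    """Aggregate every n consecutive rows (hypothetical helper)."""
    # Position-based labels, so this works for any index type.
    labels = np.arange(len(df)) // n
    out = df.groupby(labels).agg(func)
    # Label each group with the index value of its last row.
    last = df.index.to_series().groupby(labels).last()
    out.index = pd.Index(last.to_numpy(), name=df.index.name)
    return out

# 12 rows with n=5: two full groups and one short group of two rows.
df = pd.DataFrame({"A": np.arange(12.0)})
result = downsample(df, 5, "mean")
print(result)
```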

Other considerations

This is the equivalent of downsampling. We haven't addressed upsampling.

To go back from what we've produced to a dataframe indexed by something more frequent, we can use reindex like so:

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

Looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390

We could also reindex by something else, like range(0, 20, 2), to upsample to even integer indices.
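
One caveat when the new index does not contain every original label: reindexing first and filling afterwards discards the labels that are missing from the new index before the fill happens. A sketch with made-up values (the frame and the names `two_step` and `one_step` are illustrative) compares that with passing method="bfill" to reindex, which fills from the original labels:

```python
import pandas as pd

# Hypothetical downsampled frame labelled by block end positions.
df_down = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=[4, 9, 14, 19])
even = list(range(0, 20, 2))

# Reindex first, fill afterwards: labels 9 and 19 are dropped before filling.
two_step = df_down.reindex(even).bfill()

# Fill during reindex: every original label can contribute a value.
one_step = df_down.reindex(even, method="bfill")

print(two_step["A"].tolist())
print(one_step["A"].tolist())
```

Which of the two behaviors is wanted depends on the data, so treat this as something to check rather than as a fix to the answer.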

Answer by TomCho

Alternatively, this is one thing that can be done:

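import numpy as np
import pandas as pd

def resample(df, rule, how=None, **kwargs):
    # Default to the mean when no aggregation function is given.
    if how is None:
        how = np.mean

    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        # Recent pandas versions no longer accept `how` in resample(); aggregate explicitly.
        return df.resample(rule, **kwargs).agg(how)
    else:
        # Left-closed bins of width `rule`; the stop value makes sure the last index value falls in a bin.
        edges = list(range(df.index[0], df.index[-1] + rule + 1, rule))
        idx, bins = pd.cut(df.index, edges, right=False, retbins=True)
        # observed=False keeps empty bins so the result lines up with bins[:-1].
        aux = df.groupby(idx, observed=False).agg(how)
        aux = aux.set_index(bins[:-1])
        return aux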
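
To see what pd.cut contributes here, the sketch below runs it on a stand-in index of ten integers with hand-written bin edges (the names `cats`, `bins` and `codes` are illustrative). With right=False each bin includes its left edge and excludes its right edge, and retbins=True also returns the edges:

```python
import numpy as np
import pandas as pd

idx = pd.Index(range(10))

# Two left-closed bins: [0, 5) and [5, 10).
cats, bins = pd.cut(idx, [0, 5, 10], right=False, retbins=True)

# Integer code of the bin each index value falls into.
codes = np.asarray(cats.codes).tolist()
print(codes)
print(list(bins))
```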

Answer by kidpixo

@piRSquared's solution is really nice, but I don't like picking the index by hand when reindexing.

This should work for any kind of downsampling (with a float index too) and automatically picks the mean of the index values in each range:

df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'

s = (df.index.to_series() / 10).astype(int)

Now you can pick whichever function you want to calculate in each subgroup:

# calculate mean() in each group
df.groupby(s).mean().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781
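
A deterministic version of the float-index case, as a sketch with made-up values (the frame, `width`, `labels` and `out` are illustrative). It uses np.floor before casting, because a plain astype(int) truncates toward zero and would put small negative index values into bin 0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 3.0, 10.0, 20.0]},
                  index=pd.Index([1.5, 8.5, 12.0, 18.0], name="x"))

width = 10
# Bin number of each row: [0, 0, 1, 1].
labels = np.floor(df.index.to_numpy() / width).astype(int)

out = df.groupby(labels).mean()
# Label each bin with the mean of the index values that fell into it.
bin_centers = df.index.to_series().groupby(labels).mean()
out.index = pd.Index(bin_centers.to_numpy(), name="x")
print(out)
```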

EDIT : There were some errors in s indexing, now it is correct & working.
