Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/18825412/
How to downsample a pandas dataframe by 2x2 averaging kernel
Asked by gc5
I am trying to downsample a pandas dataframe in order to reduce granularity. For example, I want to reduce this dataframe:
1  2  3  4
2  4  3  3
2  2  1  3
3  1  3  2
to this (downsampling to obtain a 2x2 dataframe using the mean):
2.25  3.25
2     2.25
Is there a built-in or efficient way to do this, or do I have to write it myself?
Thanks
Answered by Andy Hayden
One option is to use groupby twice. Once for the index:
In [11]: df.groupby(lambda x: x // 2).mean()
Out[11]:
     0    1  2    3
0  1.5  3.0  3  3.5
1  2.5  1.5  2  2.5
and once for the columns:
In [12]: df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
Out[12]:
      0     1
0  2.25  3.25
1  2.00  2.25
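Put together as a self-contained script on current pandas, this could look like the sketch below; it uses floor division (`//`) so the group keys stay integral on Python 3, and transposes instead of groupby's `axis=1` argument, which recent pandas deprecates:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Average rows in pairs, then columns in pairs; // keeps the group
# keys integral on Python 3 (plain / would produce float keys).
rows = df.groupby(lambda i: i // 2).mean()
out = rows.T.groupby(lambda j: j // 2).mean().T
# out is the 2x2 block-mean dataframe: [[2.25, 3.25], [2.00, 2.25]]
```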
Note: a solution which only calculates the mean once might be preferable... one option is to stack, groupby, mean, and unstack, but at the moment this is a little fiddly.
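That stack/groupby/unstack route could be sketched as follows, computing every block mean in a single pass:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Flatten to a Series keyed by (row, col), group by the 2x2 block
# coordinates, take one mean per block, and pivot back to 2-D.
s = df.stack()
out = s.groupby([s.index.get_level_values(0) // 2,
                 s.index.get_level_values(1) // 2]).mean().unstack()
# out: [[2.25, 3.25], [2.00, 2.25]]
```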
This seems significantly faster than Viktor's solution:
In [21]: df = pd.DataFrame(np.random.randn(100, 100))
In [22]: %timeit df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
1000 loops, best of 3: 1.64 ms per loop
In [23]: %timeit viktor()
1 loops, best of 3: 822 ms per loop
In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:
In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))
In [32]: %timeit df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
10 loops, best of 3: 42.9 ms per loop
In [33]: %timeit viktor()
# crashes
As Viktor points out, this doesn't work with a non-integer index. If that is needed, you can store the index and columns as temporary variables and feed them back in afterwards:
df_index, df_cols, df.index, df.columns = df.index, df.columns, np.arange(len(df.index)), np.arange(len(df.columns))
res = df.groupby(...
res.index, res.columns = df_index[::2], df_cols[::2]
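A runnable sketch of that save-and-restore trick, using a hypothetical dataframe with string labels to stand in for any non-integer index (dates, names, ...):

```python
import numpy as np
import pandas as pd

# Hypothetical labelled dataframe for illustration.
df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]],
                  index=list("abcd"), columns=list("wxyz"))

# Stash the labels and substitute integer positions for the groupby.
df_index, df_cols = df.index, df.columns
df.index = np.arange(len(df.index))
df.columns = np.arange(len(df.columns))

res = df.groupby(lambda i: i // 2).mean().T.groupby(lambda j: j // 2).mean().T

# Re-attach every other original label to the downsampled result.
res.index, res.columns = df_index[::2], df_cols[::2]
# res has index ['a', 'c'], columns ['w', 'y'], values [[2.25, 3.25], [2.0, 2.25]]
```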
Answered by Viktor Kerkez
You can use the rolling_mean function applied twice, first on the columns and then on the rows, and then slice the results:
rbs = 2 # row block size
cbs = 2 # column block size
pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T,
                rbs)[rbs-1::rbs]
Which gives the same result you want, except the index will be different (but you can fix this using .reset_index(drop=True)):
      1     3
1  2.25  3.25
3  2.00  2.25
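Note that pd.rolling_mean has since been removed from pandas; a sketch of the same approach with the DataFrame.rolling method, using plain trailing windows so that no center argument or NaN handling is needed:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

rbs, cbs = 2, 2  # row / column block sizes

# A trailing window puts each pair's mean at the pair's last position,
# so slicing every cbs-th (or rbs-th) entry keeps exactly the block means.
cols = df.T.rolling(cbs).mean().iloc[cbs - 1::cbs].T
out = cols.rolling(rbs).mean().iloc[rbs - 1::rbs]
# out has index and columns [1, 3] and values [[2.25, 3.25], [2.0, 2.25]]
```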
Timing info:
In [11]: df = pd.DataFrame(np.random.randn(100, 100))
In [12]: %%timeit
         pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
100 loops, best of 3: 4.75 ms per loop
In [13]: %%timeit
         df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
100 loops, best of 3: 932 μs per loop
So it's around 5x slower than the groupby, not 800x :)

