How to downsample a pandas DataFrame with a 2x2 averaging kernel

Question

Asked by gc5

I am trying to downsample a pandas DataFrame in order to reduce its granularity. For example, I want to reduce this DataFrame:

to this (downsampling to obtain a 2x2 dataframe using mean):

2.25  3.25
2     2.25

Is there a built-in or efficient way to do this, or do I have to write it myself?

Thanks

Answer 1

Answered by Andy Hayden

One option is to use groupby twice. Once for the index:

In [11]: df.groupby(lambda x: x // 2).mean()
Out[11]:
     0    1  2    3
0  1.5  3.0  3  3.5
1  2.5  1.5  2  2.5

and once for the columns:

In [12]: df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
Out[12]:
      0     1
0  2.25  3.25
1  2.00  2.25
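
For readers on a recent pandas, here is a minimal, self-contained sketch of the same two-step groupby. The 4x4 input is a hypothetical frame chosen only to be consistent with the outputs quoted above, since the original input is not shown. It uses floor division (`//`) so that it also works on Python 3, and it transposes instead of passing `axis=1`, which was deprecated for `DataFrame.groupby` in pandas 2.1.

```python
import pandas as pd

# Hypothetical 4x4 input (an assumption): picked so that the intermediate
# and final results agree with the outputs quoted in the answer.
df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Step 1: average pairs of rows. Index labels 0, 1 map to group 0,
# labels 2, 3 map to group 1.
rows = df.groupby(lambda x: x // 2).mean()

# Step 2: average pairs of columns. Transposing turns the columns into
# rows, so the same kind of grouping applies; transpose back at the end.
res = rows.T.groupby(lambda y: y // 2).mean().T

print(res)
```

This relies on the index and column labels being consecutive integers starting at 0, just like the original one-liner.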

Note: A solution that calculates the mean only once might be preferable. One option is to stack, groupby, mean, and unstack, but at the moment this is a little fiddly.
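
A rough sketch of what the stack, groupby, mean, unstack route could look like, so the mean is computed once per 2x2 block. The input frame is again a hypothetical example (the original is not shown), and `row_block` / `col_block` are illustrative names, not part of the answer.

```python
import pandas as pd

# Hypothetical 4x4 input (an assumption), with default integer labels.
df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

# Stack into a long Series indexed by (row label, column label).
s = df.stack()

# Map every cell to the 2x2 block it belongs to.
row_block = s.index.get_level_values(0).to_numpy() // 2
col_block = s.index.get_level_values(1).to_numpy() // 2

# One mean per block, then pivot the column-block level back into columns.
res = s.groupby([row_block, col_block]).mean().unstack()

print(res)
```

Because every cell is averaged exactly once, this also gives the true block mean when missing values make the per-row counts unequal, which the mean-of-means approach does not guarantee.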

This seems significantly faster than Viktor's solution:

In [21]: df = pd.DataFrame(np.random.randn(100, 100))

In [22]: %timeit df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
1000 loops, best of 3: 1.64 ms per loop

In [23]: %timeit viktor()
1 loops, best of 3: 822 ms per loop

In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:

In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))

In [32]: %timeit df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
10 loops, best of 3: 42.9 ms per loop

In [33]: %timeit viktor()
# crashes

As Viktor points out, this doesn't work with a non-integer index. If you need that, you could store the original index and columns in temporary variables and put them back afterwards:

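# Save the original labels, then switch to positional integer labels.
df_index, df_cols = df.index, df.columns
df.index, df.columns = np.arange(len(df.index)), np.arange(len(df.columns))
res = df.groupby(...
# Restore the label of the first row/column of each 2x2 block.
res.index, res.columns = df_index[::2], df_cols[::2]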
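
As a variation on the idea above, the sketch below groups by position rather than by label, so the original frame never has to be relabelled. The string labels and the names `row_blocks` / `col_blocks` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with non-integer labels (an assumption).
df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]],
                  index=list("abcd"), columns=list("wxyz"))

# Positional block ids: [0, 0, 1, 1] for both axes.
row_blocks = np.arange(len(df.index)) // 2
col_blocks = np.arange(len(df.columns)) // 2

# Group rows by position, then columns by position via a transpose.
res = df.groupby(row_blocks).mean().T.groupby(col_blocks).mean().T

# Label each block with the first original row/column label it covers.
res.index, res.columns = df.index[::2], df.columns[::2]

print(res)
```

Passing an array to `groupby` groups by position, so `df` itself is left unmodified.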

Answer 2

Answered by Viktor Kerkez

You can apply the rolling_mean function twice, first on the columns and then on the rows, and then slice the results:

rbs = 2 # row block size
cbs = 2 # column block size
pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T,
                rbs)[rbs-1::rbs]

This gives the result you want, except that the index will be different (you can fix this using .reset_index(drop=True)):

      1     3
1  2.25  3.25
3  2.00  2.25
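
`pd.rolling_mean` has since been removed from pandas; the `DataFrame.rolling(...).mean()` method is its replacement. Below is a minimal sketch of the same rolling-then-slice idea with the current API. It is a rewrite under assumptions rather than the original code: the input frame is hypothetical, the rows are processed before the columns, and `center=True` is left out.

```python
import pandas as pd

# Hypothetical 4x4 input (an assumption); the original frame is not shown.
df = pd.DataFrame([[1, 2, 3, 4],
                   [2, 4, 3, 3],
                   [2, 2, 1, 3],
                   [3, 1, 3, 2]])

rbs = 2  # row block size
cbs = 2  # column block size

# Rolling mean down the rows, then keep only the last row of each block,
# which holds the mean over that whole block of rows.
rows = df.rolling(rbs).mean().iloc[rbs - 1::rbs]

# Same along the columns, by transposing first and transposing back.
res = rows.T.rolling(cbs).mean().iloc[cbs - 1::cbs].T

# Labels are now 1 and 3 on both axes; renumber them from 0 if preferred.
clean = res.reset_index(drop=True)
clean.columns = range(clean.shape[1])

print(res)
print(clean)
```

The rolling window computes a mean at every position and the slice then discards all but one per block, which is extra work compared with the groupby approach.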

Timing info:

In [11]: df = pd.DataFrame(np.random.randn(100, 100))
In [12]: %%timeit
         pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
100 loops, best of 3: 4.75 ms per loop
In [13]: %%timeit
         df.groupby(lambda x: x // 2).mean().groupby(lambda y: y // 2, axis=1).mean()
100 loops, best of 3: 932 μs per loop

So it's around 5x slower than the groupby approach, not 800x :)
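
To repeat this comparison outside IPython, something along these lines could be used. It is only a sketch: `by_groupby` and `by_rolling` are hypothetical helper names, both are written against the current pandas API rather than the exact code timed above, and the numbers will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))


def by_groupby(frame):
    """2x2 block mean via two groupby passes (rows, then columns)."""
    rows = frame.groupby(lambda x: x // 2).mean()
    return rows.T.groupby(lambda y: y // 2).mean().T


def by_rolling(frame):
    """2x2 block mean via rolling means followed by slicing."""
    rows = frame.rolling(2).mean().iloc[1::2]
    return rows.T.rolling(2).mean().iloc[1::2].T


runs = 20
t_groupby = timeit.timeit(lambda: by_groupby(df), number=runs) / runs
t_rolling = timeit.timeit(lambda: by_rolling(df), number=runs) / runs

print("groupby: %.3f ms per call" % (t_groupby * 1e3))
print("rolling: %.3f ms per call" % (t_rolling * 1e3))
```

Both helpers should return the same values up to floating-point rounding, which is worth checking before trusting any timing.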
