Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Source: http://stackoverflow.com/questions/36810595/
Calculate average of every x rows in a table and create new table
Asked by Gnu
I have a long table of data (~200 rows by 50 columns) and I need to write code that calculates the mean of every two rows, for each column in the table, with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use Python 3 and I am aware of some similar questions: here, here and here. But none of these help, as I need some elegant code that works with multiple columns and produces an organised data table. By the way, my original data table was imported using pandas and is defined as a dataframe, but I could not find an easy way to do this in pandas. Help is much appreciated.
An example of the table (short version) is:
a b c d
2 50 25 26
4 11 38 44
6 33 16 25
8 37 27 25
10 28 48 32
12 47 35 45
14 8 16 7
16 12 16 30
18 22 39 29
20 9 15 47
Expected mean table:
a b c d
3 30.5 31.5 35
7 35 21.5 25
11 37.5 41.5 38.5
15 10 16 18.5
19 15.5 27 38
Answered by ayhan
You can create an artificial group using df.index//2 (or, as @DSM pointed out, using np.arange(len(df))//2, so that it works for all indices) and then use groupby:
df.groupby(np.arange(len(df))//2).mean()
Out[13]:
a b c d
0 3.0 30.5 31.5 35.0
1 7.0 35.0 21.5 25.0
2 11.0 37.5 41.5 38.5
3 15.0 10.0 16.0 18.5
4 19.0 15.5 27.0 38.0
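As a minimal sketch of the same grouping idea for an arbitrary block size, the snippet below rebuilds the sample table from the question and wraps the groupby call in a helper. The name mean_every_n_rows and its parameter n are assumptions made for illustration; they do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "b": [50, 11, 33, 37, 28, 47, 8, 12, 22, 9],
    "c": [25, 38, 16, 27, 48, 35, 16, 16, 39, 15],
    "d": [26, 44, 25, 25, 32, 45, 7, 30, 29, 47],
})


def mean_every_n_rows(frame, n):
    """Average each consecutive block of n rows (hypothetical helper)."""
    # Positional labels 0,0,...,1,1,... so the result does not depend on frame.index.
    groups = np.arange(len(frame)) // n
    return frame.groupby(groups).mean()


means = mean_every_n_rows(df, 2)
print(means)
```

Because the labels come from integer division, a trailing block with fewer than n rows is still averaged over whatever rows it contains, so no rows are silently dropped.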
Answered by seeiespi
You can approach this problem using df.rolling() to create a rolling average and then grab every second row using iloc:
df = df.rolling(2).mean()
df = df.iloc[1::2, :]
Note that with a window of 2 the first row of the rolling result is NaN, and row i holds the mean of rows i-1 and i. The slice therefore has to start at position 1 to pick up the non-overlapping pairs (rows 0-1, 2-3, ...); starting at position 0 would instead return a NaN row followed by the means of rows 1-2, 3-4, and so on. Also make sure your data is sorted the way you need it.
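To see which rows a rolling window of 2 actually pairs up, here is a small sketch that compares the two possible starting offsets against the groupby result. The six-row frame and the variable names are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# First six rows of columns a and b from the question's sample table.
df = pd.DataFrame({"a": [2, 4, 6, 8, 10, 12], "b": [50, 11, 33, 37, 28, 47]})

# Row i of the rolling result is the mean of rows i-1 and i;
# row 0 is NaN because its window is incomplete.
rolled = df.rolling(2).mean()

from_zero = rolled.iloc[::2, :]   # rows 0, 2, 4: NaN, then pairs (1,2) and (3,4)
from_one = rolled.iloc[1::2, :]   # rows 1, 3, 5: pairs (0,1), (2,3), (4,5)

# Reference result using positional groups of two rows.
grouped = df.groupby(np.arange(len(df)) // 2).mean()

print(from_zero)
print(from_one)
print(grouped)
```

Only the slice starting at position 1 reproduces the non-overlapping pair means; it keeps the original index labels 1, 3, 5, so a reset_index(drop=True) may be wanted afterwards.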
Answered by Divakar
The NumPythonic way would be to extract the elements as a NumPy array with df.values, reshape it to a 3D array with 2 elements along axis=1 and 4 (the number of columns) along axis=2, perform the average reduction along axis=1, and finally convert back to a dataframe, like so -
pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
As it turns out, you can bring in NumPy's very efficient tool np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so -
pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
Please note that the proposed approaches assume that the number of rows is divisible by 2.
Also, as noted by @DSM, to preserve the column names you need to add columns=df.columns when converting back to a DataFrame, i.e. -
pd.DataFrame(...,columns=df.columns)
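The two caveats above (divisibility and column names) can be handled together. The following is a rough sketch, assuming an all-numeric frame and assuming that dropping an incomplete trailing block is acceptable; the helper name block_mean is hypothetical and not from the original answer. It uses to_numpy(), which for a numeric frame returns the same array as df.values.

```python
import numpy as np
import pandas as pd


def block_mean(frame, n=2):
    """Mean of each complete block of n rows via reshape (hypothetical helper)."""
    values = frame.to_numpy()
    # Keep only the rows that fit into complete blocks; leftover rows are dropped.
    full = (len(values) // n) * n
    blocks = values[:full].reshape(-1, n, values.shape[1])
    # Reduce over the block axis and restore the original column names.
    return pd.DataFrame(blocks.mean(axis=1), columns=frame.columns)


# Five rows, so the last row has no partner and is ignored.
df = pd.DataFrame({"a": [2, 4, 6, 8, 10], "b": [50, 11, 33, 37, 28]})
out = block_mean(df)
print(out)
```

Dropping the remainder is only one possible policy; if the leftover rows should be averaged as a shorter block, the groupby approach handles that without extra code.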
Sample run -
>>> df
0 1 2 3
0 2 50 25 26
1 4 11 38 44
2 6 33 16 25
3 8 37 27 25
4 10 28 48 32
5 12 47 35 45
6 14 8 16 7
7 16 12 16 30
8 18 22 39 29
9 20 9 15 47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
Runtime tests -
In this section, let's compare the performance of all three approaches listed thus far, including @ayhan's solution with groupby.
In [24]: A = np.random.randint(0,9,(200,50))
In [25]: df = pd.DataFrame(A)
In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop
In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 μs per loop
In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 μs per loop
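The %timeit lines above need IPython. As a rough sketch of the same comparison in a plain script, the standard timeit module can be used instead. The function names, the seed, and the repeat counts below are assumptions for illustration, and the measured times will vary by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the benchmark above: 200 rows by 50 columns of small integers.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 9, size=(200, 50)))


def by_groupby():
    return df.groupby(np.arange(len(df)) // 2).mean()


def by_reshape():
    return pd.DataFrame(df.values.reshape(-1, 2, df.shape[1]).mean(1))


def by_einsum():
    blocks = df.values.reshape(-1, 2, df.shape[1])
    return pd.DataFrame(np.einsum('ijk->ik', blocks) / 2.0)


# Check that the three approaches agree before timing them.
reference = by_groupby().to_numpy()
assert np.allclose(reference, by_reshape().to_numpy())
assert np.allclose(reference, by_einsum().to_numpy())

for fn in (by_groupby, by_reshape, by_einsum):
    # Best of 3 runs of 100 calls each, reported per call.
    per_call = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {per_call * 1e6:.0f} us per call")
```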
Answered by piRSquared
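df.set_index(np.arange(len(df)) // 2).mean(level=0)
Note: the level argument of mean() was deprecated in pandas 1.3 and removed in pandas 2.0. On current versions the equivalent is:
df.set_index(np.arange(len(df)) // 2).groupby(level=0).mean()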