pandas: calculate the mean of every x rows in a table and create a new table

Question

Asked by Gnu

I have a long table of data (~200 rows by 50 columns) and I need to write code that calculates the mean of every two rows, for each column in the table, with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use python3 and I am aware of some similar questions: here, here and here. But none of these helps, as I need some elegant code that works with multiple columns and produces an organised data table. By the way, my original data table was imported using pandas and is defined as a dataframe, but I could not find an easy way to do this in pandas. Help is much appreciated.

An example of the table (short version) is:

a   b   c   d
2   50  25  26
4   11  38  44
6   33  16  25
8   37  27  25
10  28  48  32
12  47  35  45
14  8   16  7
16  12  16  30
18  22  39  29
20  9   15  47

Expected mean table:

a    b     c     d
3   30.5  31.5  35
7   35    21.5  25
11  37.5  41.5  38.5
15  10    16    18.5
19  15.5  27    38

Answer 1

Answered by ayhan

You can create an artificial group using df.index//2 (or, as @DSM pointed out, using np.arange(len(df))//2, so that it works for any index) and then use groupby:

df.groupby(np.arange(len(df))//2).mean()
Out[13]: 
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0
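
For readers who want to run this end to end, here is a minimal sketch that rebuilds the short example table from the question and wraps the same groupby idea in a function. The helper name `mean_every_n_rows` and its parameter `n` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The short example table from the question.
df = pd.DataFrame({
    "a": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "b": [50, 11, 33, 37, 28, 47, 8, 12, 22, 9],
    "c": [25, 38, 16, 27, 48, 35, 16, 16, 39, 15],
    "d": [26, 44, 25, 25, 32, 45, 7, 30, 29, 47],
})


def mean_every_n_rows(frame, n):
    """Average each consecutive block of n rows (hypothetical helper name)."""
    # Positional labels 0, 0, ..., 1, 1, ... do not depend on the frame's index.
    groups = np.arange(len(frame)) // n
    return frame.groupby(groups).mean()


means = mean_every_n_rows(df, 2)
print(means)
```

If the number of rows is not a multiple of `n`, the last group simply contains fewer rows and is averaged over those.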

Answer 2

Answered by seeiespi

You can approach this problem using df.rolling() to create a rolling average and then grab every second row using iloc:

df = df.rolling(2).mean()
df = df.iloc[1::2, :]

Note that the first row of the rolling result is NaN, because the window ending at the first row contains only one observation. Row i holds the mean of rows i-1 and i, so the slice starts at position 1 to pick up the pairs (0, 1), (2, 3), and so on; starting at position 0 would return the NaN row followed by the pairs (1, 2), (3, 4). Also make sure your data is sorted the way you need it.
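
To see which rows a two-row rolling window actually pairs up, a small check on a single column helps. This is only a sketch on made-up values taken from column a of the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 4, 6, 8, 10, 12]})

# Row i of the result is the mean of rows i-1 and i; row 0 has no predecessor.
rolled = df.rolling(2).mean()

# Positions 1, 3, 5 hold the means of the pairs (0, 1), (2, 3), (4, 5).
pairs = rolled.iloc[1::2].reset_index(drop=True)

# Positions 0, 2, 4 hold NaN, then the means of the pairs (1, 2), (3, 4).
shifted = rolled.iloc[::2]

print(pairs)
print(shifted)
```

The choice of starting offset therefore decides whether the result matches the non-overlapping pairs asked for in the question.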

Answer 3

Answered by Divakar

A NumPythonic way would be to extract the elements as a NumPy array with df.values, reshape it to a 3D array with 2 elements along axis=1 and 4 (the number of columns) along axis=2, perform the mean reduction along axis=1, and finally convert back to a dataframe, like so -

pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))

As it turns out, you can bring in NumPy's very efficient tool np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so -

pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)

Please note that the proposed approaches assume that the number of rows is divisible by 2.

Also, as noted by @DSM, to preserve the column names you need to add columns=df.columns when converting back to a DataFrame, i.e. -

pd.DataFrame(...,columns=df.columns)
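
The two caveats above (row count divisible by the block size, and keeping column names) can be combined in one small function. This is a sketch under assumptions: the helper name `block_mean_numpy` is hypothetical, and dropping a trailing partial block is one possible policy that the original answer does not prescribe.

```python
import numpy as np
import pandas as pd


def block_mean_numpy(frame, n=2):
    """Mean of each full block of n rows via reshape (hypothetical helper)."""
    # Assumption: rows that do not fill a complete block are dropped.
    usable = (len(frame) // n) * n
    values = frame.to_numpy()[:usable]
    # Shape becomes (number of blocks, n, number of columns).
    blocks = values.reshape(-1, n, frame.shape[1])
    # Average over the block axis and keep the original column names.
    return pd.DataFrame(blocks.mean(axis=1), columns=frame.columns)


# Five rows: the last row has no partner and is ignored here.
df = pd.DataFrame({"a": [2, 4, 6, 8, 10], "b": [50, 11, 33, 37, 28]})
result = block_mean_numpy(df)
print(result)
```

If the trailing rows should be averaged rather than dropped, the groupby approach from the first answer handles that case without extra code.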

Sample run -

>>> df
    0   1   2   3
0   2  50  25  26
1   4  11  38  44
2   6  33  16  25
3   8  37  27  25
4  10  28  48  32
5  12  47  35  45
6  14   8  16   7
7  16  12  16  30
8  18  22  39  29
9  20   9  15  47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0

Runtime tests -

In this section, let's time all three approaches listed thus far, including @ayhan's groupby solution.

In [24]: A = np.random.randint(0,9,(200,50))

In [25]: df = pd.DataFrame(A)

In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop

In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 μs per loop

In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 μs per loop
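
The timings above come from an IPython session. Outside IPython, a comparable measurement could be sketched with the standard timeit module, as below. The function names are illustrative, and the numbers printed will depend on the machine and on the pandas and NumPy versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 9, size=(200, 50)))


def via_groupby(frame):
    return frame.groupby(np.arange(len(frame)) // 2).mean()


def via_reshape(frame):
    blocks = frame.to_numpy().reshape(-1, 2, frame.shape[1])
    return pd.DataFrame(blocks.mean(axis=1))


def via_einsum(frame):
    blocks = frame.to_numpy().reshape(-1, 2, frame.shape[1])
    # Sum over the block axis, then scale down by the block size.
    return pd.DataFrame(np.einsum("ijk->ik", blocks) / 2.0)


# Confirm the three approaches agree before comparing their speed.
expected = via_groupby(df).to_numpy()
assert np.allclose(expected, via_reshape(df).to_numpy())
assert np.allclose(expected, via_einsum(df).to_numpy())

runs = 200
for func in (via_groupby, via_reshape, via_einsum):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} microseconds per call")
```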

Answer 4

Answered by piRSquared

# mean(level=0) was removed in pandas 2.0; groupby(level=0).mean() is the equivalent.
df.set_index(np.arange(len(df)) // 2).groupby(level=0).mean()
