Most efficient way to calculate the mean of a group of columns in a Pandas DataFrame

Question

Asked by Einar

I have a DataFrame with columns like this:

["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"]

I'd like to "collapse" the various A and B columns into a single column each and calculate their mean value. In short, at the end of the operation I'd get:

["A", "B"]

where "A" holds, for each row, the mean of all "A" columns and "B" the mean of all "B" columns.

As far as I understand, groupby is not suited for this task, or perhaps I'm using it incorrectly:

grouped = data.groupby([item for item in data if "A" not in item])

If I use axis=1, all I get is an empty DataFrame when calling mean(), and if I don't, I'm not getting the desired effect. I would like to avoid building a separate DataFrame to be filled with the means via iteration (e.g. by calculating the means separately and then adding them like new_df["A"] = mean_a). Is there an efficient solution for this?

Answer 1

Answered by ely

You want to make use of the built-in mean() method, which accepts an axis argument; axis=1 gives row-wise means. Since you know the column naming convention for the different means that you want, you can use the example code below to do it efficiently. Here I chose to just make two additional columns rather than destroy the existing data. I could also have put these new columns into a new data frame; it just depends on what your needs are and what's convenient for you. The same basic idea works in either case.

In [1]: import pandas

In [2]: dfrm = pandas.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18]], columns = ['A_1', 'A_2', 'A_3', 'B_1', 'B_2', 'B_3'])

In [3]: dfrm
Out[3]: 
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18

In [4]: dfrm["A_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='A']].mean(axis=1)

In [5]: dfrm
Out[5]: 
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean
0    1    2    3    4    5    6       2
1    7    8    9   10   11   12       8
2   13   14   15   16   17   18      14

In [6]: dfrm["B_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='B']].mean(axis=1)

In [7]: dfrm
Out[7]: 
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean  B_mean
0    1    2    3    4    5    6       2       5
1    7    8    9   10   11   12       8      11
2   13   14   15   16   17   18      14      17
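
If a separate frame of means is preferred over extra columns, the same list-comprehension selection can be wrapped in a dict comprehension. This is only a minimal sketch and not part of the original answer: the names `prefixes` and `means`, and the choice to match on the text before the underscore, are assumptions made for illustration.

```python
import pandas as pd

# Same sample data as in the session above.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# Hypothetical list of group prefixes; assumes names look like "<prefix>_<n>".
prefixes = ["A", "B"]

# For each prefix, select the matching columns and take the row-wise mean.
means = pd.DataFrame(
    {
        p: df[[c for c in df.columns if c.split("_")[0] == p]].mean(axis=1)
        for p in prefixes
    }
)

print(means)
```

The original frame is left untouched, and `means` has one column per prefix with the same row index as `df`.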

Answer 2

Answered by DSM

I don't know about efficiency, but I might do something like this:

~/coding$ cat colgroup.dat
A_1,A_2,A_3,B_1,B_2,B_3
1,2,3,4,5,6
7,8,9,10,11,12
13,14,15,16,17,18
~/coding$ python
Python 2.7.3 (default, Apr 20 2012, 22:44:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df = pandas.read_csv("colgroup.dat")
>>> df
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18
>>> grouped = df.groupby(lambda x: x[0], axis=1)
>>> for i, group in grouped:
...     print i, group
... 
A    A_1  A_2  A_3
0    1    2    3
1    7    8    9
2   13   14   15
B    B_1  B_2  B_3
0    4    5    6
1   10   11   12
2   16   17   18
>>> grouped.mean()
key_0   A   B
0       2   5
1       8  11
2      14  17

I suppose lambda x: x.split('_')[0] would be a little more robust.
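
The session above dates from Python 2 and an early pandas release. In recent pandas versions the axis=1 argument of groupby is deprecated (since pandas 2.1, as far as I know), so the sketch below reaches the same result by transposing, grouping on the index, and transposing back. It is an adaptation under that assumption, not the answerer's original code; the variable name `grouped_means` is illustrative.

```python
import pandas as pd

# Same data as colgroup.dat, built inline so the example is self-contained.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# After transposing, the column names become the index, so the grouping
# function receives each name and returns its prefix ("A" or "B").
grouped_means = df.T.groupby(lambda name: name.split("_")[0]).mean().T

print(grouped_means)
```

The transpose copies the data, so for very wide or very large frames the per-prefix column selection with mean(axis=1) may be the cheaper option.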
