Original question: http://stackoverflow.com/questions/11265116/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Most efficient way to calculate the mean of a group of columns in a pandas DataFrame
Asked by Einar
I have a DataFrame with columns like this:
["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"]
I'd like to "collapse" the various A and B columns into a single column each and calculate their mean value. In short, at the end of the operation I'd get:
["A", "B"]
where "A" holds, for each row, the mean of all "A" columns, and "B" the mean of all "B" columns.
As far as I understood, groupby is not suited for this task, or perhaps I'm using it incorrectly:
grouped = data.groupby([item for item in data if "A" not in item])
If I use axis=1, all I get is an empty DataFrame when calling mean(), and without it I don't get the desired effect. I would like to avoid building a separate DataFrame to be filled with the means via iteration (e.g. by calculating the means separately and then adding them like new_df["A"] = mean_a). Is there an efficient solution for this?
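To see why the groupby call above does not collapse the columns, here is a minimal sketch. The sample values are an assumption, since the question does not show any data. Passing a list of column names to groupby groups rows by the values in those columns, so the "A" columns are never averaged together.

```python
import pandas as pd

# Hypothetical sample data using the column layout from the question.
data = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# Same key list as in the question: every column name without an "A".
keys = [item for item in data if "A" not in item]  # ['B_1', 'B_2', 'B_3']

# A list of column names groups ROWS by the values in those columns.
# Each row here has distinct B values, so every row is its own group.
result = data.groupby(keys).mean()
print(result)
```

The result is indexed by the B values and still carries A_1, A_2 and A_3 as separate columns, which is not the desired ["A", "B"] layout.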
Answered by ely
You want to make use of the built-in mean() function, which accepts an axis argument (axis=1) to compute row-wise means. Since you know the column naming convention for each mean you want, you can use the example code below to do it very efficiently. Here I chose to just add two columns rather than destroy the existing data. I could also have put these new columns into a new data frame; it just depends on your needs and what's convenient for you. The same basic idea works in either case.
In [1]: import pandas
In [2]: dfrm = pandas.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18]], columns = ['A_1', 'A_2', 'A_3', 'B_1', 'B_2', 'B_3'])
In [3]: dfrm
Out[3]:
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18
In [4]: dfrm["A_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='A']].mean(axis=1)
In [5]: dfrm
Out[5]:
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean
0    1    2    3    4    5    6       2
1    7    8    9   10   11   12       8
2   13   14   15   16   17   18      14
In [6]: dfrm["B_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='B']].mean(axis=1)
In [7]: dfrm
Out[7]:
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean  B_mean
0    1    2    3    4    5    6       2       5
1    7    8    9   10   11   12       8      11
2   13   14   15   16   17   18      14      17
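As a variation on the session above, here is a minimal sketch that puts the means into a new data frame instead of adding columns to the existing one. It assumes every column is named as a prefix followed by an underscore; the names means and prefix are only illustrative.

```python
import pandas as pd

dfrm = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# For each prefix, select the matching columns by name and take the
# row-wise mean (axis=1). The original frame is left untouched.
means = pd.DataFrame(
    {
        prefix: dfrm.filter(regex=f"^{prefix}_").mean(axis=1)
        for prefix in ("A", "B")
    }
)
print(means)
```

Selecting with filter(regex=...) anchors the match at the start of the column name, so a column such as "BA_1" would not be picked up by the "A" group the way a plain substring test would.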
Answered by DSM
I don't know about efficient, but I might do something like this:
~/coding$ cat colgroup.dat
A_1,A_2,A_3,B_1,B_2,B_3
1,2,3,4,5,6
7,8,9,10,11,12
13,14,15,16,17,18
~/coding$ python
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df = pandas.read_csv("colgroup.dat")
>>> df
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18
>>> grouped = df.groupby(lambda x: x[0], axis=1)
>>> for i, group in grouped:
... print i, group
...
A    A_1  A_2  A_3
0    1    2    3
1    7    8    9
2   13   14   15
B    B_1  B_2  B_3
0    4    5    6
1   10   11   12
2   16   17   18
>>> grouped.mean()
key_0   A   B
0       2   5
1       8  11
2      14  17
I suppose lambda x: x.split('_')[0] would be a little more robust, since it keys on the whole prefix before the underscore rather than only the first character.
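The session above was recorded on Python 2 with an old pandas release. As far as I know, recent pandas versions deprecate axis=1 in groupby, so here is a hedged sketch of the same idea that groups the transposed frame instead. The helper name prefix is made up for illustration, and the data is built inline rather than read from colgroup.dat.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)


def prefix(label):
    """Return the part of a column label before the first underscore."""
    return label.split("_")[0]


# Transpose so the column labels become the index, group those labels
# by prefix, average within each group, then transpose back.
collapsed = df.T.groupby(prefix).mean().T
print(collapsed)
```

The double transpose copies the data, so on a very wide or very large frame the per-prefix mean(axis=1) approach may be the cheaper choice.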

