Pandas: average of columns with the same name

Question

Asked by user3635284

I have a dataframe with columns like:

['id','name','foo1', 'foo1', 'foo1', 'foo2','foo2', 'foo3']

I would like to get a new dataframe where columns sharing the same name are averaged:

['id','name','foo1', 'foo2','foo3']

Here column foo1 would be the average of the three columns named foo1 in the original dataframe, foo2 would be the average of the two columns named foo2, and foo3 would just be foo3.

Note: id and name are not numeric and I have to keep them.

Answer 1

Accepted answer by rojeeer

The basic idea is to group by the column names and take the mean of each group.

I saw the comments on your question, so here are several different ways to achieve the goal. (Solution (3) is the best I found!)

(1) Quick solution. If only a few columns are non-numeric and they have unique names, e.g. the columns id and name, you can do the following:

First set the index to ['id', 'name'] to preserve those columns,

df = df.set_index(['id', 'name'])

then use the DataFrame.groupby function on the columns, setting axis=1 (so that columns, not rows, are grouped), and apply the mean function to each group.

df = df.groupby(by=df.columns, axis=1).mean()

And finally, reset the index to recover the ['id', 'name'] columns:

df = df.reset_index()

Here is a sample session:

In [35]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [36]: df = df.set_index(['id', 'name'])

In [37]: df = df.groupby(by=df.columns, axis=1).mean()

In [38]: df = df.reset_index()

In [39]: df
Out[39]: 
    id name  c1   c2    c3
0  001    a   1   55  1000
1  002    b   2  110  2000
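
In recent pandas releases the axis argument of DataFrame.groupby is deprecated, so the groupby call in solution (1) may warn or fail there. A minimal sketch of the same idea, assuming a newer pandas, transposes the frame, groups the rows by label, and transposes back (the variable name out is only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]],
    columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'],
    index=list('AB'),
)

out = (
    df.set_index(['id', 'name'])  # keep the non-numeric columns out of the way
      .T                          # column labels become the row index
      .groupby(level=0)           # rows sharing a label form one group
      .mean()                     # average each group of same-named rows
      .T                          # back to the original orientation
      .reset_index()              # restore id and name as ordinary columns
)
print(out)
```

The result has the columns id, name, c1, c2, c3, with c2 holding the row-wise average of the two original c2 columns.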

(2) Complete solution. If you have many non-numeric columns with unique names, you can do the following:

First transpose your dataframe,

df2 = df.transpose()

Then group by its index (with axis=0), handling each group according to its type: for numeric groups, return the mean; for non-numeric groups, return the first row:

df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

And finally, transpose back:

df = df2.transpose()

Here is a sample session:

In [98]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [99]: df2 = df.transpose()

In [100]: df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

In [101]: df3 = df2.transpose()

In [102]: df3
Out[102]: 
  c1   c2    c3   id name
A  1   55  1000  001    a
B  2  110  2000  002    b

In [103]: df
Out[103]: 
    id name  c1  c2   c2    c3
A  001    a   1  10  100  1000
B  002    b   2  20  200  2000

Note that you need to import numbers (a standard-library module) before running this.
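
Another way to keep the non-numeric columns, sketched here under the assumption that they can be identified by dtype rather than by inspecting the first value of each group, is to split the frame with select_dtypes, average only the numeric part, and join the two pieces again. The helper name average_duplicate_columns is hypothetical and not part of pandas:

```python
import pandas as pd


def average_duplicate_columns(df):
    """Average numeric columns that share a name; pass other columns through."""
    numeric = df.select_dtypes(include="number")
    other = df.select_dtypes(exclude="number")
    # Transpose so that same-named columns become same-labelled rows,
    # average each label, then transpose back.
    averaged = numeric.T.groupby(level=0).mean().T
    # Both pieces share the original row index, so they line up on concat.
    return pd.concat([other, averaged], axis=1)


df = pd.DataFrame(
    [['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]],
    columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'],
    index=list('AB'),
)
result = average_duplicate_columns(df)
print(result)
```

This sketch puts the non-numeric columns first and assumes the row index is unique, since concat aligns the two pieces on it.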

More notes:

(3) All in one! This solution is the best I found:

df.groupby(by=df.columns, axis=1).apply(lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0])

The idea is to handle each group directly on the un-transposed frame, that is,

df.groupby(by=df.columns, axis=1).apply(gf)

where

gf = lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0]

My earlier attempts failed because I did not handle the axis carefully. You must set axis=1 for the mean function, and return a column for the non-numeric groups.
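
To see why the axis matters, here is a small sketch of what a single group of same-named columns looks like and what mean returns for each axis. The frame g is a hand-built stand-in for one group, not something taken from the session above:

```python
import pandas as pd

# A stand-in for the group that groupby would pass for the label 'c2'.
g = pd.DataFrame([[10, 100], [20, 200]], columns=['c2', 'c2'], index=list('AB'))

# axis=1 averages across the columns: one value per row, indexed A and B.
row_means = g.mean(axis=1)

# The default (axis=0) averages down each column: one value per column,
# so the result is indexed by the duplicated label 'c2' instead of by row.
col_means = g.mean()

print(row_means)
print(col_means)
```

Only the axis=1 result has the shape of a single column of the final frame, which is why the lambda in solution (3) uses it.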

Thanks!
