pandas: group a DataFrame and join the result back

Question

Asked by enneppi

Suppose we have this DataFrame:

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

The DataFrame looks like this:

     A      B         C         D
0  foo    one -1.085631  1.265936
1  bar    one  0.997345 -0.866740
2  foo    two  0.282978 -0.678886
3  bar  three -1.506295 -0.094709
4  foo    two -0.578600  1.491390
5  bar    two  1.651437 -0.638902
6  foo    one -2.426679 -0.443982
7  foo  three -0.428913 -0.434351

I want to group df by B, calculate the sum of column C multiplied by the sum of column D for each group, and finally join this grouped result with the original df. In Python:

grouped = df.groupby('B').apply(lambda group: sum(group['C'])*sum(group['D'])).reset_index()
grouped.columns = ['B', 'new_value']
df.join(grouped.set_index('B'), on='B')

Is there a more Pythonic and efficient way to solve this kind of problem?

Answer 1

Accepted answer by MaxU

Solution 1:

In [25]: df.groupby('B')[['C', 'D']].transform('sum').prod(axis=1)
Out[25]:
0    0.112635
1    0.112635
2    0.235371
3    1.023841
4    0.235371
5    0.235371
6    0.112635
7    1.023841
dtype: float64

Solution 2:

In [18]: grp = df.groupby('B')

In [19]: grp['C'].transform('sum') * grp['D'].transform('sum')
Out[19]:
0    0.112635
1    0.112635
2    0.235371
3    1.023841
4    0.235371
5    0.235371
6    0.112635
7    1.023841
dtype: float64
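
For anyone who wants to run both solutions outside an IPython session, here is a minimal self-contained sketch. It rebuilds the df from the question and checks the two transform-based expressions against an aggregate-then-join version. The variable names (via_product, via_prod, per_group, via_join) are only illustrative, and the double brackets around ['C', 'D'] assume a recent pandas version, where selecting several columns from a groupby with a bare tuple no longer works.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

grp = df.groupby('B')

# transform('sum') broadcasts each group's sum back onto the original rows,
# so the result keeps df's index and no join is needed.
via_product = grp['C'].transform('sum') * grp['D'].transform('sum')

# Same computation on both columns at once, followed by a row-wise product.
via_prod = grp[['C', 'D']].transform('sum').prod(axis=1)

# Reference: aggregate once per group, then join back on 'B'.
per_group = (grp['C'].sum() * grp['D'].sum()).rename('new_value')
via_join = df.join(per_group, on='B')['new_value']

# All three should agree up to floating-point rounding.
print(np.allclose(via_product, via_prod), np.allclose(via_product, via_join))
```

The point of transform is that its output already has one value per original row, which is why it can be assigned straight to a new column.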

Demo:

In [20]: df
Out[20]:
     A      B         C         D
0  foo    one -1.085631  1.265936
1  bar    one  0.997345 -0.866740
2  foo    two  0.282978 -0.678886
3  bar  three -1.506295 -0.094709
4  foo    two -0.578600  1.491390
5  bar    two  1.651437 -0.638902
6  foo    one -2.426679 -0.443982
7  foo  three -0.428913 -0.434351

In [21]: grp = df.groupby('B')

In [22]: df['new'] = grp['C'].transform('sum') * grp['D'].transform('sum')

In [23]: df
Out[23]:
     A      B         C         D       new
0  foo    one -1.085631  1.265936  0.112635
1  bar    one  0.997345 -0.866740  0.112635
2  foo    two  0.282978 -0.678886  0.235371
3  bar  three -1.506295 -0.094709  1.023841
4  foo    two -0.578600  1.491390  0.235371
5  bar    two  1.651437 -0.638902  0.235371
6  foo    one -2.426679 -0.443982  0.112635
7  foo  three -0.428913 -0.434351  1.023841


In [26]: df['new2'] = df.groupby('B')[['C', 'D']].transform('sum').prod(axis=1)

In [27]: df
Out[27]:
     A      B         C         D       new      new2
0  foo    one -1.085631  1.265936  0.112635  0.112635
1  bar    one  0.997345 -0.866740  0.112635  0.112635
2  foo    two  0.282978 -0.678886  0.235371  0.235371
3  bar  three -1.506295 -0.094709  1.023841  1.023841
4  foo    two -0.578600  1.491390  0.235371  0.235371
5  bar    two  1.651437 -0.638902  0.235371  0.235371
6  foo    one -2.426679 -0.443982  0.112635  0.112635
7  foo  three -0.428913 -0.434351  1.023841  1.023841

Check:

In [28]: df.new.eq(df.new2).all()
Out[28]: True

Answer 2

Answer by piRSquared

Solution 1

You can take the sum in the groupby over just columns ['C', 'D'], then perform prod across axis=1 (row-wise, across columns). The result is a reduced pd.Series whose index holds the unique values in column B. You can use join with on='B' to link it back to df. Make sure you rename the pd.Series to the name you'd like the new column to have.

df.join(df.groupby('B')[['C', 'D']].sum().prod(axis=1).rename('newCol'), on='B')
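
If the one-liner is hard to follow, it can be unpacked into its three steps. This is just a sketch that assumes the df from the question; the intermediate names sums, products and result are made up for illustration and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

# Step 1: one row per unique value of B, holding the sums of C and D.
sums = df.groupby('B')[['C', 'D']].sum()

# Step 2: multiply across the columns (axis=1), giving one value per group.
# The Series name becomes the column name after the join.
products = sums.prod(axis=1).rename('newCol')

# Step 3: for each row of df, look up its B value in the index of products.
result = df.join(products, on='B')
print(result)
```

Because the aggregation runs once per group rather than once per row, the joined Series only has as many entries as there are distinct values in B.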

Solution 2

Same idea as in Solution 1, except we use map + assign to combine the result with the existing DataFrame df:

df.assign(newCol=df.B.map(df.groupby('B')[['C', 'D']].sum().prod(axis=1)))

Both yield

     A      B         C         D    newCol
0  foo    one -1.085631  1.265936  0.112635
1  bar    one  0.997345 -0.866740  0.112635
2  foo    two  0.282978 -0.678886  0.235371
3  bar  three -1.506295 -0.094709  1.023841
4  foo    two -0.578600  1.491390  0.235371
5  bar    two  1.651437 -0.638902  0.235371
6  foo    one -2.426679 -0.443982  0.112635
7  foo  three -0.428913 -0.434351  1.023841
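
A quick way to confirm that the two approaches really agree is to build both frames and compare them. The sketch below assumes the df from the question; per_group, joined and mapped are illustrative names only. It uses pd.testing.assert_frame_equal, which raises an AssertionError if the frames differ and returns nothing otherwise.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

# One value per group: (sum of C) * (sum of D), indexed by B.
per_group = df.groupby('B')[['C', 'D']].sum().prod(axis=1)

# Solution 1: join the named Series back on column B.
joined = df.join(per_group.rename('newCol'), on='B')

# Solution 2: map each row's B value through the Series, then assign.
mapped = df.assign(newCol=df['B'].map(per_group))

# Raises if the two results differ in values, dtypes, index, or columns.
pd.testing.assert_frame_equal(joined, mapped)
```

Note that both join and assign return a new DataFrame, so df itself is left without the newCol column unless the result is assigned back.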
