Original question: http://stackoverflow.com/questions/41925614/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
pandas dataframe groupby and join
Asked by enneppi
Suppose we have this:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
The dataframe looks like this:
A B C D
0 foo one -1.085631 1.265936
1 bar one 0.997345 -0.866740
2 foo two 0.282978 -0.678886
3 bar three -1.506295 -0.094709
4 foo two -0.578600 1.491390
5 bar two 1.651437 -0.638902
6 foo one -2.426679 -0.443982
7 foo three -0.428913 -0.434351
I want to group df by B, calculate the sum of column C multiplied by the sum of column D for each group, and finally join this grouped result with the original df. In Python:
grouped = df.groupby('B').apply(lambda group: sum(group['C'])*sum(group['D'])).reset_index()
grouped.columns = ['B', 'new_value']
df.join(grouped.set_index('B'), on='B')
Is there a more Pythonic and efficient way to solve this kind of problem?
Accepted answer by MaxU
Solution 1:
In [25]: df.groupby('B')[['C', 'D']].transform('sum').prod(axis=1)
Out[25]:
0 0.112635
1 0.112635
2 0.235371
3 1.023841
4 0.235371
5 0.235371
6 0.112635
7 1.023841
dtype: float64
Solution 2:
In [18]: grp = df.groupby('B')
In [19]: grp['C'].transform('sum') * grp['D'].transform('sum')
Out[19]:
0 0.112635
1 0.112635
2 0.235371
3 1.023841
4 0.235371
5 0.235371
6 0.112635
7 1.023841
dtype: float64
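Both solutions rely on transform returning one value per original row rather than one per group. The sketch below contrasts the two shapes. It is a minimal illustration, assuming a recent pandas release where several columns are selected with a list (the double brackets); the C and D values are copied from the table in the question, rounded to six decimals, and the variable names are only illustrative.

```python
import pandas as pd

# Values copied from the question's table (rounded to six decimals).
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [-1.085631, 0.997345, 0.282978, -1.506295,
          -0.578600, 1.651437, -2.426679, -0.428913],
    'D': [1.265936, -0.866740, -0.678886, -0.094709,
          1.491390, -0.638902, -0.443982, -0.434351],
})

# Aggregation: one row per unique value of B.
sums_per_group = df.groupby('B')[['C', 'D']].sum()

# Transform: each group's sum is broadcast back to the rows of that group,
# so the result keeps the index of df.
sums_per_row = df.groupby('B')[['C', 'D']].transform('sum')

# Multiply the two broadcast sums row by row.
new = sums_per_row.prod(axis=1)

print(sums_per_group.shape, sums_per_row.shape)
```

Because the transformed result shares its index with df, it can be assigned directly as a new column without a join.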
Demo:
In [20]: df
Out[20]:
A B C D
0 foo one -1.085631 1.265936
1 bar one 0.997345 -0.866740
2 foo two 0.282978 -0.678886
3 bar three -1.506295 -0.094709
4 foo two -0.578600 1.491390
5 bar two 1.651437 -0.638902
6 foo one -2.426679 -0.443982
7 foo three -0.428913 -0.434351
In [21]: grp = df.groupby('B')
In [22]: df['new'] = grp['C'].transform('sum') * grp['D'].transform('sum')
In [23]: df
Out[23]:
A B C D new
0 foo one -1.085631 1.265936 0.112635
1 bar one 0.997345 -0.866740 0.112635
2 foo two 0.282978 -0.678886 0.235371
3 bar three -1.506295 -0.094709 1.023841
4 foo two -0.578600 1.491390 0.235371
5 bar two 1.651437 -0.638902 0.235371
6 foo one -2.426679 -0.443982 0.112635
7 foo three -0.428913 -0.434351 1.023841
In [26]: df['new2'] = df.groupby('B')[['C', 'D']].transform('sum').prod(axis=1)
In [27]: df
Out[27]:
A B C D new new2
0 foo one -1.085631 1.265936 0.112635 0.112635
1 bar one 0.997345 -0.866740 0.112635 0.112635
2 foo two 0.282978 -0.678886 0.235371 0.235371
3 bar three -1.506295 -0.094709 1.023841 1.023841
4 foo two -0.578600 1.491390 0.235371 0.235371
5 bar two 1.651437 -0.638902 0.235371 0.235371
6 foo one -2.426679 -0.443982 0.112635 0.112635
7 foo three -0.428913 -0.434351 1.023841 1.023841
Check:
In [28]: df.new.eq(df.new2).all()
Out[28]: True
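The check above compares floats for exact equality, which holds here. As a hedged alternative, two computations that take different paths can in principle differ in the last bits, so a tolerance-based comparison is a more forgiving test. This sketch uses np.allclose on values copied from the question's table; it is an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Values copied from the question's table (rounded to six decimals).
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [-1.085631, 0.997345, 0.282978, -1.506295,
          -0.578600, 1.651437, -2.426679, -0.428913],
    'D': [1.265936, -0.866740, -0.678886, -0.094709,
          1.491390, -0.638902, -0.443982, -0.434351],
})

grp = df.groupby('B')
new = grp['C'].transform('sum') * grp['D'].transform('sum')
new2 = grp[['C', 'D']].transform('sum').prod(axis=1)

# True when every pair of values agrees within np.allclose's default tolerances.
same_within_tolerance = bool(np.allclose(new, new2))
print(same_within_tolerance)
```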
Answer by piRSquared
Solution 1
You can take the sum in the groupby over just columns ['C', 'D'], then perform prod across axis=1 (row-wise, across columns). The result is a reduced pd.Series whose index is the unique values in column B. You can use join with on='B' to link it back to the original dataframe. Make sure you rename the pd.Series to the name you'd like the new column to have.
df.join(df.groupby('B')[['C', 'D']].sum().prod(axis=1).rename('newCol'), on='B')
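To make the intermediate steps of the join approach visible, here is a self-contained sketch that builds the per-group Series first and then joins it. It assumes a recent pandas release (list-style column selection) and uses values copied from the question's table; per_group and joined are illustrative names.

```python
import pandas as pd

# Values copied from the question's table (rounded to six decimals).
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [-1.085631, 0.997345, 0.282978, -1.506295,
          -0.578600, 1.651437, -2.426679, -0.428913],
    'D': [1.265936, -0.866740, -0.678886, -0.094709,
          1.491390, -0.638902, -0.443982, -0.434351],
})

# One value per unique B: (sum of C) * (sum of D).
# rename() gives the Series the name that becomes the new column label.
per_group = df.groupby('B')[['C', 'D']].sum().prod(axis=1).rename('newCol')

# Match df's column B against the index of per_group.
joined = df.join(per_group, on='B')
print(joined)
```

The rename step matters because join needs a named Series to know what to call the resulting column.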
Solution 2
Same idea as in solution 1, except we use map + assign to combine the result with the existing dataframe df.
df.assign(newCol=df.B.map(df.groupby('B')[['C', 'D']].sum().prod(axis=1)))
Both yield:
A B C D newCol
0 foo one -1.085631 1.265936 0.112635
1 bar one 0.997345 -0.866740 0.112635
2 foo two 0.282978 -0.678886 0.235371
3 bar three -1.506295 -0.094709 1.023841
4 foo two -0.578600 1.491390 0.235371
5 bar two 1.651437 -0.638902 0.235371
6 foo one -2.426679 -0.443982 0.112635
7 foo three -0.428913 -0.434351 1.023841
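As a rough sketch of how the map + assign variant relates to the join variant, the example below uses the per-group Series as a lookup table keyed by B and compares the two results. It assumes a recent pandas release and values copied from the question's table; lookup, via_map and via_join are illustrative names.

```python
import pandas as pd

# Values copied from the question's table (rounded to six decimals).
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [-1.085631, 0.997345, 0.282978, -1.506295,
          -0.578600, 1.651437, -2.426679, -0.428913],
    'D': [1.265936, -0.866740, -0.678886, -0.094709,
          1.491390, -0.638902, -0.443982, -0.434351],
})

# Per-group product of sums, indexed by the unique values of B.
lookup = df.groupby('B')[['C', 'D']].sum().prod(axis=1)

# map() looks up each row's B value in the index of lookup;
# assign() returns a new dataframe and leaves df unchanged.
via_map = df.assign(newCol=df['B'].map(lookup))

# The join variant, for comparison.
via_join = df.join(lookup.rename('newCol'), on='B')

max_difference = (via_map['newCol'] - via_join['newCol']).abs().max()
print(max_difference)
```

Neither variant modifies df in place, so the result has to be assigned to a name to be kept.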