Original question: http://stackoverflow.com/questions/24980437/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas - GroupBy and then Merge on original table
Asked by WalkingDeadFan
I'm trying to write a function that aggregates a dataframe in Pandas, performs various statistical calculations on it, and then merges the result back into the original dataframe. However, I'm running into issues. This is the equivalent code in SQL:
SELECT EID,
       PCODE,
       SUM(PVALUE) AS PVALUE,
       SUM(SQRT(SC*EXP(SC-1))) AS SC,
       SUM(SI) AS SI,
       SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp
INNER JOIN foo_bar
    ON foo_bar.EID = foo_bar_grp.EID
    AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps.
Step 1: Loading the data:
IN:>>
import numpy as np
import pandas as pd

pol_dict = {'PID': [1, 1, 2, 2],
            'EID': [123, 123, 123, 123],
            'PCODE': ['GU', 'GR', 'GU', 'GR'],
            'PVALUE': [100, 50, 150, 300],
            'SI': [400, 40, 140, 140],
            'SC': [230, 23, 213, 213],
            'EE': [10000, 10000, 2000, 30000],
            }
pol_df = pd.DataFrame(pol_dict)
pol_df
OUT:>>
   EID     EE PCODE  PID  PVALUE   SC   SI
0  123  10000    GU    1     100  230  400
1  123  10000    GR    1      50   23   40
2  123   2000    GU    2     150  213  140
3  123  30000    GR    2     300  213  140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
# create aggregation dataframe
# (poagg_df is an alias of pol_df, not a copy, so PID is removed from pol_df as well)
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID', 'PCODE'])
# generate acc level aggregate
acc_df = po_grouped_df.agg({
    'PVALUE': np.sum,
    'SI': lambda x: np.sqrt(np.sum(x * np.exp(x - 1))),
    'SC': np.sum,
    'EE': np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'], how='inner', suffixes=('_Acc', '_Po'))
OUT:>> KeyError: u'no item named EID'
For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.
Please note that the end goal is to find the percentage for each column (PVALUE, SI, SC, EE), i.e.:
po_account_df['PVALUE_PCT'] = np.round(po_account_df.PVALUE_Po / po_account_df.PVALUE_Acc, 4)
Thanks!
Accepted answer by chrisb
By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
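To make the point about indices concrete, here is a minimal sketch that reuses a few column names and values from the question. It is only an illustration, not the asker's full pipeline: the variable `flat_df` is a name introduced here, and `reset_index()` is shown as one alternative way to turn the group keys back into columns.

```python
import pandas as pd

# Small frame reusing column names and values from the question.
pol_df = pd.DataFrame({
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
})

# Default groupby: the keys end up in the index, not in the columns.
acc_df = pol_df.groupby(['EID', 'PCODE']).agg({'PVALUE': 'sum'})
print(list(acc_df.index.names))  # the grouping keys
print(list(acc_df.columns))      # only the aggregated column

# reset_index() moves the keys back into ordinary columns
# (flat_df is a hypothetical name used only in this sketch).
flat_df = acc_df.reset_index()
print(list(flat_df.columns))
```

After `reset_index()`, EID and PCODE are regular columns again, so they can be used as merge keys.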
There are a couple of different ways to handle it; probably the easiest is to use the as_index parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50
1  123    GR     236   40000  1.805222e+31         350  30000        300
2  123    GU     443   12000  8.765549e+87         250  10000        100
3  123    GU     443   12000  8.765549e+87         250   2000        150

   SC_Po  SI_Po
0     23     40
1    213    140
2    230    400
3    213    140
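For reference, the whole flow can be sketched end to end as below. This is a simplified illustration under a few assumptions: only PVALUE and EE are aggregated (both with a plain sum), PID is dropped with `drop(columns=...)` on a copy rather than `del`, so the original frame keeps its PID column, and the loop over columns is added here to reach the percentage goal stated in the question.

```python
import numpy as np
import pandas as pd

pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'EE': [10000, 10000, 2000, 30000],
})

# as_index=False keeps EID and PCODE as regular columns in the result.
acc_df = (
    pol_df.drop(columns='PID')
          .groupby(['EID', 'PCODE'], as_index=False)
          .agg({'PVALUE': 'sum', 'EE': 'sum'})
)

# The merge keys now exist as columns on both sides.
po_account_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'],
                         how='inner', suffixes=('_Acc', '_Po'))

# Share of each policy row in its (EID, PCODE) group total.
for col in ['PVALUE', 'EE']:
    po_account_df[col + '_PCT'] = np.round(
        po_account_df[col + '_Po'] / po_account_df[col + '_Acc'], 4)

print(po_account_df[['PID', 'PCODE', 'PVALUE_PCT', 'EE_PCT']])
```

Within each (EID, PCODE) group the percentage columns should add up to roughly 1, which is a quick sanity check on the merge.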
Answer by vyang
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip merge.
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
PID  EID PCODE  PVALUE   SI   SC     EE  sum_pval       func_si  sum_sc  sum_ee
  1  123    GU     100  400  230  10000       250  8.765549e+87     443   12000
  1  123    GR      50   40   23  10000       350  1.805222e+31     236   40000
  2  123    GU     150  140  213   2000       250  8.765549e+87     443   12000
  2  123    GR     300  140  213  30000       350  1.805222e+31     236   40000
For more info, check out this SO answer.
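Building on the transform approach, the percentage goal from the question can be reached without any merge. The sketch below is an assumption-laden illustration rather than part of the original answer: it covers only PVALUE and EE, it applies the same function ('sum') to both columns in one transform call, and the names `cols`, `totals`, `pct`, and `result` are introduced here. Applying a different function to each column would still need separate transform calls, as in the answer above.

```python
import pandas as pd

pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'EE': [10000, 10000, 2000, 30000],
})

cols = ['PVALUE', 'EE']

# transform('sum') returns group totals aligned to the original rows.
totals = pol_df.groupby(['EID', 'PCODE'])[cols].transform('sum')

# Row value divided by its group total, rounded to four decimals.
pct = (pol_df[cols] / totals).round(4).add_suffix('_PCT')

# Attach the percentage columns to a new frame; pol_df is left unchanged.
result = pol_df.join(pct)
print(result)
```

Because the transformed totals share the index of `pol_df`, the division lines up row by row and the original row order is preserved.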