Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24980437/

Time: 2020-08-19 05:33:09  Source: igfitidea

Pandas - GroupBy and then Merge on original table

python python-2.7 pandas

Asked by WalkingDeadFan

I'm trying to write a function that aggregates a Pandas dataframe, performs various stats calculations on it, and then merges the result back onto the original dataframe. However, I'm running into issues. This is the equivalent code in SQL:

SELECT EID,
       PCODE,
       SUM(PVALUE) AS PVALUE,
       SUM(SQRT(SC*EXP(SC-1))) AS SC,
       SUM(SI) AS SI,
       SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE 

And then join on the original table:

SELECT *
FROM foo_bar_grp INNER JOIN 
foo_bar ON foo_bar.EID = foo_bar_grp.EID 
        AND foo_bar.PCODE = foo_bar_grp.PCODE

Here are the steps:

Step 1: Loading the data:

IN:>>

import numpy as np
import pandas as pd

pol_dict = {'PID':[1,1,2,2],
             'EID':[123,123,123,123],
             'PCODE':['GU','GR','GU','GR'],
             'PVALUE':[100,50,150,300],
             'SI':[400,40,140,140],
             'SC':[230,23,213,213],
             'EE':[10000,10000,2000,30000],
             }


pol_df = pd.DataFrame(pol_dict)

pol_df

OUT:>>

   EID    EE PCODE  PID  PVALUE   SC   SI
0  123  10000    GU    1     100  230  400
1  123  10000    GR    1      50   23   40
2  123   2000    GU    2     150  213  140
3  123  30000    GR    2     300  213  140

Step 2: Calculating and Grouping on the data:

My pandas code is as follows:

#create aggregation dataframe
#(poagg_df is another name for pol_df, not a copy, so the del below also removes PID from pol_df)
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])

#generate acc level aggregate
acc_df = po_grouped_df.agg({
    'PVALUE' : np.sum,
    'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
    'SC' : np.sum,
    'EE' : np.sum
})

This works fine until I want to join on the original table:

IN:>>

pol_acc_df = pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))

OUT:>> KeyError: u'no item named EID'

For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.

Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE), i.e.:

pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)

Thanks!

Accepted answer by chrisb

By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
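A minimal sketch of how to check this, using a cut-down version of the question's data (only PVALUE is kept, and the `'sum'` string stands in for `np.sum`; both are simplifications for illustration). Note that the KeyError was reported on the pandas version used in the question; newer pandas releases may be able to resolve index level names passed to `on`, so the error itself may not reproduce.

```python
import pandas as pd

# Cut-down version of the question's sample data.
pol_df = pd.DataFrame({
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
})

# Default groupby: the grouping keys become the index of the result.
acc_df = pol_df.groupby(['EID', 'PCODE']).agg({'PVALUE': 'sum'})

print(list(acc_df.columns))      # only the aggregated column
print(list(acc_df.index.names))  # the grouping keys live here
```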

There are a couple of different ways to handle it. Probably the easiest is to use the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
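As a rough illustration of two of those ways, the sketch below compares `as_index=False` with calling `reset_index()` on the default result. The names `keys`, `flat_a` and `flat_b` are made up for this example, and the data is a cut-down version of the question's table.

```python
import pandas as pd

pol_df = pd.DataFrame({
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
})
keys = ['EID', 'PCODE']  # hypothetical helper name

# Option 1: keep the grouping keys as ordinary columns from the start.
flat_a = pol_df.groupby(keys, as_index=False).agg({'PVALUE': 'sum'})

# Option 2: aggregate with the default index, then move the keys back into columns.
flat_b = pol_df.groupby(keys).agg({'PVALUE': 'sum'}).reset_index()

print(flat_a)
print(flat_b)
```

Either frame has EID and PCODE as regular columns, so it can be passed to `pd.merge` with `on=['EID', 'PCODE']`.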

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]: 
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50   
1  123    GR     236   40000  1.805222e+31         350  30000        300   
2  123    GU     443   12000  8.765549e+87         250  10000        100   
3  123    GU     443   12000  8.765549e+87         250   2000        150   

   SC_Po  SI_Po  
0     23     40  
1    213    140  
2    230    400  
3    213    140  
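Putting the pieces together, here is a self-contained sketch of the whole flow, ending with the percentage column the question asks for. It is a simplification, not the asker's exact code: only PVALUE and EE are aggregated, and a plain sum is used for both.

```python
import numpy as np
import pandas as pd

pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'EE': [10000, 10000, 2000, 30000],
})

# as_index=False keeps EID and PCODE as columns in the aggregated frame.
acc_df = pol_df.groupby(['EID', 'PCODE'], as_index=False)[['PVALUE', 'EE']].sum()

# Overlapping value columns get the _Acc (group total) and _Po (row value) suffixes.
pol_acc_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'],
                      how='inner', suffixes=('_Acc', '_Po'))

# Each row's share of its group total.
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po / pol_acc_df.PVALUE_Acc, 4)
print(pol_acc_df[['PID', 'PCODE', 'PVALUE_Po', 'PVALUE_Acc', 'PVALUE_PCT']])
```

The same division can be repeated for the other suffixed column pairs (here EE_Po and EE_Acc).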

Answer by vyang

From the pandas docs:

Transformation: perform some group-specific computations and return a like-indexed object

Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip the merge:

po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df

Results in:

PID EID PCODE   PVALUE  SI  SC  EE      sum_pval    func_si         sum_sc  sum_ee
1   123 GU      100     400 230 10000   250         8.765549e+87    443     12000
1   123 GR      50      40  23  10000   350         1.805222e+31    236     40000
2   123 GU      150     140 213 2000    250         8.765549e+87    443     12000
2   123 GR      300     140 213 30000   350         1.805222e+31    236     40000
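Building on the transform approach, the percentages from the question could be computed without any merge. This is only a sketch under one assumption that differs from the code above: every column, including SI, is divided by its plain group sum rather than by the custom square-root expression. The names `cols`, `totals`, `pct` and `result` are made up for this example.

```python
import numpy as np
import pandas as pd

pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'SI': [400, 40, 140, 140],
    'SC': [230, 23, 213, 213],
    'EE': [10000, 10000, 2000, 30000],
})
cols = ['PVALUE', 'SI', 'SC', 'EE']

# transform returns a frame aligned with pol_df's index: one group total per original row.
totals = pol_df.groupby(['EID', 'PCODE'])[cols].transform('sum')

# Row value divided by its group total, rounded as in the question.
pct = (pol_df[cols] / totals).round(4).add_suffix('_PCT')
result = pol_df.join(pct)
print(result)
```

Because the totals share the original index, the division lines up row by row and no join keys are needed.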

For more info, check out this SO answer.