Python Pandas - GroupBy, then merge on the original table

Question

Asked by WalkingDeadFan

I'm trying to write a function to aggregate and perform various stats calculations on a dataframe in Pandas and then merge the result back onto the original dataframe; however, I'm running into issues. This is the equivalent code in SQL:

SELECT EID,
       PCODE,
       SUM(PVALUE) AS PVALUE,
       SUM(SQRT(SC*EXP(SC-1))) AS SC,
       SUM(SI) AS SI,
       SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE

And then join on the original table:

SELECT *
FROM foo_bar_grp INNER JOIN 
foo_bar ON foo_bar.EID = foo_bar_grp.EID 
        AND foo_bar.PCODE = foo_bar_grp.PCODE

Here are the steps.

Step 1: Loading the data:

IN:>>

import numpy as np
import pandas as pd

pol_dict = {'PID':[1,1,2,2],
            'EID':[123,123,123,123],
            'PCODE':['GU','GR','GU','GR'],
            'PVALUE':[100,50,150,300],
            'SI':[400,40,140,140],
            'SC':[230,23,213,213],
            'EE':[10000,10000,2000,30000],
            }

pol_df = pd.DataFrame(pol_dict)

pol_df

OUT:>>

   EID    EE PCODE  PID  PVALUE   SC   SI
0  123  10000    GU    1     100  230  400
1  123  10000    GR    1      50   23   40
2  123   2000    GU    2     150  213  140
3  123  30000    GR    2     300  213  140

Step 2: Calculating and Grouping on the data:

My pandas code is as follows:

#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])

#generate acc level aggregate
acc_df = po_grouped_df.agg({
    'PVALUE' : np.sum,
    'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
    'SC' : np.sum,
    'EE' : np.sum
})

This works fine until I want to join on the original table:

IN:>>

po_account_df = pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))

OUT:>> KeyError: u'no item named EID'

For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.

Please note, the end goal is to compute, for each column (PVALUE, SI, SC, EE), each row's value as a fraction of its group aggregate, e.g.:

po_account_df['PVALUE_PCT'] = np.round(po_account_df.PVALUE_Po/po_account_df.PVALUE_Acc,4)

Thanks!

Answer 1

Accepted answer by chrisb

By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
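
To see this directly, here is a minimal sketch on a cut-down version of the question's data (only the PVALUE column is kept, which is an assumption made for brevity). It inspects where the grouping keys end up after a default groupby. Note that recent pandas releases may accept index level names in the `on` argument of merge, so the exact error can differ by version.

```python
import pandas as pd

# Cut-down version of the question's data (assumption: PVALUE only).
pol_df = pd.DataFrame({
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
})

# Default groupby: as_index=True, so the keys become a MultiIndex.
acc_df = pol_df.groupby(['EID', 'PCODE']).agg({'PVALUE': 'sum'})

index_names = list(acc_df.index.names)  # where the grouping keys live
column_names = list(acc_df.columns)     # only the aggregated column remains

print(index_names)
print(column_names)
```

The grouping keys show up in `index_names` rather than in `column_names`, which is what the KeyError in the question is complaining about.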

There are a couple of different ways to handle it; probably the easiest is to use the as_index parameter when you define the groupby object.

po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]: 
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50   
1  123    GR     236   40000  1.805222e+31         350  30000        300   
2  123    GU     443   12000  8.765549e+87         250  10000        100   
3  123    GU     443   12000  8.765549e+87         250   2000        150   

   SC_Po  SI_Po  
0     23     40  
1    213    140  
2    230    400  
3    213    140
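
Another of the "couple of different ways" is to leave the groupby as it is and call reset_index() on the aggregated result, which moves the grouping keys back into ordinary columns. The sketch below is not the answer's own code: it assumes a reduced set of columns and plain sums (the custom SI function is left out), then merges and computes the percentage column the question asks for.

```python
import numpy as np
import pandas as pd

# Reduced copy of the question's data (assumption: PVALUE and EE only).
pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'EE': [10000, 10000, 2000, 30000],
})

# Aggregate, then move the grouping keys from the index back into columns.
acc_df = (
    pol_df.groupby(['EID', 'PCODE'])[['PVALUE', 'EE']]
    .sum()
    .reset_index()
)

# The keys are now real columns, so the merge can find them.
po_account_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'],
                         how='inner', suffixes=('_Acc', '_Po'))

# Each row's share of its group total, as in the question's end goal.
po_account_df['PVALUE_PCT'] = np.round(
    po_account_df['PVALUE_Po'] / po_account_df['PVALUE_Acc'], 4)

print(po_account_df[['EID', 'PCODE', 'PID', 'PVALUE_Po', 'PVALUE_Acc', 'PVALUE_PCT']])
```

Unlike the question's code, this groups pol_df directly instead of deleting PID from an alias of it, so PID survives in the merged result.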

Answer 2

Answer by vyang

From the pandas docs:

Transformation: perform some group-specific computations and return a like-indexed object

Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip the merge:

po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df

Results in:

PID EID PCODE   PVALUE  SI  SC  EE      sum_pval    func_si         sum_sc  sum_ee
1   123 GU      100     400 230 10000   250         8.765549e+87    443     12000
1   123 GR      50      40  23  10000   350         1.805222e+31    236     40000
2   123 GU      150     140 213 2000    250         8.765549e+87    443     12000
2   123 GR      300     140 213 30000   350         1.805222e+31    236     40000
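
Since the transformed columns line up row for row with the original frame, the percentages from the question can be computed by plain division, with no merge. The following is a rough sketch rather than the asker's exact calculation: it assumes a plain sum as the group aggregate for every column (the custom SI function is not applied), and the `_PCT` suffix is a name chosen here for illustration.

```python
import pandas as pd

pol_df = pd.DataFrame({
    'PID': [1, 1, 2, 2],
    'EID': [123, 123, 123, 123],
    'PCODE': ['GU', 'GR', 'GU', 'GR'],
    'PVALUE': [100, 50, 150, 300],
    'SI': [400, 40, 140, 140],
    'SC': [230, 23, 213, 213],
    'EE': [10000, 10000, 2000, 30000],
})

value_cols = ['PVALUE', 'SI', 'SC', 'EE']

# transform('sum') returns a frame with the same index as pol_df,
# holding each row's group total in every value column.
group_totals = pol_df.groupby(['EID', 'PCODE'])[value_cols].transform('sum')

# Element-wise division gives each row's share of its group total.
pct = (pol_df[value_cols] / group_totals).round(4).add_suffix('_PCT')

# Attach the percentage columns; the indexes already match.
result = pol_df.join(pct)
print(result)
```

A single function applied across several columns works this way; mixing different functions per column still needs one transform call per column, as in the code above.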

For more info, check out this SO answer.
