Sum by year and id in Pandas

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/35328979/

Time: 2020-09-14 00:40:42  Source: igfitidea

python pandas

Asked by Arthur Morris

I would like to understand the most compact way to replicate the following Stata command in Python 2.7 using pandas: egen yr_id_sum = total(var_to_sum), missing by(id year).

I'd like to produce the yr_id_sum column in this table:

id    year    value  yr_id_sum
1     2010    1      3
1     2010    2      3
1     2011    3      7
1     2011    4      7
2     2010    11     23
2     2010    12     23
2     2011    13     27
2     2011    14     27

I can do this for one grouping variable as follows (this may help clarify what I'm trying to do):

def add_sum(grp):
    # Attach the group's total to every row of the group.
    grp['ann_sum'] = grp['var_to_sum'].sum()
    return grp

df = df.groupby('year').apply(add_sum)

This is equivalent to egen year_sum = total(var_to_sum), missing by(year).

I'm having difficulty extending answers like this one, about using sums with a MultiIndex, to my case.

df.set_index(['year', 'id'], inplace=True)
df = df.groupby(['year', 'id']).apply(add_sum)

Seems like it should do what I want it to... but I get Exception: cannot handle a non-unique multi-index!

Here are some of the answers that I've already looked at:

Answered by DSM

To reproduce your desired output, you could use transform: it takes the result of a groupby operation and broadcasts it back to the original index. For example:

>>> df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform("sum")
>>> df
   id  year  value  yr_id_sum
0   1  2010      1          3
1   1  2010      2          3
2   1  2011      3          7
3   1  2011      4          7
4   2  2010     11         23
5   2  2010     12         23
6   2  2011     13         27
7   2  2011     14         27

which is basically

>>> df.groupby(["id", "year"])["value"].sum()
id  year
1   2010     3
    2011     7
2   2010    23
    2011    27
Name: value, dtype: int64

but repeated so that each row of the original frame receives the sum for its own (id, year) group.
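
To see the broadcast concretely, here is a minimal, self-contained sketch. It assumes the sample table from the question is typed in by hand, and it rebuilds the transform result the long way by joining the grouped sums back onto the rows. The names sums, joined and joined_sum are only illustrative and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "year":  [2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011],
    "value": [1, 2, 3, 4, 11, 12, 13, 14],
})

# One sum per (id, year) group, broadcast back to every row of that group.
df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform("sum")

# The same column built the long way: aggregate, then join on the group keys.
sums = df.groupby(["id", "year"])["value"].sum().rename("joined_sum")
joined = df.join(sums, on=["id", "year"])

print(joined)
```

Both columns should hold the same values; transform simply skips the explicit join step.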

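
One caveat about the missing option in the Stata command: as far as I understand it, that option makes total() return missing for a group whose values are all missing, whereas a pandas sum returns 0 for such a group by default. A rough sketch of one way to mimic this, assuming a hypothetical frame with an all-NaN group (the data below is made up for illustration), is to pass min_count=1 through transform:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the (2, 2010) group has no observed values at all.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "year":  [2010, 2010, 2010, 2010],
    "value": [1.0, np.nan, np.nan, np.nan],
})

grouped = df.groupby(["id", "year"])["value"]

# Default behaviour: NaN is skipped, so an all-NaN group sums to 0.
df["sum_default"] = grouped.transform("sum")

# min_count=1 requires at least one non-NaN value; otherwise the result is NaN.
df["sum_missing"] = grouped.transform("sum", min_count=1)

print(df)
```

Whether this matches Stata in every edge case is worth checking against your own data before relying on it.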