Original question: http://stackoverflow.com/questions/35328979/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Sum by year and id in Pandas
Asked by Arthur Morris
I would like to understand the most compact way to replicate the following Stata command in Python 2.7 using pandas:
egen yr_id_sum = total(var_to_sum), missing by(id year)
I'd like to produce the yr_id_sum column in this table:
id year value yr_id_sum
1 2010 1 3
1 2010 2 3
1 2011 3 7
1 2011 4 7
2 2010 11 23
2 2010 12 23
2 2011 13 27
2 2011 14 27
I can do this for one grouping variable as follows (this may help clarify what I'm trying to do):
def add_sum(grp):
    # Add the group total as a new column on every row of the group
    grp['ann_sum'] = grp['var_to_sum'].sum()
    return grp

df = df.groupby('year').apply(add_sum)
This is equivalent to egen year_sum = total(var_to_sum), missing by(year).
I'm having difficulty extending answers like this one, about using sums with a MultiIndex, to my case.
df.set_index(['year', 'id'], inplace=True)
df = df.groupby(['year', 'id']).apply(add_sum)
This seems like it should do what I want, but I get Exception: cannot handle a non-unique multi-index!
Here are some of the answers that I've already looked at:
- This question about applying a user-defined function to each subgroup of a Group By in Pandas is close to what I am looking for.
- I am trying to follow this question, with an unconditional sum.
Answered by DSM
To reproduce your desired output, you could use transform: it takes the result of a groupby operation and broadcasts it back to the original index. For example:
>>> df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform(sum)
>>> df
id year value yr_id_sum
0 1 2010 1 3
1 1 2010 2 3
2 1 2011 3 7
3 1 2011 4 7
4 2 2010 11 23
5 2 2010 12 23
6 2 2011 13 27
7 2 2011 14 27
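For a version that runs as a standalone script, the sketch below rebuilds the example table and applies the same transform. The string alias "sum" and the min_count=1 argument are additions that are not in the original answer. min_count=1 is one way to approximate Stata's missing option, assuming the goal is a missing total for any group whose values are all missing. The df_missing frame is hypothetical data made up for that illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "year":  [2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011],
    "value": [1, 2, 3, 4, 11, 12, 13, 14],
})

# Group total, repeated on every row of its (id, year) group
df["yr_id_sum"] = df.groupby(["id", "year"])["value"].transform("sum")

# Assumption: hypothetical copy where one whole group is missing
df_missing = df.drop(columns="yr_id_sum")
df_missing["value"] = df_missing["value"].astype(float)
all_missing = df_missing["id"].eq(2) & df_missing["year"].eq(2011)
df_missing.loc[all_missing, "value"] = np.nan

# min_count=1 yields NaN (not 0) when a group has no non-missing values
df_missing["yr_id_sum"] = (
    df_missing.groupby(["id", "year"])["value"].transform("sum", min_count=1)
)
```

Without min_count=1, the all-missing group would get a total of 0 rather than a missing value, which may or may not be what you want.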
which is basically
>>> df.groupby(["id", "year"])["value"].sum()
id  year
1   2010     3
    2011     7
2   2010    23
    2011    27
Name: value, dtype: int64
but with each group's sum repeated on every row of that group, so that the result lines up with the original index.
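The "aggregate, then repeat" description can be made explicit. The sketch below is an illustration rather than part of the original answer: it computes the per-group sums once and merges them back onto the rows, which should produce the same column as transform. The names sums, merged, and via_transform are placeholders chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "year":  [2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011],
    "value": [1, 2, 3, 4, 11, 12, 13, 14],
})

# One row per (id, year) group holding that group's total
sums = (
    df.groupby(["id", "year"], as_index=False)["value"]
      .sum()
      .rename(columns={"value": "yr_id_sum"})
)

# A left merge repeats each group's total on every row of the group
merged = df.merge(sums, on=["id", "year"], how="left")

# The transform route, for comparison
via_transform = df.groupby(["id", "year"])["value"].transform("sum")
```

The transform call is usually the more compact of the two, since it skips the intermediate table and keeps the original index.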