Original question: http://stackoverflow.com/questions/40923165/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python pandas equivalent to R groupby mutate
Asked by asosnovsky
So in R, when I have a data frame consisting of, say, 4 columns (call it df) and I want to compute the ratio of c to the sum product of c and d within each group, I can do it this way:
# generate data
library(dplyr)
df = data.frame(a=c(1,1,0,1,0), b=c(1,0,0,1,0), c=c(10,5,1,5,10), d=c(3,1,2,1,2))
| a b c d |
| 1 1 10 3 |
| 1 0 5 1 |
| 0 0 1 2 |
| 1 1 5 1 |
| 0 0 10 2 |
# compute sum product ratio
df = df %>%
  group_by(a, b) %>%
  mutate(ratio = c / sum(c * d))
| a b c d ratio |
| 1 1 10 3 0.286 |
| 1 1 5 1 0.143 |
| 1 0 5 1 1 |
| 0 0 1 2 0.045 |
| 0 0 10 2 0.454 |
But in Python I need to resort to loops. I know there should be a more elegant way than raw loops in Python. Does anyone have any ideas?
Answered by Psidom
Answered by datistics
According to this thread on the pandas GitHub, we can use the transform() method to replicate the combination of dplyr::group_by() and dplyr::mutate(). For this example, it would look as follows:
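import pandas as pd

df = pd.DataFrame(
    dict(
        a=(1 , 1, 0, 1, 0 ),
        b=(1 , 0, 0, 1, 0 ),
        c=(10, 5, 1, 5, 10),
        d=(3 , 1, 2, 1, 2 ),
    )
).assign(
    # per-row product of c and d
    prod_c_d = lambda x: x['c'] * x['d'],
    # divide c by the per-group sum of prod_c_d, broadcast back to every row
    ratio = lambda x: x['c'] / x.groupby(['a','b'])['prod_c_d'].transform('sum')
)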
This example uses pandas method chaining. For more information on how to use method chaining to replicate dplyr workflows, see this blogpost.
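If the helper column prod_c_d is not wanted in the result, the same idea can be sketched without it. This is only one possible variant, not the approach from the linked thread: it groups the product Series directly by the a and b columns, and the names group_total and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 0, 1, 0],
    "b": [1, 0, 0, 1, 0],
    "c": [10, 5, 1, 5, 10],
    "d": [3, 1, 2, 1, 2],
})

# Per-row product c * d, summed within each (a, b) group.
# transform("sum") returns one value per original row, so it aligns with df.
group_total = (df["c"] * df["d"]).groupby([df["a"], df["b"]]).transform("sum")

# Equivalent of mutate(ratio = c / sum(c * d)); rows keep their original order.
result = df.assign(ratio=df["c"] / group_total)
print(result.round(3))
```

Unlike the R output shown in the question, the rows here stay in their original order rather than being listed group by group.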
The method using apply() and groupby() does not work for me because it does not seem to be adaptable. For example, it does not work if we delete g.c/ from the lambda expression.
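# Here the lambda returns one value per (a, b) group instead of one per row,
# so the result does not line up with the rows of df
df['ratio'] = df.groupby(['a','b'], group_keys=False)\
                .apply(lambda g: (g.c * g.d).sum())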
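A likely reason is that once g.c/ is removed, each group yields a single number, so the result is indexed by the group keys (a, b) rather than by the rows of df. As a rough sketch of one way to carry such a per-group value back onto the rows, the aggregate can be joined on the keys. The names per_group, sum_prod and with_total are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 0, 1, 0],
    "b": [1, 0, 0, 1, 0],
    "c": [10, 5, 1, 5, 10],
    "d": [3, 1, 2, 1, 2],
})

# One scalar per group: the result has 3 entries indexed by (a, b),
# while df has 5 rows, so it cannot be assigned to a column directly.
per_group = (
    df.assign(prod_c_d=df["c"] * df["d"])
      .groupby(["a", "b"])["prod_c_d"]
      .sum()
      .rename("sum_prod")
)

# Broadcast the per-group value back to the rows by joining on the keys.
with_total = df.join(per_group, on=["a", "b"])
with_total["ratio"] = with_total["c"] / with_total["sum_prod"]
print(with_total.round(3))
```

For a plain aggregation like this, transform('sum') is shorter; the join form may be useful when the per-group value is needed on its own as well.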