pandas - group and count nunique values
Original question: http://stackoverflow.com/questions/30213185/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Fabio Lamanna
I have this kind of dataframe df:
User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
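To follow along, the sample above can be loaded into a DataFrame. This is a minimal sketch that assumes the rows are available as CSV text; csv_text is an illustrative name and not part of the original question.

```python
import io

import pandas as pd

# The sample rows from the question, inlined as CSV text (assumption:
# in practice they would usually come from a file).
csv_text = """User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```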
What I would like to return as output is:
G,nU,ar,es
1,2,1,1
2,2,0,2
3,1,0,1
4,1,0,1
Basically, for each G, I'm counting the number of distinct User values inside it in the nU column, along with the occurrences of each string in C. Each User has a unique C value.
For instance, in G number 1 I have two Users (111 and 112), with one occurrence in 'ar' and one in 'es' (it doesn't matter that 112 occurs twice; I just need the single (112,'es') pair). Summing the 'ar' and 'es' columns should give the nU column. So far I tried this:
d = df.reset_index().groupby('G')['User'].nunique()
which correctly returns the count of Users but gives no information about the C column.
Sorry for the confusion this might cause.
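For reference, the attempt in the question can be reproduced on the sample data roughly as follows. This is only a sketch: the rows are inlined with io.StringIO (an assumption about how df was loaded), n_users is an illustrative name, and reset_index() is left out because it does not affect this particular result.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the snippet runs on its own.
csv_text = """User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of distinct users in each G group; the C column is not involved.
n_users = df.groupby('G')['User'].nunique()
print(n_users)
```

The result is a Series indexed by G, which matches the desired nU column but carries nothing about 'ar' or 'es'.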
Answered by unutbu
Given df,
result = df.groupby(['G', 'User'])['C'].value_counts()
yields
G  User
1  111   ar    1
   112   es    2
2  112   es    1
   113   es    1
3  113   es    2
4  114   es    1
dtype: int64
This counts every occurrence of ar and es. We really only want to count unique occurrences, so let's set each value in the Series to 1:
result[:] = 1
so that result looks like
G  User
1  111   ar    1
   112   es    1
2  112   es    1
   113   es    1
3  113   es    1
4  114   es    1
dtype: int64
Now if we group by the first and last index levels (the G values and the C values), and sum each group,
result = result.groupby(level=['G',-1]).sum()
we get
G
1  ar    1
   es    1
2  es    2
3  es    1
4  es    1
dtype: int64
Now we can unstack the last index level:
result = result.unstack()
to obtain
   ar  es
G
1   1   1
2 NaN   2
3 NaN   1
4 NaN   1
Fill the NaNs with zeros:
result = result.fillna(0)
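As a side note, the unstack and fillna steps can likely be merged, since Series.unstack accepts a fill_value argument. The sketch below is not part of the original answer; counts is a hypothetical stand-in for the summed Series at this point, rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the Series obtained after the groupby/sum step.
idx = pd.MultiIndex.from_tuples(
    [(1, 'ar'), (1, 'es'), (2, 'es'), (3, 'es'), (4, 'es')],
    names=['G', 'C'],
)
counts = pd.Series([1, 1, 2, 1, 1], index=idx)

# fill_value=0 fills the missing (G, C) combinations while unstacking,
# so no NaN is introduced and the integer dtype is kept.
wide = counts.unstack(fill_value=0)
print(wide)
```

Filling during the unstack avoids the intermediate NaN values, which would otherwise turn the columns into floats.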
Define the nU column as the sum of each row:
result['nU'] = result.sum(axis=1)
and reorder the columns:
result = result[['nU', 'ar', 'es']]
Putting it all together:
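import pandas as pd
# 'data' is assumed to be a CSV file containing the User,C,G rows from the question
df = pd.read_csv('data')
# Count occurrences of each C value within every (G, User) pair
result = df.groupby(['G', 'User'])['C'].value_counts()
# Repeated rows should count once, so set every count to 1
result[:] = 1
# Sum per (G, C): the number of distinct users for each C value within G
result = result.groupby(level=['G', -1]).sum()
# Move the C values into columns
result = result.unstack()
# (G, C) combinations that never occur become 0 instead of NaN
result = result.fillna(0)
# nU is the row total across the C columns
result['nU'] = result.sum(axis=1)
result = result[['nU', 'ar', 'es']]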
yields
   nU  ar  es
G
1   2   1   1
2   2   0   2
3   1   0   1
4   1   0   1
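A different route to the same table, offered here only as a sketch and not as the answerer's method, is to drop duplicate rows first and then cross-tabulate G against C with pd.crosstab. The names pairs and table are illustrative, and the sample rows are inlined so the snippet is self-contained.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the snippet runs on its own.
csv_text = """User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep one row per (G, User, C) so repeated rows are not counted twice.
pairs = df.drop_duplicates(['G', 'User', 'C'])

# Rows are G values, columns are C values, cells count the distinct users.
table = pd.crosstab(pairs['G'], pairs['C'])

# nU is the row total; insert it as the first column.
table.insert(0, 'nU', table.sum(axis=1))
table.columns.name = None
print(table)
```

Because each User has a single C value, the row totals of the cross-tabulation equal the number of distinct users per G, which is the property the question relies on.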

