pandas - Group by and count unique values

Question

Asked by Fabio Lamanna

I have this kind of dataframe df:

User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4

What I would like to return as output is:

G,nU,ar,es
1,2,1,1
2,2,0,2
3,1,0,1
4,1,0,1

Basically, for each G, I'm counting the number of distinct User values in it (the nU column) and the occurrences of each string in C. Each User has a unique C value. For instance, in G number 1 I have two Users (111 and 112), with one occurrence of 'ar' and one of 'es' (it doesn't matter that 112 appears twice; I only need the single (112, 'es') pair). Summing the 'ar' and 'es' columns should give the nU column. So far I have tried this:

d = df.reset_index().groupby('G')['User'].nunique()

which correctly returns the count of Users but no information about the C column.
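
To reproduce this attempt, the sample data can be built inline. This is a minimal sketch, assuming the CSV text shown above is read through io.StringIO; the names csv_text and counts are illustrative and not part of the original question.

```python
import io
import pandas as pd

# Sample data from the question, inlined as CSV text (an assumption made
# here so that the example does not depend on a file on disk).
csv_text = """User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of distinct users per G; this carries no information about C.
counts = df.groupby('G')['User'].nunique()
print(counts)
```

The resulting Series has one entry per G, which matches the nU column of the desired output but lacks the per-C breakdown.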

Sorry for the confusion this might cause.

Answer 1

Answered by unutbu

Given df,

result = df.groupby(['G', 'User'])['C'].value_counts()

yields

G  User    
1  111   ar    1
   112   es    2
2  112   es    1
   113   es    1
3  113   es    2
4  114   es    1
dtype: int64

This counts each occurrence of ar and es. We really only want to count unique occurrences, so let's set each value in the Series to 1:

result[:] = 1

so that result looks like

G  User    
1  111   ar    1
   112   es    1
2  112   es    1
   113   es    1
3  113   es    1
4  114   es    1
dtype: int64

Now if we group by the first and last index levels (the G values and the C values), and sum each group,

result = result.groupby(level=['G',-1]).sum()

we get

G    
1  ar    1
   es    1
2  es    2
3  es    1
4  es    1
dtype: int64

Now we can unstack the last index level:

result = result.unstack()

to obtain

   ar  es
G        
1   1   1
2 NaN   2
3 NaN   1
4 NaN   1

Fill the NaNs with zeros:

result = result.fillna(0)
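
As a side note, Series.unstack accepts a fill_value argument, so the unstack and fillna steps could be combined into one call. The sketch below rebuilds the per-(G, C) counts by hand to stay self-contained; the variable names summed and wide, and the level name 'C' for the last index level, are assumptions made for illustration.

```python
import pandas as pd

# Per-(G, C) unique-user counts, as produced by the groupby/sum step.
# The level name 'C' is an assumption for this example.
idx = pd.MultiIndex.from_tuples(
    [(1, 'ar'), (1, 'es'), (2, 'es'), (3, 'es'), (4, 'es')],
    names=['G', 'C'],
)
summed = pd.Series([1, 1, 2, 1, 1], index=idx)

# fill_value supplies a value for the missing (G, C) combinations during
# the reshape, so no NaN appears and the counts stay integers.
wide = summed.unstack(fill_value=0)
print(wide)
```

Filling during the reshape avoids the detour through NaN, which would otherwise turn the integer counts into floats.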

Define the nU column as the sum of each row:

result['nU'] = result.sum(axis=1)

and reorder the columns:

result = result[['nU', 'ar', 'es']]

Putting it all together:

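import pandas as pd
df = pd.read_csv('data')
# Count rows per (G, User, C)
result = df.groupby(['G', 'User'])['C'].value_counts()
# Count each (G, User, C) combination only once
result[:] = 1
# Sum the ones per G and per C value (the last index level)
result = result.groupby(level=['G',-1]).sum()
# Pivot the C values into columns, then replace missing combinations with 0
result = result.unstack()
result = result.fillna(0)
# nU is the number of unique users per G
result['nU'] = result.sum(axis=1)
result = result[['nU', 'ar', 'es']]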

yields

   nU  ar  es
G            
1   2   1   1
2   2   0   2
3   1   0   1
4   1   0   1
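
For comparison, the same table can likely be reached more directly by dropping duplicate rows first and then cross-tabulating. This is an alternative sketch, not the method used in the answer above; it assumes the sample data from the question is inlined as CSV text, and the names csv_text and unique_rows are illustrative.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the example is self-contained.
csv_text = """User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep one row per (G, User, C) so repeated rows are not counted twice.
unique_rows = df.drop_duplicates(['G', 'User', 'C'])

# Count the remaining rows per G and C; absent combinations become 0.
result = pd.crosstab(unique_rows['G'], unique_rows['C'])

# Each user has a single C value, so the row sum is the number of users.
result.insert(0, 'nU', result.sum(axis=1))
print(result)
```

This relies on the question's statement that each User has exactly one C value; if that did not hold, the row sum would overcount users.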
