Pandas combine two group by's, filter and merge the groups (counts)

Warning: this content is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/44325204/
Asked by Sheepy
I have a dataframe for which I need to combine two different groupbys, with one of them filtered.
ID EVENT SUCCESS
1 PUT Y
2 POST Y
2 PUT N
1 DELETE Y
The table below is how I would like the data to look. The first grouping counts the 'EVENT' values per ID; the second counts the number of successes ('Y') per ID.
ID PUT POST DELETE SUCCESS
1 1 0 1 2
2 1 1 0 1
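For reproducibility, the sample frame above can be rebuilt with a minimal sketch like the following (only the values come from the question; the construction itself is assumed):

import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})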
I've tried a few techniques, and the closest I've found is two separate methods which yield the following.
group_df = df.groupby(['ID', 'EVENT'])
count_group_df = group_df.size().unstack()
Which yields the following for the 'EVENT' counts:
ID PUT POST DELETE
1 1 0 1
2 1 1 0
For the successes with the filter, I don't know whether I can join this to the first set on 'ID'.
df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']]
count_group_df_2 = df_success.groupby(['ID', 'SUCCESS']).size()  # .size() needed; a bare groupby only returns a GroupBy object
ID SUCCESS
1 2
2 1
I need to combine these somehow?
Additionally, I'd also like to merge the counts of two of the 'EVENT's, for example PUT's and POST's, into one column.
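A minimal sketch for this follow-up, assuming a zero-filled counts table like the desired output above (the counts and PUT_POST names are hypothetical):

# Sum the PUT and POST counts into one column, then drop the originals.
# Assumes counts was built with fill_value=0, so there are no NaNs.
counts['PUT_POST'] = counts['PUT'] + counts['POST']
counts = counts.drop(columns=['PUT', 'POST'])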
Accepted answer by jezrael
Use concat to merge them together:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)           # per-ID count of each EVENT
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)  # successes per ID
df = pd.concat([df1, df_success], axis=1)                                # align both pieces on the ID index
print(df)
DELETE POST PUT SUCCESS
ID
1 1 0 1 2
2 0 1 1 1
Another solution with value_counts:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
df = pd.concat([df1, df_success], axis=1)
print(df)
DELETE POST PUT SUCCESS
ID
1 1 0 1 2
2 0 1 1 1
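One behavioral difference between the two variants: value_counts only lists IDs that have at least one 'Y', so an ID with no successes becomes NaN after the concat, while the boolean sum in the first solution yields 0. A minimal sketch, assuming the original sample frame df and adding a hypothetical extra row:

# Add a hypothetical ID 3 whose only event failed.
df2 = pd.concat([df, pd.DataFrame({'ID': [3], 'EVENT': ['PUT'], 'SUCCESS': ['N']})],
                ignore_index=True)
print((df2['SUCCESS'] == 'Y').groupby(df2['ID']).sum())     # ID 3 -> 0
print(df2.loc[df2['SUCCESS'] == 'Y', 'ID'].value_counts())  # ID 3 absent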
Finally, it is possible to convert the index to a column and remove the ID columns name with reset_index + rename_axis:
df = df.reset_index().rename_axis(None, axis=1)
print(df)
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
Answer by piRSquared

pandas
pd.get_dummies(df.EVENT) \
    .assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)) \
    .groupby(df.ID).sum().reset_index()
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
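For context, pd.get_dummies(df.EVENT) expands the EVENT column into one indicator column per distinct value (bool columns in newer pandas versions, 0/1 in older ones); summing those indicators per ID gives the event counts. A quick look at the intermediate frame, assuming the original sample data:

print(pd.get_dummies(df.EVENT))  # columns DELETE, POST, PUT, one row per event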
numpy and pandas
f, u = pd.factorize(df.EVENT.values)        # integer codes f and unique event labels u
n = u.size
d = np.eye(n)[f]                            # one-hot encode the event codes
s = (df.SUCCESS.values == 'Y').astype(int)  # 1 for success, 0 otherwise
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
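The np.eye(n)[f] line is the core of this approach: indexing the identity matrix with the factorized codes picks one identity row per event, which is a one-hot encoding. A standalone illustration with assumed codes:

import numpy as np

f = np.array([0, 1, 0, 2])  # e.g. PUT=0, POST=1, DELETE=2
print(np.eye(3)[f])
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]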
Timing

small data
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
1000 loops, best of 3: 1.32 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success], axis=1).reset_index()
100 loops, best of 3: 3.3 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success], axis=1).reset_index()
100 loops, best of 3: 3.28 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 2.62 ms per loop
large data
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=100000),
    EVENT=np.random.choice('PUT POST DELETE'.split(), size=100000),
    SUCCESS=np.random.choice(list('YN'), size=100000)
))
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
100 loops, best of 3: 10.8 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success], axis=1).reset_index()
100 loops, best of 3: 17.7 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success], axis=1).reset_index()
100 loops, best of 3: 17.4 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 16.8 ms per loop
Answer by Golden Lion

Use a pivot_table and a dataframe filter:
df = pd.DataFrame([
    {"ID": 1, "EVENT": "PUT", "SUCCESS": "Y"},
    {"ID": 2, "EVENT": "POST", "SUCCESS": "Y"},
    {"ID": 2, "EVENT": "PUT", "SUCCESS": "N"},
    {"ID": 1, "EVENT": "DELETE", "SUCCESS": "Y"}
])

mask = df['SUCCESS'] == 'Y'  # renamed from filter to avoid shadowing the builtin
event = df[mask].groupby('ID')['EVENT'].size().reset_index()
print(event)

event = df[mask].pivot_table(index='ID', columns='EVENT', values='SUCCESS',
                             aggfunc='count', fill_value=0)
print(event)
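Note that because the filter drops the 'N' rows before pivoting, ID 2's failed PUT is not counted in that table. A minimal sketch, assuming the same df, that builds the full desired output from the unfiltered frame:

# Count every event per ID, then append the per-ID success count.
counts = df.pivot_table(index='ID', columns='EVENT', values='SUCCESS',
                        aggfunc='count', fill_value=0)
counts['SUCCESS'] = df['SUCCESS'].eq('Y').groupby(df['ID']).sum().astype(int)
print(counts.reset_index().rename_axis(None, axis=1))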