Pandas combine two group by's, filter and merge the groups (counts)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/44325204/



Tags: python, pandas, pandas-groupby

Asked by Sheepy

I have a dataframe for which I need to combine two different groupbys, with one of them filtered.

 ID     EVENT      SUCCESS
 1       PUT          Y
 2       POST         Y
 2       PUT          N
 1       DELETE       Y 

The table below shows how I would like the data to look. The first step is grouping the 'EVENT' counts; the second is counting the number of successes ('Y') per ID.

ID  PUT   POST  DELETE SUCCESS
 1   1     0       1      2
 2   1     1       0      1

I've tried a few techniques, and the closest I've found is two separate methods, which yield the following.

group_df = df.groupby(['ID', 'EVENT'])
count_group_df = group_df.size().unstack()

This yields the following for the 'EVENT' counts:

ID  PUT   POST  DELETE
 1   1     0       1      
 2   1     1       0      

For the successes with the filter, I don't know whether I can join this to the first set on 'ID'.

 df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']]
 count_group_df_2 = df_success.groupby('ID').size()


ID  SUCCESS
1      2
2      1

I need to combine these somehow.
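For reference, the two intermediate results can indeed be joined on 'ID' — a minimal sketch using DataFrame.join, which aligns on the shared ID index, built on the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

# Event counts per ID (as in the first attempt above)
count_group_df = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)

# Success counts per ID, as a named Series indexed by ID
success = df.loc[df['SUCCESS'] == 'Y'].groupby('ID').size().rename('SUCCESS')

# join aligns on the ID index; fillna covers IDs with no successes at all
combined = count_group_df.join(success).fillna(0).astype(int).reset_index()
print(combined)
```

The fillna(0) step only matters if some ID never has a success; with the sample data every ID has at least one.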

Additionally, I'd also like to merge the counts of two of the 'EVENT's, for example PUT and POST, into one column.
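This last part isn't addressed by the answers, but one way — a sketch, where the combined column name PUT_POST is my own choice — is to sum the two columns after the unstack and drop the originals:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

counts = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)

# Collapse PUT and POST into a single column, then drop the originals
counts['PUT_POST'] = counts['PUT'] + counts['POST']
counts = counts.drop(columns=['PUT', 'POST'])
print(counts)
```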

Accepted answer by jezrael

Use concat to merge them together:

df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
df = pd.concat([df1, df_success],axis=1)
print (df)
    DELETE  POST  PUT  SUCCESS
ID                            
1        1     0    1        2
2        0     1    1        1

Another solution with value_counts:

df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
df = pd.concat([df1, df_success],axis=1)
print (df)
    DELETE  POST  PUT  SUCCESS
ID                            
1        1     0    1        2
2        0     1    1        1

Finally, you can convert the index to a column and remove the column-axis name ID with reset_index + rename_axis:

df = df.reset_index().rename_axis(None, axis=1)
print (df)
   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1

Answer by piRSquared

pandas

pd.get_dummies(df.EVENT) \
  .assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)) \
  .groupby(df.ID).sum().reset_index()

   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1

numpy and pandas

import numpy as np

f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()

   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1


Timing

small data

%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
1000 loops, best of 3: 1.32 ms per loop

%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.3 ms per loop

%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.28 ms per loop

%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 2.62 ms per loop

large data

df = pd.DataFrame(dict(
        ID=np.random.randint(100, size=100000),
        EVENT=np.random.choice('PUT POST DELETE'.split(), size=100000),
        SUCCESS=np.random.choice(list('YN'), size=100000)
    ))

%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
    np.column_stack([d, s]),
    df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
100 loops, best of 3: 10.8 ms per loop

%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.7 ms per loop

%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.4 ms per loop

%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 16.8 ms per loop

Answer by Golden Lion

Use a pivot_table and a dataframe filter:

 df = pd.DataFrame([
     {"ID": 1, "EVENT": "PUT",    "SUCCESS": "Y"},
     {"ID": 2, "EVENT": "POST",   "SUCCESS": "Y"},
     {"ID": 2, "EVENT": "PUT",    "SUCCESS": "N"},
     {"ID": 1, "EVENT": "DELETE", "SUCCESS": "Y"},
 ])
 mask = df['SUCCESS'] == 'Y'

 event = df[mask].groupby('ID')['EVENT'].size().reset_index()
 print(event)

 event = df[mask].pivot_table(index='ID', columns='EVENT',
                              values='SUCCESS', aggfunc='count', fill_value=0)
 print(event)
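A related alternative not shown in the answers above — a sketch using pd.crosstab, which tabulates the ID/EVENT counts directly, with the success counts attached via index alignment:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

# crosstab counts each ID/EVENT pair; assign aligns the success
# counts on the shared ID index before resetting it to a column
success = df['SUCCESS'].eq('Y').groupby(df['ID']).sum()
out = pd.crosstab(df['ID'], df['EVENT']).assign(SUCCESS=success).reset_index()
print(out)
```

Unlike the pivot_table answer above, this counts all events per ID, not just the successful ones, which matches the table the question asks for.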