Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/49297480/

Time: 2020-09-14 05:19:51  Source: igfitidea

Calculate nunique() for groupby in pandas

python pandas pandas-groupby

Asked by Slavka

I have a dataframe with columns:

  1. diff - difference between registration date and payment date, in days
  2. country - country of the user
  3. user_id
  4. campaign_id - another categorical column, used in the groupby (named campaign in the code below)

I need to count the distinct users in every country + campaign_id group that have diff <= n. For example, for country 'A', campaign 'abc' and diff 7, I need the number of distinct users from country 'A' and campaign 'abc' with diff <= 7.
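
To make the requirement concrete, here is a minimal sketch for a single group and a single threshold. The toy rows, the name `toy`, and the threshold `n = 7` are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical toy data: one country/campaign group, three distinct users.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'A'],
    'campaign': ['abc', 'abc', 'abc', 'abc'],
    'diff':     [1, 3, 7, 9],
    'user_id':  [10, 10, 11, 12],
})

n = 7  # threshold: keep only rows with diff <= n

# Filter first, then count distinct users per (country, campaign) group.
count_n = (toy.loc[toy['diff'] <= n]
              .groupby(['country', 'campaign'])['user_id']
              .nunique())
print(count_n)
```

User 10 appears twice but is counted once, and user 12 (diff 9) is excluded, so the group ('A', 'abc') gets a count of 2 for n = 7. The full task repeats this for every distinct diff value.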

My current solution (below) takes too long to run:

import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
        'country':np.random.choice(['A', 'B', 'C', 'D'], 10000),
        'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
        'diff':np.random.choice(range(10), 10000),
        'user_id': np.random.choice(range(1000), 10000)
        })
## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

Maybe there is a better way to do this?

Accepted answer by jezrael

First use a list comprehension with concat and assign to join all the filtered frames together, then use groupby with nunique, adding the diff column to the grouping keys. Finally rename the columns and, if necessary, add reindex for a custom column order:

df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
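
The two steps can be traced on a small frame. This is only an illustrative sketch: the toy rows and the names `toy`, `expanded` and `counts` are assumptions, while the calls mirror the answer above.

```python
import pandas as pd

# Hypothetical toy data with three distinct diff values: 1, 3 and 5.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'B'],
    'campaign': ['c1', 'c1', 'c1', 'c1'],
    'diff':     [1, 3, 3, 5],
    'user_id':  [10, 10, 11, 10],
})

# Step 1: for each threshold x, copy the rows with diff <= x and
# overwrite their diff with x, then stack all copies into one frame.
expanded = pd.concat(
    [toy.loc[toy['diff'] <= x].assign(diff=x) for x in toy['diff'].unique()]
)

# Step 2: a single groupby over (diff, country, campaign) now counts the
# distinct users at or below each threshold.
counts = (expanded.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
                  .nunique()
                  .reset_index()
                  .rename(columns={'user_id': 'unique_ppl'}))
print(counts)
```

Here `expanded` holds 1 + 3 + 4 = 8 rows (one copy of the qualifying rows per threshold), so the intermediate frame grows with the number of distinct diff values. Group B only has a row at diff 5, so it appears in `counts` only for that threshold.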

Answer by jpp

One alternative is below, but @jezrael's solution is optimal.

Performance benchmarking

%timeit original(df)  # 149ms
%timeit jp(df)        # 81ms
%timeit jez(df)       # 47ms

def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

    return result_df

def jp(df):
    # Collect the per-threshold frames in a list and concatenate once at the end.
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)

    # DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame.
    result_df = pd.concat(lst, ignore_index=True, axis=0)

    return result_df

def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
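
The `%timeit` lines above are IPython magics. As a hedged sketch, the standard-library `timeit` module can run a similar comparison from a plain script. The block also tries one further idea that is not from either answer: record the smallest diff at which each user appears in a group, count those first appearances per diff, and take a running total across thresholds. The function names `concat_nunique`, `cumulative_first_seen` and `normalized` are hypothetical, and `concat_nunique` simply restates the accepted answer so the block runs on its own.

```python
import timeit

import numpy as np
import pandas as pd


def concat_nunique(df):
    """The accepted answer's approach, restated so this block is self-contained."""
    expanded = pd.concat(
        [df.loc[df['diff'] <= x].assign(diff=x) for x in df['diff'].unique()]
    )
    return (expanded.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
                    .nunique()
                    .reset_index()
                    .rename(columns={'user_id': 'unique_ppl'})
                    .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))


def cumulative_first_seen(df):
    """Hypothetical alternative (not from the answers): count each user once,
    at the smallest diff where they appear, then accumulate over thresholds."""
    keys = ['country', 'campaign']
    thresholds = np.sort(df['diff'].unique())
    # Smallest diff per (country, campaign, user_id).
    first = df.groupby(keys + ['user_id'], as_index=False)['diff'].min()
    # Users first seen at each diff, one row per group, one column per diff.
    new_users = first.groupby(keys + ['diff']).size().unstack('diff', fill_value=0)
    # Running total across thresholds = distinct users with diff <= threshold.
    running = new_users.reindex(columns=thresholds, fill_value=0).cumsum(axis=1)
    out = running.reset_index().melt(id_vars=keys, var_name='diff',
                                     value_name='unique_ppl')
    out['diff'] = out['diff'].astype(int)
    # Drop thresholds below a group's first row, matching the concat approach.
    out = out[out['unique_ppl'] > 0]
    return out.reindex(columns=['country', 'campaign', 'unique_ppl', 'diff'])


def normalized(frame):
    """Sort rows and align dtypes so two result frames can be compared."""
    return (frame.sort_values(['country', 'campaign', 'diff'])
                 .reset_index(drop=True)
                 .astype({'unique_ppl': 'int64', 'diff': 'int64'}))


# Test data shaped like the question's, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
n = 10000
df = pd.DataFrame({
    'country': rng.choice(['A', 'B', 'C', 'D'], n),
    'campaign': rng.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], n),
    'diff': rng.integers(0, 10, n),
    'user_id': rng.integers(0, 1000, n),
})

same = normalized(concat_nunique(df)).equals(normalized(cumulative_first_seen(df)))
print('results match:', same)

for fn in (concat_nunique, cumulative_first_seen):
    best = min(timeit.repeat(lambda: fn(df), number=3, repeat=3)) / 3
    print(f'{fn.__name__}: {best * 1000:.1f} ms per call')
```

The running-total version avoids copying the frame once per threshold, which may matter as the number of distinct diff values grows. Timings depend on the data and the pandas version, so it is worth measuring on your own frame before choosing either approach.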