Calculate nunique() for groupby in pandas
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/49297480/
Asked by Slavka
I have a dataframe with columns:
diff - difference between registration date and payment date, in days
country - country of user
user_id
campaign_id - another categorical column, we will use it in groupby
I need to calculate the count of distinct users for every country + campaign_id group, restricted to rows with diff <= n.
For example, for country 'A', campaign 'abc' and diff 7, I need the count of distinct users from country 'A' and campaign 'abc' whose diff is <= 7.
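To make the requirement concrete, here is a minimal sketch on a tiny hand-made frame. The data values and the helper name `count_distinct_users` are invented for illustration and are not part of the question; the column is called `campaign` to match the test data in the question's own code.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'A', 'B'],
    'campaign': ['abc', 'abc', 'abc', 'abc', 'abc'],
    'diff':     [1, 7, 7, 9, 3],
    'user_id':  [10, 10, 11, 12, 10],
})

def count_distinct_users(df, country, campaign, n):
    """Count distinct user_id values in one country/campaign group with diff <= n."""
    mask = (
        (df['country'] == country)
        & (df['campaign'] == campaign)
        & (df['diff'] <= n)
    )
    return df.loc[mask, 'user_id'].nunique()

print(count_distinct_users(toy, 'A', 'abc', 7))  # users 10 and 11 -> 2
print(count_distinct_users(toy, 'A', 'abc', 9))  # users 10, 11 and 12 -> 3
```

The task in the question is to produce this number for every country, campaign and threshold at once, rather than one combination at a time.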
My current solution (below) takes too long to run:
import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000)
})

## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    # keep only the rows at or below the current threshold
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
Maybe there is a better way to do this?
Accepted answer by jezrael
First use a list comprehension with concat and assign to join everything together, then groupby with nunique, including the diff column in the grouping keys. Finally rename the columns and, if necessary, add reindex for a custom column order:
df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
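The two steps of this answer can be traced on a small frame. This is only a sketch: the toy data and the names `stacked` and `counts` are made up for illustration, while `concat`, `assign`, `groupby` and `nunique` are used the same way as in the answer.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A'],
    'campaign': ['abc', 'abc', 'abc'],
    'diff':     [1, 2, 2],
    'user_id':  [10, 10, 11],
})

# Step 1: for each threshold x, keep the rows with diff <= x
# and relabel them with diff = x, then stack all the pieces.
stacked = pd.concat(
    [toy.loc[toy['diff'] <= x].assign(diff=x) for x in toy['diff'].unique()]
)
# Threshold 1 contributes 1 row, threshold 2 contributes all 3 rows.
print(len(stacked))  # 4

# Step 2: a single groupby over the stacked frame counts distinct users
# per threshold and group.
counts = (stacked.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
                 .nunique()
                 .reset_index()
                 .rename(columns={'user_id': 'unique_ppl'}))
print(counts)
```

One trade-off to keep in mind: the stacked frame repeats each row once for every threshold it satisfies, so it can grow to roughly the number of distinct thresholds times the original row count.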
Answer by jpp
One alternative below, but @jezrael's solution is the fastest of the three in this benchmark.
Performance benchmarking
%timeit original(df) # 149ms
%timeit jp(df) # 81ms
%timeit jez(df) # 47ms

def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
    return result_df

def jp(df):
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)
    # DataFrame.append was removed in pandas 2.0; a single concat builds the same result
    result_df = pd.concat(lst, ignore_index=True, axis=0)
    return result_df

def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
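
The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard library `timeit` module, along with a check that two of the approaches agree. The function names `loop_concat`, `stacked_groupby`, `normalized` and `time_ms`, the seed and the repetition count are assumptions made for this sketch; timings depend on the machine and the pandas version, so the figures quoted above will not be reproduced exactly.

```python
import timeit

import numpy as np
import pandas as pd

def loop_concat(df):
    """Per-threshold groupby, collected in a list and concatenated once."""
    parts = []
    for d in df['diff'].unique():
        part = (df.loc[df['diff'] <= d]
                  .groupby(['country', 'campaign'], as_index=False)
                  .agg({'user_id': 'nunique'}))
        part['diff'] = d
        part.columns = ['country', 'campaign', 'unique_ppl', 'diff']
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

def stacked_groupby(df):
    """Stack all thresholds first, then run a single groupby."""
    df1 = pd.concat([df.loc[df['diff'] <= x].assign(diff=x) for x in df['diff'].unique()])
    return (df1.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
               .nunique()
               .reset_index()
               .rename(columns={'user_id': 'unique_ppl'})
               .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))

def normalized(result):
    """Sort rows so two results can be compared regardless of row order."""
    return result.sort_values(['diff', 'country', 'campaign']).reset_index(drop=True)

def time_ms(func, df, number=5):
    """Average wall-clock time of func(df) in milliseconds over `number` runs."""
    return timeit.timeit(lambda: func(df), number=number) / number * 1000

# Seeded test data, similar in shape to the question's random frame.
rng = np.random.default_rng(0)
n_rows = 10000
df = pd.DataFrame({
    'country': rng.choice(['A', 'B', 'C', 'D'], n_rows),
    'campaign': rng.choice(['camp1', 'camp2', 'camp3'], n_rows),
    'diff': rng.integers(0, 10, n_rows),
    'user_id': rng.integers(0, 1000, n_rows),
})

# Both approaches should yield the same rows once sorted.
pd.testing.assert_frame_equal(
    normalized(loop_concat(df)), normalized(stacked_groupby(df)), check_dtype=False
)

for func in (loop_concat, stacked_groupby):
    print(f"{func.__name__}: {time_ms(func, df):.1f} ms")
```

Comparing sorted results before timing guards against benchmarking two functions that do not actually compute the same thing.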