Pandas aggregate count distinct
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18554920/
Asked by dave
Let's say I have a log of user activity and I want to generate a report of total duration and the number of unique users per day.
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
                   'user_id': ['0001', '0001', '0002', '0002', '0002'],
                   'duration': [30, 15, 20, 15, 30]})
Aggregating duration is pretty straightforward:
group = df.groupby('date')
agg = group.aggregate({'duration': np.sum})
agg
duration
date
2013-04-01 65
2013-04-02 45
What I'd like to do is sum the duration and count distincts at the same time, but I can't seem to find an equivalent for count_distinct:
agg = group.aggregate({ 'duration': np.sum, 'user_id': count_distinct})
This works, but surely there's a better way, no?
group = df.groupby('date')
agg = group.aggregate({'duration': np.sum})
agg['uv'] = df.groupby('date').user_id.nunique()
agg
duration uv
date
2013-04-01 65 2
2013-04-02 45 1
I'm thinking I just need to provide a function that returns the count of distinct items of a Series object to the aggregate function, but I don't have a lot of exposure to the various libraries at my disposal. Also, it seems that the groupby object already knows this information, so wouldn't I just be duplicating effort?
Accepted answer by DSM
How about either of:
>>> df
date duration user_id
0 2013-04-01 30 0001
1 2013-04-01 15 0001
2 2013-04-01 20 0002
3 2013-04-02 15 0002
4 2013-04-02 30 0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
duration user_id
date
2013-04-01 65 2
2013-04-02 45 1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
duration user_id
date
2013-04-01 65 2
2013-04-02 45 1
Answered by Ricky McMaster
Since pandas 0.20.0, 'nunique' is accepted as an aggregation string by .agg(), so:
df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
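Since pandas 0.25 you can also use named aggregation, which lets you rename the output columns in the same call (the `uv` column name below is just for illustration, matching the questioner's workaround):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
    'user_id': ['0001', '0001', '0002', '0002', '0002'],
    'duration': [30, 15, 20, 15, 30],
})

# Named aggregation (pandas >= 0.25): output column = (input column, aggfunc)
agg = df.groupby('date').agg(
    duration=('duration', 'sum'),
    uv=('user_id', 'nunique'),
)
print(agg)
```

This produces the same numbers as the two-step approach in the question, with `duration` and `uv` computed in a single pass over the groups.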
Answered by user6903745
Just adding to the answers already given: the solution using the string "nunique" seems much faster. Tested here on a ~21M-row dataframe, grouped down to ~2M groups:
%time _=g.agg({"id": lambda x: x.nunique()})
CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
Wall time: 3min 20s
%time _=g.agg({"id": pd.Series.nunique})
CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
Wall time: 3min 18s
%time _=g.agg({"id": "nunique"})
CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
Wall time: 24.4 s
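A minimal sketch of how one might reproduce a comparable comparison on synthetic data (sizes are scaled down so it runs quickly; the column names and absolute timings here are illustrative, not the original ~21M-row benchmark):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in: many rows, far fewer groups
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'key': rng.integers(0, 1_000, size=100_000),
    'id': rng.integers(0, 10_000, size=100_000),
})
g = df.groupby('key')

fast = lambda: g.agg({'id': 'nunique'})           # string path
slow = lambda: g.agg({'id': lambda x: x.nunique()})  # per-group Python lambda

# Both spellings agree on the result; only the speed differs
assert fast().equals(slow())

print('string :', timeit.timeit(fast, number=3))
print('lambda :', timeit.timeit(slow, number=3))
```

The string form is faster because pandas dispatches it to an optimized cythonized `nunique` implementation, whereas the lambda is called once per group in Python.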