Python: How to sum in pandas by unique index in several columns?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/35307732/

Time: 2020-08-19 16:15:46  Source: igfitidea

How to sum in pandas by unique index in several columns?

Tags: python, pandas, sum, aggregate

Asked by ShanZhengYang

I have a pandas DataFrame which details online activities in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the dataframe has around 1.5 million samples. Obviously most users have multiple records.

The four columns are a unique user id ("User_ID"), the date when the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks ("clicks").

The organization of the dataframe is as follows:

User_ID    Registration  Session      clicks
2349876    2012-02-22    2014-04-24   2 
1987293    2011-02-01    2013-05-03   1 
2234214    2012-07-22    2014-01-22   7 
9874452    2010-12-22    2014-08-22   2 
...

(There is also an index above beginning with 0, but one could set User_ID as the index.)

I would like to aggregate the total number of clicks by the user since Registration date. The dataframe (or pandas Series object) would list User_ID and "Total_Number_Clicks".

User_ID    Total_Clicks
2349876    722 
1987293    341
2234214    220 
9874452    1405 
...

How does one do this in pandas? Is this done by .agg()? Each User_ID needs to be summed individually.

As there are 1.5 million records, does this scale?

Answer by Kaushal Kumar Singh

Suppose your dataframe is named df; then do the following:

# as_index=False keeps User_ID as a regular column next to the summed clicks
df.groupby('User_ID', as_index=False)['clicks'].sum()

Answer by jezrael

IIUC, you can use groupby, sum and reset_index:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

If the first column, User_ID, is the index:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

Or:

print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
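
The three variants above can be checked against each other with a minimal, self-contained sketch. It assumes the four sample rows from the question plus one extra row for user 2349876; that extra row is hypothetical and is added only so that at least one user actually has several records to sum.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical repeat
# session for user 2349876 so that summing is visible.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452, 2349876],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22", "2012-02-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22", "2014-05-01"]),
    "clicks": [2, 1, 7, 2, 5],
})

# Variant 1: User_ID is a regular column.
by_column = df.groupby("User_ID")["clicks"].sum().reset_index()

# Variants 2 and 3: User_ID is the index.
indexed = df.set_index("User_ID")
by_level = indexed.groupby(level=0)["clicks"].sum().reset_index()
by_index = indexed.groupby(indexed.index)["clicks"].sum().reset_index()

print(by_column)
print(by_column.equals(by_level) and by_column.equals(by_index))
```

All three calls group on the same keys, so which one to use mostly depends on whether User_ID is stored as a column or as the index.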

EDIT:

As Alexander pointed out, you need to filter the data before groupby if any Session date is earlier than the Registration date for a User_ID:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

I changed the third row of the data for a better sample:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2
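
The filter-then-group steps can also be written as one runnable script. This is a sketch under a few assumptions: the frame is rebuilt by hand from the modified sample, the date columns are parsed with pd.to_datetime so that the comparison is a real date comparison, and the names valid and totals are illustrative only.

```python
import pandas as pd

# Modified sample: the third row has a Session before its Registration.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

# Boolean mask: keep only sessions on or after the registration date.
valid = df["Session"] >= df["Registration"]

# Sum clicks per user over the remaining rows.
totals = df.loc[valid].groupby("User_ID", as_index=False)["clicks"].sum()
print(totals)
```

A user whose rows are all filtered out (here 2234214) does not appear in the result at all, rather than appearing with a total of 0.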

Answer by Alexander

The first thing to do is filter out sessions whose dates precede the registration date, then group on the User_ID and sum.

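# Named aggregation (pandas >= 0.25); the older dict-renaming form
# .agg({'Total_Clicks': np.sum}) raises an error in pandas >= 1.0.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .clicks.agg(Total_Clicks='sum'))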

>>> gb
         Total_Clicks
User_ID              
1987293             1
2234214             7
2349876             2
9874452             2

For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
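
One rough way to check the scaling claim on a given machine is to time the same filter and group-by on synthetic data of the size described in the question. The sketch below is an assumption-laden stand-in, not a benchmark of the asker's data: the random seed, date ranges, and click counts are made up, and only the row and user counts come from the question.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_users = 1_500_000, 50_000  # sizes taken from the question

# Synthetic stand-in data (all values are made up).
user_ids = rng.integers(0, n_users, size=n_rows)
reg_day = rng.integers(0, 1000, size=n_users)  # one registration day per user
origin = np.datetime64("2010-01-01")

df = pd.DataFrame({
    "User_ID": user_ids,
    "Registration": origin + reg_day[user_ids].astype("timedelta64[D]"),
    "Session": origin + rng.integers(0, 2000, size=n_rows).astype("timedelta64[D]"),
    "clicks": rng.integers(1, 10, size=n_rows),
})

start = time.perf_counter()
totals = (df[df["Session"] >= df["Registration"]]
          .groupby("User_ID")["clicks"]
          .agg(Total_Clicks="sum"))
elapsed = time.perf_counter() - start

print(f"{len(df):,} rows -> {len(totals):,} users in {elapsed:.3f} s")
print(f"DataFrame memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```

The timing and memory figures depend on the hardware and pandas version, so none are quoted here; the point is only that the whole operation is a single vectorized pass over the frame.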
