原文地址: http://stackoverflow.com/questions/35307732/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to sum in pandas by unique index in several columns?
Asked by ShanZhengYang
I have a pandas DataFrame which details online activities in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million samples. Obviously most users have multiple records.
The four columns are a unique user id ("User_ID"), the date when the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks ("clicks").
The organization of the dataframe is as follows:
User_ID Registration Session clicks
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
...
(There is also an index above beginning with 0, but one could set User_ID as the index.)
I would like to aggregate the total number of clicks per user since the Registration date. The resulting DataFrame (or pandas Series) would list User_ID and "Total_Number_Clicks".
User_ID Total_Clicks
2349876 722
1987293 341
2234214 220
9874452 1405
...
How does one do this in pandas? Is this done by .agg()? Each User_ID needs to be summed individually.
As there are 1.5 million records, does this scale?
Answered by Kaushal Kumar Singh
Suppose your DataFrame is named df; then do the following:
# as_index=False keeps User_ID as a column; selecting 'clicks' avoids summing the date columns
df.groupby('User_ID', as_index=False)['clicks'].sum()
Answered by jezrael
IIUC you can use groupby, sum and reset_index:
print(df)
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print(df.groupby('User_ID')['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
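For readers who want to run this end to end, here is a minimal sketch that rebuilds the four sample rows from the question and applies the same groupby, sum and reset_index chain. The final rename to Total_Clicks is an addition to match the column name the question asked for; it is not part of the answer above.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

# Sum clicks per user; reset_index turns User_ID back into a regular column.
total = (
    df.groupby("User_ID")["clicks"]
      .sum()
      .reset_index()
      .rename(columns={"clicks": "Total_Clicks"})  # assumed output name
)
print(total)
```

Note that groupby sorts the group keys by default, which is why the users come out in ascending User_ID order rather than in their original row order.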
If the first column User_ID is the index:
print(df)
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
print(df.groupby(level=0)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
Or:
print(df.groupby(df.index)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
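A small sketch checking that the index-based spellings give the same result. It assumes the index is named User_ID and keeps only the clicks column for brevity. The third variant, grouping by the index level's name, is an addition not mentioned in the answer.

```python
import pandas as pd

# Sample data with User_ID as the index (date columns omitted for brevity).
df = pd.DataFrame(
    {"clicks": [2, 1, 7, 2]},
    index=pd.Index([2349876, 1987293, 2234214, 9874452], name="User_ID"),
)

by_level = df.groupby(level=0)["clicks"].sum()    # group on the first index level
by_index = df.groupby(df.index)["clicks"].sum()   # group on the index object itself
by_name = df.groupby("User_ID")["clicks"].sum()   # group on the index level's name

# All three should produce the same per-user sums.
print(by_level.equals(by_index) and by_level.equals(by_name))
```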
EDIT:
As Alexander pointed out, you need to filter the data before groupby if any Session date is earlier than the Registration date for a User_ID:
print(df)
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
I changed the third row of data to give a better sample:
print(df)
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2012-01-22 7
9874452 2010-12-22 2014-08-22 2
print(df.Session >= df.Registration)
User_ID
2349876 True
1987293 True
2234214 False
9874452 True
dtype: bool
print(df[df.Session >= df.Registration])
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
9874452 2010-12-22 2014-08-22 2
df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2349876 2
2 9874452 2
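The sample rows have only one record per user, so the sums above are trivial. The sketch below uses made-up data (hypothetical user ids 1 and 2, invented dates and click counts) in which users repeat and one session is dated before registration, to show the filter and the per-user sum working together.

```python
import pandas as pd

# Made-up rows: user 1 has one session dated before registration (100 clicks).
df = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 2],
    "Registration": pd.to_datetime(["2012-01-01"] * 3 + ["2013-06-01"] * 2),
    "Session": pd.to_datetime(
        ["2012-03-01", "2011-12-31", "2012-05-01", "2013-06-01", "2013-07-01"]),
    "clicks": [5, 100, 3, 4, 6],
})

# Keep only sessions on or after the registration date, then sum per user.
valid = df[df["Session"] >= df["Registration"]]
totals = valid.groupby("User_ID")["clicks"].sum().reset_index()
print(totals)
```

The 100-click row is dropped by the boolean mask, so user 1 totals 8 clicks and user 2 totals 10.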
Answered by Alexander
The first thing to do is filter out sessions whose dates precede the registration date, then group on User_ID and sum.
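# Named aggregation (pandas >= 0.25); the older dict-renaming form
# .agg({'Total_Clicks': np.sum}) is no longer supported on a grouped Series.
gb = (df[df.Session >= df.Registration]
      .groupby('User_ID')
      .clicks.agg(Total_Clicks='sum'))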
>>> gb
Total_Clicks
User_ID
1987293 1
2234214 7
2349876 2
9874452 2
For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
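One way to check this on your own machine is to time the aggregation on synthetic data of the same shape. The sketch below is an assumption-laden stand-in, not the asker's data: it generates 1.5 million random rows spread over up to 50,000 user ids, with random click counts, and omits the date columns. Timings and memory use will vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: sizes taken from the question,
# values are random and purely illustrative.
rng = np.random.default_rng(0)
n_rows, n_users = 1_500_000, 50_000
df = pd.DataFrame({
    "User_ID": rng.integers(0, n_users, size=n_rows),
    "clicks": rng.integers(1, 10, size=n_rows),
})

start = time.perf_counter()
totals = df.groupby("User_ID")["clicks"].sum()
elapsed = time.perf_counter() - start

print(f"{len(totals)} users aggregated in {elapsed:.3f} s")
print(f"frame size: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```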