Original question: http://stackoverflow.com/questions/39634175/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: groupby with condition
Asked by Petr Petrov
I have a dataframe:
ID,used_at,active_seconds,subdomain,visiting,category
123,2016-02-05 19:39:21,2,yandex.ru,2,Computers
123,2016-02-05 19:43:01,1,mail.yandex.ru,2,Computers
123,2016-02-05 19:43:13,6,mail.yandex.ru,2,Computers
234,2016-02-05 19:46:09,16,avito.ru,2,Automobiles
234,2016-02-05 19:48:36,21,avito.ru,2,Automobiles
345,2016-02-05 19:48:59,58,avito.ru,2,Automobiles
345,2016-02-05 19:51:21,4,avito.ru,2,Automobiles
345,2016-02-05 19:58:55,4,disk.yandex.ru,2,Computers
345,2016-02-05 19:59:21,2,mail.ru,2,Computers
456,2016-02-05 19:59:27,2,mail.ru,2,Computers
456,2016-02-05 20:02:15,18,avito.ru,2,Automobiles
456,2016-02-05 20:04:55,8,avito.ru,2,Automobiles
456,2016-02-05 20:07:21,24,avito.ru,2,Automobiles
567,2016-02-05 20:09:03,58,avito.ru,2,Automobiles
567,2016-02-05 20:10:01,26,avito.ru,2,Automobiles
567,2016-02-05 20:11:51,30,disk.yandex.ru,2,Computers
I need to do:
group = df.groupby(['category']).agg({'active_seconds': sum}).rename(columns={'active_seconds': 'count_sec_target'}).reset_index()
but I want to add a condition based on
df.groupby(['category'])['ID'].count()
and if the count for a category is less than 5, I want to drop that category. I don't know how to write this condition there.
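For anyone who wants to reproduce the question, the sample can be loaded from text and the per-category row counts inspected. This is only a sketch: the variable names `csv_text` and `rows_per_category` are illustrative and not part of the original post.

```python
import io

import pandas as pd

# Sample rows copied verbatim from the question
csv_text = """ID,used_at,active_seconds,subdomain,visiting,category
123,2016-02-05 19:39:21,2,yandex.ru,2,Computers
123,2016-02-05 19:43:01,1,mail.yandex.ru,2,Computers
123,2016-02-05 19:43:13,6,mail.yandex.ru,2,Computers
234,2016-02-05 19:46:09,16,avito.ru,2,Automobiles
234,2016-02-05 19:48:36,21,avito.ru,2,Automobiles
345,2016-02-05 19:48:59,58,avito.ru,2,Automobiles
345,2016-02-05 19:51:21,4,avito.ru,2,Automobiles
345,2016-02-05 19:58:55,4,disk.yandex.ru,2,Computers
345,2016-02-05 19:59:21,2,mail.ru,2,Computers
456,2016-02-05 19:59:27,2,mail.ru,2,Computers
456,2016-02-05 20:02:15,18,avito.ru,2,Automobiles
456,2016-02-05 20:04:55,8,avito.ru,2,Automobiles
456,2016-02-05 20:07:21,24,avito.ru,2,Automobiles
567,2016-02-05 20:09:03,58,avito.ru,2,Automobiles
567,2016-02-05 20:10:01,26,avito.ru,2,Automobiles
567,2016-02-05 20:11:51,30,disk.yandex.ru,2,Computers
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['used_at'])

# Number of rows per category: this is the count the condition refers to
rows_per_category = df.groupby('category')['ID'].count()
print(rows_per_category)
# Automobiles has 9 rows and Computers has 7
```

With this sample both categories have at least 5 rows, so a threshold of 5 would not drop either of them.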
Answered by jezrael
As EdChum commented, you can use filter. You can also simplify the aggregation by calling sum directly:
df = df.groupby(['category']).filter(lambda x: len(x) >= 5)
# Parentheses let the method chain span several lines
group = (df.groupby(['category'], as_index=False)['active_seconds']
           .sum()
           .rename(columns={'active_seconds': 'count_sec_target'}))
print(group)
      category  count_sec_target
0  Automobiles               233
1    Computers                47
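The filter call keeps every row whose group has at least the required number of rows. A possible alternative, not part of the original answer, is a boolean mask built with transform('size'). The sketch below rebuilds only the two relevant columns from the question's sample. It uses a hypothetical threshold `min_rows = 8` (the question uses 5) so that one category is actually dropped.

```python
import pandas as pd

# category and active_seconds copied from the question's sample, in row order
df = pd.DataFrame({
    'category': (['Computers'] * 3 + ['Automobiles'] * 4 + ['Computers'] * 3
                 + ['Automobiles'] * 5 + ['Computers']),
    'active_seconds': [2, 1, 6, 16, 21, 58, 4, 4, 2, 2, 18, 8, 24, 58, 26, 30],
})

min_rows = 8  # hypothetical threshold for illustration; the question uses 5

# For every row, the size of the group that row belongs to
sizes = df.groupby('category')['category'].transform('size')

# Keep only rows from sufficiently large groups (Computers has 7 rows, so it goes)
kept = df[sizes >= min_rows]

group = (kept.groupby('category', as_index=False)['active_seconds']
             .sum()
             .rename(columns={'active_seconds': 'count_sec_target'}))
print(group)
# Only Automobiles remains, with count_sec_target equal to 233
```

The mask selects the same rows as filter(lambda x: len(x) >= min_rows). It avoids calling a Python function once per group, which may matter when there are many groups.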
Another solution uses reset_index with its name parameter:
df = df.groupby(['category']).filter(lambda x: len(x) >= 5)
group = df.groupby(['category'])['active_seconds'].sum().reset_index(name='count_sec_target')
print(group)
      category  count_sec_target
0  Automobiles               233
1    Computers                47
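Both solutions above group the data twice, once to filter and once to sum. If a single pass is preferred, one option is to compute the row count and the sum together and then apply the condition to the aggregated result. This is a sketch rather than part of the original answer. It assumes a pandas version that supports named aggregation (0.25 or later), and `n_rows` is an illustrative column name.

```python
import pandas as pd

# category and active_seconds copied from the question's sample, in row order
df = pd.DataFrame({
    'category': (['Computers'] * 3 + ['Automobiles'] * 4 + ['Computers'] * 3
                 + ['Automobiles'] * 5 + ['Computers']),
    'active_seconds': [2, 1, 6, 16, 21, 58, 4, 4, 2, 2, 18, 8, 24, 58, 26, 30],
})

# One groupby pass: row count and total seconds per category
stats = df.groupby('category').agg(
    n_rows=('active_seconds', 'size'),
    count_sec_target=('active_seconds', 'sum'),
)

# Apply the condition to the aggregated table, then drop the helper column
group = stats.loc[stats['n_rows'] >= 5, ['count_sec_target']].reset_index()
print(group)
```

With the question's sample both categories pass the threshold of 5, so the result matches the output shown above: 233 for Automobiles and 47 for Computers.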