Pandas DataFrame: find the max after Groupby on two columns and get counts
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original link and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/45161355/
Pandas DataFrame find the max after Groupby two columns and get counts
Asked by weigod
I have a dataframe df as follows:
userId pageId tag
0 3122471 e852 18
1 3122471 f3e2 18
2 3122471 7e93 18
3 3122471 2768 6
4 3122471 53d9 6
5 3122471 06d7 15
6 3122471 e31c 15
7 3122471 c6f3 2
8 1234123 fjwe 1
9 1234123 eiae 4
10 1234123 ieha 4
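For reference, the frame above can be rebuilt with a snippet like this (values copied from the table; not part of the original question):
import pandas as pd
# Rebuild the question's example frame
df = pd.DataFrame({
    'userId': [3122471] * 8 + [1234123] * 3,
    'pageId': ['e852', 'f3e2', '7e93', '2768', '53d9', '06d7', 'e31c', 'c6f3',
               'fjwe', 'eiae', 'ieha'],
    'tag': [18, 18, 18, 6, 6, 15, 15, 2, 1, 4, 4],
})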
After using df.groupby(['userId', 'tag'])['pageId'].count() to group the data by userId and tag, I get:
userId tag
3122471 2 1
6 2
15 2
18 3
1234123 1 1
4 2
Now I want to find, for each user, the tag that occurs most often, like this:
userId tag
3122471 18
1234123 4
(Note: if there are multiple tags that have the same count, I want to use a function my_rule to determine which one to show.)
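One possible way to plug in such a rule, sketched on the per-user counts (my_rule below is only a hypothetical placeholder; substitute whatever tie-breaking logic you need):
def my_rule(tags):
    # hypothetical tie-breaker: e.g. prefer the smallest tag among those tied
    return min(tags)

counts = df.groupby(['userId', 'tag'])['pageId'].count()

def pick_tag(s):
    # s holds one user's counts, indexed by (userId, tag)
    tied = s[s == s.max()].index.get_level_values('tag')
    return my_rule(tied)

result = counts.groupby(level='userId').apply(pick_tag)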
Accepted answer by Zero
You could work on aggregated data.
In [387]: dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()
In [388]: dff
Out[388]:
userId tag pageId
0 1234123 1 1
1 1234123 4 2
2 3122471 2 1
3 3122471 6 2
4 3122471 15 2
5 3122471 18 3
In [389]: dff.groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()])
Out[389]:
userId
1234123 4
3122471 18
dtype: int64
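Note that idxmax() returns the label of the first maximum it meets, so ties between tags are resolved by position in the aggregated frame; a custom rule like my_rule from the question would have to be applied on top, along the lines of the sketch shown under the question.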
Answer by VinceP
Group the original dataframe by userId:
df.groupby('userId').max()['tag']
or
df.groupby('userId', as_index=False)['tag'].max()
Note that the second solution is a factor of two faster
%timeit df.groupby('userId').max()['tag']
# 100 loops, best of 3: 5.69 ms per loop
%timeit df.groupby('userId', as_index=False)['tag'].max()
# 100 loops, best of 3: 2.43 ms per loop
Answer by jezrael
I think you need DataFrameGroupBy.size with DataFrameGroupBy.idxmax, but first reset_index:
See also: What is the difference between size and count in pandas?
df = df.groupby(['userId', 'tag'])['pageId'].size()
df = (df.reset_index(level='userId')
        .groupby('userId')['pageId'].idxmax()
        .reset_index(name='tag'))
print (df)
userId tag
0 1234123 4
1 3122471 18
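As a quick illustration of the size/count difference mentioned above (a toy example, not from the original answer): size() counts every row in a group, while count() skips NaN values.
import pandas as pd
import numpy as np

d = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(d.groupby('g')['v'].size())   # a -> 2, b -> 1 (rows per group)
print(d.groupby('g')['v'].count())  # a -> 1, b -> 1 (non-NaN values only)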
Timings:
np.random.seed(123)
N = 100000
df = pd.DataFrame(np.random.randint(1000, size=(N, 3)), columns= ['userId','pageId','tag'])
#print (df)
In [188]: %timeit (df.groupby(['userId', 'tag'], as_index=False)['pageId'].count().groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()]))
10 loops, best of 3: 180 ms per loop
In [189]: %timeit (df.groupby(['userId', 'tag'])['pageId'].size().reset_index(level='userId').groupby('userId')['pageId'].idxmax())
10 loops, best of 3: 103 ms per loop
VinceP's solution is wrong (df.groupby('userId').max()['tag'] returns each user's largest tag value rather than the most frequent tag; it only matches the desired output here by coincidence), so it is not included in the timings.
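A small counter-example (not from the original thread) showing where the max-based approach breaks: the most frequent tag for the user below is 1, but max() reports 9.
import pandas as pd

# tag 1 appears twice, tag 9 only once
d = pd.DataFrame({'userId': [1, 1, 1],
                  'pageId': ['a', 'b', 'c'],
                  'tag': [1, 1, 9]})
print(d.groupby('userId')['tag'].max())  # 9, not the most frequent tag (1)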