Pandas DataFrame: find the max after Groupby on two columns and get counts
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original link and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/45161355/
Pandas DataFrame find the max after Groupby two columns and get counts
Asked by weigod
I have a dataframe df as follows:
userId pageId tag
0 3122471 e852 18
1 3122471 f3e2 18
2 3122471 7e93 18
3 3122471 2768 6
4 3122471 53d9 6
5 3122471 06d7 15
6 3122471 e31c 15
7 3122471 c6f3 2
8 1234123 fjwe 1
9 1234123 eiae 4
10 1234123 ieha 4
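For reference, the frame above can be rebuilt with a snippet like this (values copied from the table; not part of the original question):
import pandas as pd
# Rebuild the question's example frame
df = pd.DataFrame({
    'userId': [3122471] * 8 + [1234123] * 3,
    'pageId': ['e852', 'f3e2', '7e93', '2768', '53d9', '06d7', 'e31c', 'c6f3',
               'fjwe', 'eiae', 'ieha'],
    'tag': [18, 18, 18, 6, 6, 15, 15, 2, 1, 4, 4],
})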
After using df.groupby(['userId', 'tag'])['pageId'].count() to group the data by userId and tag, I get:
userId tag
3122471 2 1
6 2
15 2
18 3
1234123 1 1
4 2
Now I want to find, for each user, the tag that occurs most often, like this:
userId tag
3122471 18
1234123 4
(Note: if there are multiple tags that have the same count, I want to use a function my_rule to determine which one to show.)
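One possible way to plug in such a rule, sketched on the per-user counts (my_rule below is only a hypothetical placeholder; substitute whatever tie-breaking logic you need):
def my_rule(tags):
    # hypothetical tie-breaker: e.g. prefer the smallest tag among those tied
    return min(tags)

counts = df.groupby(['userId', 'tag'])['pageId'].count()

def pick_tag(s):
    # s holds one user's counts, indexed by (userId, tag)
    tied = s[s == s.max()].index.get_level_values('tag')
    return my_rule(tied)

result = counts.groupby(level='userId').apply(pick_tag)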
Accepted answer by Zero
You could work on aggregated data.
In [387]: dff = df.groupby(['userId', 'tag'], as_index=False)['pageId'].count()
In [388]: dff
Out[388]:
userId tag pageId
0 1234123 1 1
1 1234123 4 2
2 3122471 2 1
3 3122471 6 2
4 3122471 15 2
5 3122471 18 3
In [389]: dff.groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()])
Out[389]:
userId
1234123 4
3122471 18
dtype: int64
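Note that idxmax() returns the label of the first maximum it meets, so ties between tags are resolved by position in the aggregated frame; a custom rule like my_rule from the question would have to be applied on top, along the lines of the sketch shown under the question.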
Answer by VinceP
Group the original dataframe by userId:
df.groupby('userId').max()['tag']
or
df.groupby('userId', as_index=False)['tag'].max()
Note that the second solution is a factor of two faster
%timeit df.groupby('userId').max()['tag']
# 100 loops, best of 3: 5.69 ms per loop
%timeit df.groupby('userId', as_index=False)['tag'].max()
# 100 loops, best of 3: 2.43 ms per loop
Answer by jezrael
I think you need DataFrameGroupBy.size with DataFrameGroupBy.idxmax, but first reset_index:
See also: What is the difference between size and count in pandas?
df = df.groupby(['userId', 'tag'])['pageId'].size()
df = (df.reset_index(level='userId')
        .groupby('userId')['pageId'].idxmax()
        .reset_index(name='tag'))
print (df)
userId tag
0 1234123 4
1 3122471 18
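As a quick illustration of the size/count difference mentioned above (a toy example, not from the original answer): size() counts every row in a group, while count() skips NaN values.
import pandas as pd
import numpy as np

d = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(d.groupby('g')['v'].size())   # a -> 2, b -> 1 (rows per group)
print(d.groupby('g')['v'].count())  # a -> 1, b -> 1 (non-NaN values only)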
Timings:
np.random.seed(123)
N = 100000
df = pd.DataFrame(np.random.randint(1000, size=(N, 3)), columns= ['userId','pageId','tag'])
#print (df)
In [188]: %timeit (df.groupby(['userId', 'tag'], as_index=False)['pageId'].count().groupby('userId').apply(lambda x: x.tag[x.pageId.idxmax()]))
10 loops, best of 3: 180 ms per loop
In [189]: %timeit (df.groupby(['userId', 'tag'])['pageId'].size().reset_index(level='userId').groupby('userId')['pageId'].idxmax())
10 loops, best of 3: 103 ms per loop
VinceP's solution is wrong (df.groupby('userId').max()['tag'] returns each user's largest tag value rather than the most frequent tag; it only matches the desired output here by coincidence), so it is not included in the timings.
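A small counter-example (not from the original thread) showing where the max-based approach breaks: the most frequent tag for the user below is 1, but max() reports 9.
import pandas as pd

# tag 1 appears twice, tag 9 only once
d = pd.DataFrame({'userId': [1, 1, 1],
                  'pageId': ['a', 'b', 'c'],
                  'tag': [1, 1, 9]})
print(d.groupby('userId')['tag'].max())  # 9, not the most frequent tag (1)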