Groupby, count, and mean in pandas/NumPy (Python)

Question

Asked by Anand T

I have a dataframe that looks like this:

       userId  movieId  rating
0           1       31     2.5
1           1     1029     3.0
2           1     3671     3.0
3           2       10     4.0
4           2       17     5.0
5           3       60     3.0
6           3      110     4.0
7           3      247     3.5
8           4       10     4.0
9           4      112     5.0
10          5        3     4.0
11          5       39     4.0
12          5      104     4.0

I need a dataframe with one row per unique userId, giving the number of ratings and the average rating for that user, as shown below:

       userId    count    mean
0           1        3    2.83
1           2        2     4.5
2           3        3     3.5
3           4        2     4.5
4           5        3     4.0

Can someone help?

Answer 1

Accepted answer by Scott Boston

df1 = df.groupby('userId')['rating'].agg(['count','mean']).reset_index()
print(df1)


   userId  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
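
As a variation on the answer above, pandas also supports named aggregation, which lets you pick the output column names directly. The following is only a sketch: it rebuilds the sample data from the question, the variable name `summary` is my own, and it assumes a pandas version with named aggregation (0.25 or later). The rounding is for display only, to match the two-decimal layout in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating':  [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})

# Named aggregation: output_column=(input_column, aggregation_function).
summary = (
    df.groupby('userId')
      .agg(count=('rating', 'count'), mean=('rating', 'mean'))
      .reset_index()  # turn the userId index back into a regular column
)

# Round a copy for display; `summary` keeps the full-precision means.
print(summary.round({'mean': 2}))
```

The result should have the same three columns (userId, count, mean) as the output shown above.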

Answer 2

Answer by Kewl

Drop movieId since we're not using it, group by userId, and then apply the aggregation methods:

import pandas as pd

df = pd.DataFrame({'userId': [1,1,1,2,2,3,3,3,4,4,5,5,5],
                  'movieId':[31,1029,3671,10,17,60,110,247,10,112,3,39,104],
                  'rating':[2.5,3.0,3.0,4.0,5.0,3.0,4.0,3.5,4.0,5.0,4.0,4.0,4.0]})

df = df.drop('movieId', axis=1).groupby('userId').agg(['count','mean'])

print(df)

Which produces:

       rating          
        count      mean
userId                 
1           3  2.833333
2           2  4.500000
3           3  3.500000
4           2  4.500000
5           3  4.000000
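
The output above has two-level column labels (rating/count, rating/mean) and userId as the index, which differs from the flat layout the question asks for. One possible way to flatten it is sketched below; the names `out` and `flat` are my own, and the data is rebuilt from the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating':  [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})

# Same aggregation as in the answer; columns are a two-level MultiIndex:
# ('rating', 'count') and ('rating', 'mean').
out = df.drop('movieId', axis=1).groupby('userId').agg(['count', 'mean'])

flat = out.copy()
flat.columns = flat.columns.droplevel(0)  # keep only 'count' and 'mean'
flat = flat.reset_index()                 # move userId from the index to a column

print(flat)
```

Dropping the top level is safe here because only one column (rating) is aggregated; with several aggregated columns the inner names would collide, and joining the two levels into a single label would be a better fit.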

Answer 3

Answer by Divakar

Here's a NumPy-based approach, using the fact that the userId column appears to be sorted:

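import numpy as np
import pandas as pd

# unq: unique user ids; tags: group index of each row; count: rows per user
unq, tags, count = np.unique(df.userId.values, return_inverse=True, return_counts=True)
# Sum the ratings per group, then divide by the group size to get the mean
mean_vals = np.bincount(tags, weights=df.rating.values) / count
df_out = pd.DataFrame(np.c_[unq, count], columns=['userID', 'count'])
df_out['mean'] = mean_vals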

Sample run -

In [103]: df
Out[103]: 
    userId  movieId  rating
0        1       31     2.5
1        1     1029     3.0
2        1     3671     3.0
3        2       10     4.0
4        2       17     5.0
5        3       60     3.0
6        3      110     4.0
7        3      247     3.5
8        4       10     4.0
9        4      112     5.0
10       5        3     4.0
11       5       39     4.0
12       5      104     4.0

In [104]: df_out
Out[104]: 
   userID  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
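
The answer relies on userId looking sorted. As far as I can tell, `np.unique` returns sorted unique values and `return_inverse` labels each row by its group, so the same steps should also work on unsorted rows. The sketch below is a quick check of that assumption, not part of the original answer: it shuffles the question's data with a fixed seed and compares the NumPy result with a pandas groupby. The names `shuffled`, `expected`, and `matches` are my own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating':  [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})

# Shuffle the rows so userId is no longer sorted (fixed seed for repeatability).
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Same steps as the answer, applied to the shuffled rows.
unq, tags, count = np.unique(
    shuffled['userId'].to_numpy(), return_inverse=True, return_counts=True
)
mean_vals = np.bincount(tags, weights=shuffled['rating'].to_numpy()) / count
df_out = pd.DataFrame({'userId': unq, 'count': count, 'mean': mean_vals})

# Reference result from pandas for comparison.
expected = shuffled.groupby('userId')['rating'].agg(['count', 'mean']).reset_index()
matches = np.allclose(df_out.to_numpy(dtype=float), expected.to_numpy(dtype=float))
print(matches)
```

Building `df_out` from a dict keeps the count column as integers and the mean column as floats, and avoids stacking the columns into a single array first.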
