groupby, count and average in numpy, pandas in python
Original question: http://stackoverflow.com/questions/43456149/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Asked by Anand T
I have a dataframe that looks like this:
    userId  movieId  rating
0        1       31     2.5
1        1     1029     3.0
2        1     3671     3.0
3        2       10     4.0
4        2       17     5.0
5        3       60     3.0
6        3      110     4.0
7        3      247     3.5
8        4       10     4.0
9        4      112     5.0
10       5        3     4.0
11       5       39     4.0
12       5      104     4.0
I need to get a dataframe with one row per unique userId, holding the number of ratings by that user and the user's average rating, as shown below:
   userId  count  mean
0       1      3  2.83
1       2      2   4.5
2       3      3   3.5
3       4      2   4.5
4       5      3   4.0
Can someone help?
Accepted answer by Scott Boston
# Group by userId, then compute the count and mean of the rating column
df1 = df.groupby('userId')['rating'].agg(['count','mean']).reset_index()
print(df1)
   userId  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
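If the result should also match the two-decimal mean shown in the question (2.83), named aggregation plus a rounding step is one option. The following is only a minimal sketch: it assumes pandas 0.25 or later (where keyword-style named aggregation is available), rebuilds the sample data from the question, and treats the rounding to two decimals as an assumption read off the question's expected output. The name `summary` is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating': [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
summary = (
    df.groupby('userId')
      .agg(count=('rating', 'count'), mean=('rating', 'mean'))
      .round({'mean': 2})   # assumption: two decimals, as in the question
      .reset_index()
)
print(summary)
```

Rounding changes the stored values, so it is probably best left as a final display step rather than applied before further calculations.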
Answer by Kewl
Drop movieId since we're not using it, group by userId, and then apply the aggregation methods:
import pandas as pd

df = pd.DataFrame({'userId': [1,1,1,2,2,3,3,3,4,4,5,5,5],
                   'movieId': [31,1029,3671,10,17,60,110,247,10,112,3,39,104],
                   'rating': [2.5,3.0,3.0,4.0,5.0,3.0,4.0,3.5,4.0,5.0,4.0,4.0,4.0]})
# Drop the unused movieId column, group by userId, then count and average what remains
df = df.drop('movieId', axis=1).groupby('userId').agg(['count','mean'])
print(df)
Which produces:
       rating
        count      mean
userId
1           3  2.833333
2           2  4.500000
3           3  3.500000
4           2  4.500000
5           3  4.000000
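The output above has a two-level column header (rating over count and mean) and uses userId as the index, which differs from the flat layout requested in the question. One possible way to flatten it is sketched below; it rebuilds the sample data so it runs on its own, and the names `grouped` and `flat` are illustrative rather than part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating': [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})

grouped = df.drop('movieId', axis=1).groupby('userId').agg(['count', 'mean'])

# The columns are a MultiIndex: ('rating', 'count') and ('rating', 'mean').
# Drop the outer 'rating' level so only 'count' and 'mean' remain.
flat = grouped.copy()
flat.columns = flat.columns.droplevel(0)

# Move userId from the index back into an ordinary column.
flat = flat.reset_index()
print(flat)
```

Dropping the outer level is only safe here because a single value column (rating) is aggregated; with several value columns the inner names would collide.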
Answer by Divakar
Here's a NumPy-based approach that uses the fact that the userId column appears to be sorted:
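import numpy as np
import pandas as pd
# Unique user ids, the group index of each row, and the number of rows per user
unq, tags, count = np.unique(df.userId.values, return_inverse=True, return_counts=True)
# Per-user sum of ratings (weighted bincount) divided by the per-user count
mean_vals = np.bincount(tags, df.rating.values) / count
df_out = pd.DataFrame(np.c_[unq, count], columns=['userID', 'count'])
df_out['mean'] = mean_vals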
Sample run:
In [103]: df
Out[103]:
    userId  movieId  rating
0        1       31     2.5
1        1     1029     3.0
2        1     3671     3.0
3        2       10     4.0
4        2       17     5.0
5        3       60     3.0
6        3      110     4.0
7        3      247     3.5
8        4       10     4.0
9        4      112     5.0
10       5        3     4.0
11       5       39     4.0
12       5      104     4.0
In [104]: df_out
Out[104]:
   userID  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
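As a sanity check, the NumPy result can be compared with the pandas groupby result. The sketch below wraps the NumPy steps in a hypothetical helper, `numpy_summary` (not part of the answer, and it names the id column userId rather than userID), and runs it on the sample data from the question. Since np.unique sorts its input internally, the same steps should also work when the rows are not ordered by userId; the shuffled copy is there to illustrate that.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
    'rating': [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0],
})


def numpy_summary(frame):
    """Hypothetical helper: per-user count and mean via np.unique and np.bincount."""
    unq, tags, count = np.unique(
        frame['userId'].to_numpy(), return_inverse=True, return_counts=True
    )
    # Weighted bincount sums the ratings per group; dividing by the count gives the mean.
    mean_vals = np.bincount(tags, weights=frame['rating'].to_numpy()) / count
    return pd.DataFrame({'userId': unq, 'count': count, 'mean': mean_vals})


# Reference result from the pandas groupby approach.
expected = df.groupby('userId')['rating'].agg(['count', 'mean']).reset_index()

result_sorted = numpy_summary(df)
# Shuffle the rows to check that the input order does not matter.
result_shuffled = numpy_summary(df.sample(frac=1, random_state=0))

cols = ['userId', 'count', 'mean']
matches_sorted = np.allclose(result_sorted[cols], expected[cols])
matches_shuffled = np.allclose(result_shuffled[cols], expected[cols])
print(matches_sorted, matches_shuffled)
```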