Counting categorical data pandas group by dataframe
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not the translator): StackOverflow
Original URL: http://stackoverflow.com/questions/31080383/
Asked by metersk
I have a data frame that looks like this:
+---+-----------+----------------+-------+
|   | uid       | msg            | count |
+---+-----------+----------------+-------+
| 0 | 121437681 | eis            | 1     |
| 1 | 14403832  | eis            | 1     |
| 2 | 190442364 | eis            | 1     |
| 3 | 190102625 | eis            | 1     |
| 4 | 190428772 | eis_reply      | 1     |
| 5 | 190428772 | single_message | 1     |
| 6 | 190428772 | yes            | 1     |
| 7 | 190104837 | eis            | 1     |
| 8 | 144969454 | eis            | 1     |
| 9 | 190738403 | eis            | 1     |
+---+-----------+----------------+-------+
What I would like to do is count the instances of each msg for each uid.
I created a groupby object and found the count of all messages:
grouped_test = test.groupby('uid')
grouped_test['msg'].count()
But I'm not quite sure how to count each type of message for each uid. I was thinking about creating masks and 4 separate data frames, but that doesn't seem like an efficient way to accomplish this.
Sample data - http://www.sharecsv.com/s/16573757eb123c5b15cae4edcb7296e3/sample_data.csv
Answer by BrenBarn
Group by uid and apply value_counts to the msg column:
>>> d.groupby('uid').msg.value_counts()
uid
14403832   eis               1
121437681  eis               1
144969454  eis               1
190102625  eis               1
190104837  eis               1
190170637  eis               1
190428772  eis               1
           single_message    1
           yes               1
           eis_reply         1
190442364  eis               1
190738403  eis               1
190991478  single_message    1
           eis_reply         1
           yes               1
191356453  eis               1
191619393  eis               1
dtype: int64
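As a self-contained illustration of this approach, here is a minimal sketch that rebuilds only the ten rows shown in the question (the linked sample CSV contains more) and applies value_counts per uid. The `unstack(fill_value=0)` step and the names `counts` and `wide` are additions for illustration, not part of the original answer; they show one way to get a single row per uid with a column per message type.

```python
import pandas as pd

# The ten rows from the question's table (the full sample CSV has more rows).
df = pd.DataFrame({
    "uid": [121437681, 14403832, 190442364, 190102625, 190428772,
            190428772, 190428772, 190104837, 144969454, 190738403],
    "msg": ["eis", "eis", "eis", "eis", "eis_reply",
            "single_message", "yes", "eis", "eis", "eis"],
    "count": [1] * 10,
})

# One entry per (uid, msg) pair that actually occurs.
counts = df.groupby("uid")["msg"].value_counts()

# Optional wide layout (an assumption about what is convenient downstream):
# one row per uid, one column per message type, 0 where a type never occurs.
wide = counts.unstack(fill_value=0)
print(wide)
```

The long form (`counts`) is usually easier to filter or merge, while the wide form is easier to read when there are only a few message types.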
Answer by Alexander
Apply groupby on both uid and msg, and then sum the count of each:
>>> df.groupby(['uid', 'msg'])['count'].sum()
uid        msg
14403832   eis               1
121437681  eis               1
144969454  eis               1
190102625  eis               1
190104837  eis               1
190170637  eis               1
190428772  eis               1
           eis_reply         1
           single_message    1
           yes               1
190442364  eis               1
190738403  eis               1
190991478  eis_reply         1
           single_message    1
           yes               1
191356453  eis               1
191619393  eis               1
Name: count, dtype: int64
You can reset the index to retrieve the flattened version:
>>> df.groupby(['uid', 'msg'])['count'].sum().reset_index()
          uid             msg  count
0    14403832             eis      1
1   121437681             eis      1
2   144969454             eis      1
3   190102625             eis      1
4   190104837             eis      1
5   190170637             eis      1
6   190428772             eis      1
7   190428772       eis_reply      1
8   190428772  single_message      1
9   190428772             yes      1
10  190442364             eis      1
11  190738403             eis      1
12  190991478       eis_reply      1
13  190991478  single_message      1
14  190991478             yes      1
15  191356453             eis      1
16  191619393             eis      1
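Summing the count column gives the right answer here because every row carries a count of 1. If a frame had no such column, counting rows per group would be an alternative. The sketch below is not from either answer: it uses a small hypothetical frame (the repeated "yes" row is invented to make the counts visible) and the names `sizes`, `flat`, `table` and `n` are illustrative only.

```python
import pandas as pd

# A hypothetical four-row frame in the same shape as the question's data,
# without a 'count' column.
df = pd.DataFrame({
    "uid": [190428772, 190428772, 190428772, 121437681],
    "msg": ["eis_reply", "yes", "yes", "eis"],
})

# size() counts rows per (uid, msg) group, so no 'count' column is needed.
sizes = df.groupby(["uid", "msg"]).size()

# Flatten to ordinary columns; 'n' is an arbitrary name for the count column.
flat = sizes.reset_index(name="n")

# crosstab builds the uid-by-msg frequency table directly, with 0 for
# combinations that never occur.
table = pd.crosstab(df["uid"], df["msg"])
print(table)
```

`size()` keeps the long layout used in the answers above, while `pd.crosstab` returns the wide layout in one call.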

