pandas Python: select most frequent using group by

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23692419/

Date: 2020-09-13 22:03:42  Source: igfitidea

Python: select most frequent using group by

python pandas

Asked by bensw

How do I get the most frequently occurring category (mode) for each tag in Python?

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+------------------+----------+

The expected output would look like this:

     tag     | category
-------------+----------
 ba          |        8
 automotive  |        8
 banana tree |        8
 bath        |        9
 bamboo      |        8

Because my data set is confidential, I have borrowed the table from Stephen J. Fuhry and the edited output from David Fuhry, both from the question "MySQL SELECT most frequent by group".

Answered by Karl D.

In the comments you note you're using pandas. You can do something like the following:

>>> df

           tag  category
0    automotive         8
1            ba         8
2        bamboo         8
3        bamboo         8
4        bamboo         8
5        bamboo         8
6        bamboo         8
7        bamboo        10
8        bamboo         8
9        bamboo         9
10       bamboo         8
11       bamboo        10
12       bamboo         8
13       bamboo         9
14       bamboo         8
15  banana tree         8
16  banana tree         8
17  banana tree         8
18  banana tree         8
19         bath         9

Do a groupby on 'tag' for the 'category' column and then, within each group, use the mode method. However, we have to make it conditional because, in the pandas version this answer was written for, pandas doesn't return a number for the mode if the number of observations is less than 3 (in the special cases of 1 or 2 observations in a group, we can just return the group itself). We can use the aggregate/agg method with a lambda function to do this:

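>>> import numpy as np
>>> # Use Series.mode for groups of 3+ rows; otherwise return the group's values as-is
>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)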

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9
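
To make the approach above reproducible, here is a minimal sketch that rebuilds the sample data and returns one scalar per group. The way df is constructed and the helper name group_mode are assumptions for illustration, not part of the original answer. Unlike the lambda above, the helper always picks the first value reported by mode, and falls back to the first observation only if mode comes back empty.

```python
import pandas as pd

# Sample data rebuilt from the table in the question (assumed construction).
rows = (
    [("automotive", 8), ("ba", 8)]
    + [("bamboo", c) for c in [8, 8, 8, 8, 8, 10, 8, 9, 8, 10, 8, 9, 8]]
    + [("banana tree", 8)] * 4
    + [("bath", 9)]
)
df = pd.DataFrame(rows, columns=["tag", "category"])


def group_mode(s):
    """Return one most frequent value of a group (hypothetical helper)."""
    modes = s.mode()
    # Fall back to the first observation if mode() returns nothing.
    return modes.iloc[0] if len(modes) else s.iloc[0]


result = df.groupby("tag")["category"].agg(group_mode)
print(result)
```

Returning a scalar keeps the result a plain numeric Series, at the cost of hiding ties.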

Note that when a group is multimodal you will get a NumPy array. For example, suppose there were two entries for bath (all the other data is the same):

tag|category
bath|9
bath|10

In that case the output would be:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive           8
ba                   8
bamboo               8
banana tree          8
bath           [9, 10]
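
If every group should come back in the same shape whether or not there is a tie, one option is to return a list of all modes per group. This is only a sketch: the small frame below is assumed sample data, and it assumes a recent pandas release in which Series.mode returns every tied value, including for groups of one or two rows.

```python
import pandas as pd

# Small frame containing the tied 'bath' group described above (assumed data).
df = pd.DataFrame(
    {
        "tag": ["ba", "bamboo", "bamboo", "bamboo", "bath", "bath"],
        "category": [8, 8, 8, 10, 9, 10],
    }
)

# apply is used here, rather than agg, because each group returns a list.
all_modes = df.groupby("tag")["category"].apply(lambda s: s.mode().tolist())
print(all_modes)
```

Each entry is then a list, so downstream code can check its length to detect multimodal groups.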

You can also use the value_counts method instead of mode. Once again, do a groupby on 'tag' for the 'category' column and then, within each group, use the value_counts method. value_counts sorts in descending order of frequency, so you want to grab the index of the first row:

>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

However, this won't return an array in multi-modal situations. It will just return the first mode.
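
Because the value_counts approach silently picks one value when a group is tied, it can help to flag such groups. The following is a minimal sketch with assumed sample data; the names first_mode, n_tied and count_tied are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Assumed sample data: 'bamboo' has a clear mode, 'bath' is tied between 9 and 10.
df = pd.DataFrame(
    {
        "tag": ["bamboo", "bamboo", "bamboo", "bath", "bath"],
        "category": [8, 8, 10, 9, 10],
    }
)

grouped = df.groupby("tag")["category"]

# Same idea as above: the first index label of value_counts is one mode.
first_mode = grouped.agg(lambda s: s.value_counts().index[0])


def count_tied(s):
    """Count how many distinct values share the highest frequency (hypothetical helper)."""
    counts = s.value_counts()
    return int((counts == counts.iloc[0]).sum())


n_tied = grouped.agg(count_tied)

print(pd.DataFrame({"first_mode": first_mode, "n_tied": n_tied}))
```

A group with n_tied greater than 1 is multimodal, so the value in first_mode is only one of several equally frequent categories.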
