Python GroupBy pandas DataFrame 并选择最常见的值

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15222754/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 19:33:28  来源:igfitidea点击:

GroupBy pandas DataFrame and select most common value

pythonpandasgroup-bypandas-groupbymode

提问by Viacheslav Nefedov

I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group by data frame by first two columns and select most common value of the third column for each combination.

我有一个包含三个字符串列的数据框。我知道第 3 列中唯一的一个值对前两个的每个组合都有效。为了清理数据,我必须按前两列按数据框分组,并为每个组合选择第三列的最常见值。

My code:

我的代码:

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

Last line of code doesn't work, it says "Key error 'Short name'" and if I try to group only by City, then I got an AssertionError. What can I do fix it?

最后一行代码不起作用,它说“关键错误'短名称'”,如果我尝试仅按城市分组,则会出现断言错误。我能做些什么来解决它?

采纳答案by HYRY

You can use value_counts()to get a count series, and get the first row:

您可以使用value_counts()获取计数系列,并获取第一行:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

In case you are wondering about performing other agg functions in the .agg() try this.

如果您想知道在 .agg() 中执行其他 agg 函数,试试这个。

# Let's add a new col,  account
source['account'] = [1,2,3,3]

source.groupby(['Country','City']).agg(mod  = ('Short name', \
                                        lambda x: x.value_counts().index[0]),
                                        avg = ('account', 'mean') \
                                      )

回答by eumiro

For agg, the lambba function gets a Series, which does not have a 'Short name'attribute.

对于agg,lamba 函数得到一个Series,它没有'Short name'属性。

stats.modereturns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

stats.mode返回一个由两个数组组成的元组,因此您必须取该元组中第一个数组的第一个元素。

With these two simple changements:

通过这两个简单的更改:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

返回

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

回答by abw333

A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

这里的游戏有点晚了,但是我在 HYRY 的解决方案中遇到了一些性能问题,所以我不得不想出另一个。

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

它的工作原理是找到每个键值的频率,然后,对于每个键,只保留最常出现的值。

There's also an additional solution that supports multiple modes.

还有一个支持多种模式的附加解决方案。

On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

在代表我正在处理的数据的规模测试中,这将运行时间从 37.4 秒减少到 0.5 秒!

Here's the code for the solution, some example usage, and the scale test:

这是解决方案的代码、一些示例用法和规模测试:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the mode.                                                                                                                                                                                                                                                                                                         

    The output is a DataFrame with a record per group that has at least one mode                                                                                                                                                                                                                                                                                     
    (null values are not counted). The `key_cols` are included as columns, `value_col`                                                                                                                                                                                                                                                                               
    contains a mode (ties are broken arbitrarily and deterministically) for each                                                                                                                                                                                                                                                                                     
    group, and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                 
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the modes.                                                                                                                                                                                                                                                                                                        

    The output is a DataFrame with a record per group that has at least                                                                                                                                                                                                                                                                                              
    one mode (null values are not counted). The `key_cols` are included as                                                                                                                                                                                                                                                                                           
    columns, `value_col` contains lists indicating the modes for each group,                                                                                                                                                                                                                                                                                         
    and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                        
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start

Running this code will print something like:

运行此代码将打印如下内容:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!

希望这可以帮助!

回答by nunodsousa

Formally, the correct answer is the @eumiro Solution. The problem of @HYRY solution is that when you have a sequence of numbers like [1,2,3,4] the solution is wrong, i. e., you don't have the mode. Example:

正式地,正确答案是@eumiro 解决方案。@HYRY 解决方案的问题是,当您有像 [1,2,3,4] 这样的数字序列时,解决方案是错误的,即您没有mode。例子:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute like @HYRY you obtain:

如果你像@HYRY一样计算,你会得到:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

Which is clearly wrong (see the Avalue that should be 1and not 4) because it can't handle with unique values.

这显然是错误的(请参阅应该是1而不是4A值),因为它无法处理唯一值。

Thus, the other solution is correct:

因此,另一个解决方案是正确的:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

回答by Dimitri

A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to only retain the largest cases. The code example is following:

对于较大的数据集,一种稍微笨拙但更快的方法是获取感兴趣列的计数,将计数从高到低排序,然后对子集进行重复数据删除以仅保留最大的案例。代码示例如下:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

回答by Diego Perez Sastre

The problem hereis the performance, if you have a lot of rows it will be a problem.

这里的问题是性能,如果你有很多行,那将是一个问题。

If it is your case, please try with this:

如果是您的情况,请尝试以下操作:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby['Country','City']).first()

回答by kmader

If you want another approach for solving it that is does not depend on value_countsor scipy.statsyou can use the Countercollection

如果您想要另一种不依赖于解决它的方法,value_counts或者scipy.stats您可以使用该Counter集合

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

Which can be applied to the above example like this

可以像这样应用于上面的例子

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)

回答by cs95

Pandas >= 0.16

熊猫 >= 0.16

pd.Series.modeis available!

pd.Series.mode可用!

Use groupby, GroupBy.agg, and apply the pd.Series.modefunction to each group:

使用groupby, GroupBy.agg, 并将pd.Series.mode函数应用于每个组:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

如果需要将其作为 DataFrame,请使用

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.modeis that it always returns a Series, making it very compatible with aggand apply, especially when reconstructing the groupby output. It is also faster.

有用的Series.mode是它总是返回一个系列,使其与aggand非常兼容apply,特别是在重建 groupby 输出时。它也更快。

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Dealing with Multiple Modes

处理多种模式

Series.modealso does a good job when there are multiplemodes:

Series.mode当有多种模式时也做得很好:

source2 = source.append(
    pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

或者,如果您希望每种模式都有单独的一行,您可以使用GroupBy.apply

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't carewhich mode is returned as long as it's either one of them, then you will need a lambda that calls modeand extracts the first result.

如果您不在乎返回哪种模式,只要它是其中之一,那么您将需要一个调用mode并提取第一个结果的 lambda 。

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object


Alternatives to (not) consider

替代(不)考虑

You can also use statistics.modefrom python, but...

您也可以statistics.mode从 python 中使用,但是...

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

...it does not work well when having to deal with multiple modes; a StatisticsErroris raised. This is mentioned in the docs:

...在必须处理多种模式时效果不佳;aStatisticsError被提出。这在文档中提到:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

如果 data 为空,或者如果没有一个最常见的值,则会引发 StatisticsError 。

But you can see for yourself...

但是你可以亲眼看看...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values

回答by Josh Friedlander

The two top answers here suggest:

这里的两个顶级答案建议:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

或者,最好

df.groupby(cols).agg(pd.Series.mode)

However both of these fail in simple edge cases, as demonstrated here:

然而,这两种方法都在简单的边缘情况下失败,如下所示:

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})

The first:

首先:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError(because of the empty Series returned by group C). The second:

产量IndexError(因为 group 返回的空系列C)。第二:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)

返回ValueError: Function does not reduce,因为第一组返回两个列表(因为有两种模式)。(如记录在这里,如果第一批返回的单一模式,这会工作!)

Two possible solutions for this case are:

这种情况的两种可能的解决方案是:

import scipy
x.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

以及 cs95 在此处的评论中给我的解决方案:

def foo(x): 
    m = pd.Series.mode(x); 
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):

但是,所有这些都很慢,不适合大型数据集。我最终使用的解决方案 a) 可以处理这些情况并且 b) 快得多,是 abw33 答案的轻微修改版本(应该更高):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.NaN))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one col at a time and outputs a df, so instead of concat, which is intensive, you treat the first as a df, and then iteratively add the output array (values.flatten()) as a column in the df.

本质上,该方法一次作用于一个 col 并输出一个 df,因此concat您将第一个视为 df,然后迭代地将输出数组 ( values.flatten())添加为 df 中的列,而不是密集的 。

回答by irene

If you don't want to include NaN values, using Counteris much much faster than pd.Series.modeor pd.Series.value_counts()[0]:

如果您不想包含 NaN 值,使用Counterpd.Series.modeor快得多pd.Series.value_counts()[0]

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN will be counted separately.

应该管用。当您有 NaN 值时,这将失败,因为每个 NaN 将被单独计算。