Notice: this page is a side-by-side Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/19384532/

Time: 2020-08-19 13:37:28  Source: igfitidea

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

Tags: python, pandas, dataframe, group-by, pandas-groupby

Asked by Roman

I have a data frame df and I use several columns from it to groupby:

df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()

This way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to compute each mean. For example, in the first group there are 8 values, in the second one 10, and so on.

In short: How do I get group-wise statistics for a dataframe?

Accepted answer by Boud

On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
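
As a minimal sketch of what that call returns, here it is on a small made-up frame (the data values are an assumption for illustration; only the column names come from the question). Each aggregated column gets one sub-column per statistic:

```python
import pandas as pd

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C", "C"],
    "col2": ["B", "B", "B", "D", "D"],
    "col3": [1.0, 2.0, 3.0, 10.0, 20.0],
    "col4": [0.5, 1.5, 2.5, 4.0, 6.0],
})

stats = (
    df[["col1", "col2", "col3", "col4"]]
    .groupby(["col1", "col2"])
    .agg(["mean", "count"])
)

# Columns form a two-level MultiIndex: (source column, statistic).
print(stats)

# A single cell is addressed by a row tuple and a column tuple.
print(stats.loc[("A", "B"), ("col3", "count")])
```

The count here is the number of non-null values per column, so it can differ between columns when data is missing.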

Answer by Pedro M Duarte

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series), so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.



Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1



Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
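
A similar flat table can also be produced in one call with named aggregation, which is not part of the original answer and assumes pandas 0.25 or later. The sketch below reuses the first seven rows of the example data; the output column names are chosen to match the join version:

```python
import pandas as pd

# First seven rows of the example frame (groups A/B and C/D only).
df = pd.DataFrame({
    "col1": ["A", "A", "A", "A", "C", "C", "C"],
    "col2": ["B", "B", "B", "B", "D", "D", "D"],
    "col3": [0.20, -1.53, -0.44, 0.28, 0.12, -0.13, -1.42],
    "col4": [-0.61, -1.01, 0.27, -1.32, 0.59, -1.65, -0.11],
})

# Each keyword is an output column: name=(source column, aggregation).
summary = (
    df.groupby(["col1", "col2"])
    .agg(
        counts=("col3", "size"),        # rows per group, NaN included
        col3_mean=("col3", "mean"),
        col4_median=("col4", "median"),
        col4_min=("col4", "min"),
    )
    .reset_index()
)
print(summary)
```

Whether this or the join chain reads better is a matter of taste; the join version makes it easier to mix in aggregations computed separately.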





Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:


If some of the columns that you are aggregating have null values, then you really want to compute the row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas drops NaN entries in the mean calculation without telling you about it.
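
To make the difference concrete, the following sketch (toy data, assumed for illustration) compares .size(), which counts every row in the group, with .count(), which counts only the non-null values that actually feed into the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in col3.
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C", "C"],
    "col3": [1.0, np.nan, 3.0, 4.0, np.nan],
})

gb = df.groupby("col1")
summary = pd.DataFrame({
    "rows": gb.size(),                    # all rows, NaN included
    "col3_non_null": gb["col3"].count(),  # rows with a value in col3
    "col3_mean": gb["col3"].mean(),       # NaN entries are skipped
})
print(summary)
```

Group A has three rows but its mean is computed from only two values, which is exactly the situation the disclaimer warns about.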

Answer by Nimesh

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

Answer by cs95

One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()


# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them:

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to [['C', 'D']], or remove it altogether, and see what happens; the result is a DataFrame with MultiIndex columns).

You also get different statistics for string data. Here's an example:

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.

Answer by Mahendra

Create a groupby object and call methods on it, as in the example below:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Answer by Jake Drew

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df
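
As a rough illustration of what the flattened column names look like, here is the same recipe on a small made-up frame (the data is an assumption, and the result is stored in a new variable named out rather than overwriting df):

```python
import pandas as pd

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame({
    "col1": ["A", "A", "C"],
    "col2": ["B", "B", "D"],
    "col3": [1.0, 3.0, 5.0],
    "col4": [2.0, 4.0, 6.0],
})

out = df.groupby(["col1", "col2"]).agg(["mean", "count"])

# Join each (column, statistic) pair into a single flat label.
out.columns = [" ".join(str(i) for i in col) for col in out.columns]
out = out.reset_index()

print(out.columns.tolist())
print(out)
```

Joining with "_" instead of a space gives labels such as col3_mean, which are easier to use as attribute names.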


Answer by Ichsan

Please try this code:

# transform returns one value per original row, so it aligns with df
df['count_it'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
df

I think this code adds a column called 'count_it' that holds, for each row, the count of its group.