pandas 熊猫中的聚合

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/53781634/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 06:12:55  来源:igfitidea点击:

Aggregation in pandas

pythonpandasdataframepandas-groupbyaggregation

提问by jezrael

  1. How to perform aggregation with pandas?
  2. No DataFrame after aggregation! What happened?
  3. How to aggregate mainly strings columns (to lists, tuples, strings with separator)?
  4. How to aggregate counts?
  5. How to create new column filled by aggregated values?
  1. 如何使用Pandas进行聚合?
  2. 聚合后没有 DataFrame!发生了什么?
  3. 如何主要聚合字符串列(到lists, tuples, strings with separator)?
  4. 如何聚合计数?
  5. 如何创建由聚合值填充的新列?

I've seen these recurring questions asking about various faces of the pandas aggregate functionality. Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

我已经看到这些反复出现的问题询问Pandas聚合功能的各个方面。今天关于聚合及其各种用例的大部分信息都分散在数十个措辞恶劣、无法搜索的帖子中。这里的目的是为后代整理一些更重要的观点。

This Q/A is meant to be the next instalment in a series of helpful user-guides:

本问答是一系列有用的用户指南的下一部分:

Please note that this post is not meant to be a replacement for the documentation about aggregationand about groupby, so please read that as well!

请注意,这篇文章并不是要替代关于聚合groupby文档,所以也请阅读它!

回答by jezrael

Question 1

问题 1

How to perform aggregation with pandas ?

如何使用 Pandas 进行聚合?

Expanded aggregation documentation.

扩展的聚合文档

Aggregating functions are the ones that reduce the dimension of the returned objects. It means output Series/DataFrame have less or same rows like original. Some common aggregating functions are tabulated below:

聚合函数是减少返回对象维度的函数。这意味着输出 Series/DataFrame 与原始行有更少或相同的行。下表列出了一些常见的聚合函数:

Function    Description
mean()      Compute mean of groups
sum()       Compute sum of group values
size()      Compute group sizes
count()     Compute count of group
std()       Standard deviation of groups
var()       Compute variance of groups
sem()       Standard error of the mean of groups
describe()  Generates descriptive statistics
first()     Compute first of group values
last()      Compute last of group values
nth()       Take nth value, or a subset if n is a list
min()       Compute min of group values
max()       Compute max of group values
np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D  E
0  foo    one  2  3  0
1  foo    two  4  1  0
2  bar  three  2  1  1
3  foo    two  1  0  3
4  bar    two  3  1  4
5  foo    one  2  1  0

Aggregation by filtered columns and cython implemented functions:

按过滤列和cython 实现的函数聚合:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

Aaggregate function is using for all columns without specified in groupbyfunction, here A, Bcolumns:

Aaggregate 函数用于所有没有在groupby函数中指定的A, B列,这里的列:

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

You can also specified only some columns used for aggregation in list after groupbyfunction:

您还可以在groupby函数后的列表中仅指定用于聚合的某些列:

df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum()
print (df3)
     A      B  C  D
0  bar  three  2  1
1  bar    two  3  1
2  foo    one  4  4
3  foo    two  5  1

Same results by using function DataFrameGroupBy.agg:

使用功能相同的结果DataFrameGroupBy.agg

df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

For multiplied functions applied for one column use list of tuples - names of new columns and aggregted functions:

对于应用于一列的乘法函数,使用tuples列表- 新列和聚合函数的名称:

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
     A      B  average  total
0  bar  three      2.0      2
1  bar    two      3.0      3
2  foo    one      2.0      4
3  foo    two      2.5      5

If want to pass multiple functions is possible pass listof tuples:

如果想通过多种功能,可以通listtupleS:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
                C             D             E      
          average total average total average total
A   B                                              
bar three     2.0     2     1.0     1     1.0     1
    two       3.0     3     1.0     1     4.0     4
foo one       2.0     4     2.0     4     0.0     0
    two       2.5     5     0.5     1     1.5     3

Then get MultiIndexin columns:

然后进入MultiIndex列:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

And for converting to columns, flattening MultiIndexuse mapwith join:

并转换成列,压扁MultiIndex使用map具有join

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

Another solution is pass list of aggregate functions, then flatten MultiIndexand for another columns names use str.replace:

另一个解决方案是传递聚合函数列表,然后展平MultiIndex并为其他列名称使用str.replace

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                  .str.replace('sum','total')
                  .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

If want specified each column with aggregated function separately pass dictionary:

如果要分别指定带有聚合函数的每一列,请传递dictionary

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
     A      B  C_total  D_average
0  bar  three        2        1.0
1  bar    two        3        1.0
2  foo    one        4        2.0
3  foo    two        5        0.5

You can pass custom function too:

您也可以传递自定义函数:

def func(x):
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
     A      B  C_total  D_sum_first_and_last
0  bar  three        2                     2
1  bar    two        3                     2
2  foo    one        4                     4
3  foo    two        5                     1

Question 2

问题2

No DataFrame after aggregation! What happened?

聚合后没有 DataFrame!发生了什么?

Aggregation by 2 or more columns:

按 2 列或更多列聚合:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A    B    
bar  three    2
     two      3
foo  one      4
     two      5
Name: C, dtype: int32

First check Indexand typeof pandas object:

首先检查IndextypePandas对象:

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
           names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

There are 2 solutions how get MultiIndex Seriesto columns:

有两种解决方案如何获取MultiIndex Series列:

  • add parameter as_index=False
  • 添加参数 as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5


If group by one column:

如果按一列分组:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar    5
foo    9
Name: C, dtype: int32

... get Serieswith Index:

...获取SeriesIndex

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

And solution is same like in MultiIndex Series:

和解决方案是一样的MultiIndex Series

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
     A  C
0  bar  5
1  foo  9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
     A  C
0  bar  5
1  foo  9

Question 3

问题 3

How to aggregate mainly strings columns (to lists, tuples, strings with separator)?

如何主要聚合字符串列(到lists, tuples, strings with separator)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                   'D' : [1,2,3,2,3,1,2]})
print (df)
   A      B      C  D
0  a    one  three  1
1  c    two    one  2
2  b  three    two  3
3  b    two    two  2
4  a    two  three  3
5  c    one    two  1
6  b  three    one  2

Instead of aggregeta function is possible pass list, tuple, setfor converting column:

代替 agregeta 函数是可能的 pass list, tuple,set用于转换列:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

Alternative is use GroupBy.apply:

替代方法是使用GroupBy.apply

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

For converting to strings with separator use .joinonly if string column:

.join仅当字符串列使用分隔符转换为字符串时:

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
   A                B
0  a          one,two
1  b  three,two,three
2  c          two,one

If numeric column use lambda function with astypefor converting to strings:

如果数字列使用 lambda 函数 withastype转换为strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

Another solution is converting to strings before groupby:

另一个解决方案是在之前转换为字符串groupby

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

For converting all columns pass no list of column(s) after groupby. There is no column Dbecause automatic exclusion of 'nuisance' columns, it means all numeric columns are excluded.

为了转换所有列,在groupby. 没有列,D因为自动排除 'nuisance' 列,这意味着排除所有数字列。

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
   A                B            C
0  a          one,two  three,three
1  b  three,two,three  two,two,one
2  c          two,one      one,two

So it's necessary to convert all columns into strings, then get all columns:

所以有必要将所有列转换为字符串,然后获取所有列:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
   A                B            C      D
0  a          one,two  three,three    1,3
1  b  three,two,three  two,two,one  3,2,2
2  c          two,one      one,two    2,1

Question 4

问题 4

How to aggregate counts?

如何聚合计数?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                   'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
   A      B      C    D
0  a    one  three  NaN
1  c    two    NaN  2.0
2  b  three    NaN  3.0
3  b    two    two  2.0
4  a    two  three  3.0
5  c    one    two  NaN
6  b  three    one  2.0

Function GroupBy.sizefor sizeof each group:

功能GroupBy.sizesize各组:

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
   A  COUNT
0  a      2
1  b      3
2  c      2

Function GroupBy.countexclude missing values:

函数GroupBy.count排除缺失值:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
   A  COUNT
0  a      2
1  b      2
2  c      1

Function should be used fo multiple columns for count non missing values:

函数应该用于多列计数非缺失值:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
   A  B_COUNT  C_COUNT  D_COUNT
0  a        2        2        1
1  b        3        2        3
2  c        2        1        1

Related function Series.value_countsreturn size object containing counts of unique values in descending order so that the first element is the most frequently-occurring element. Excludes NaNs values by default.

相关函数Series.value_counts返回包含按降序排列的唯一值计数的大小对象,以便第一个元素是最常出现的元素。NaN默认情况下排除s 值。

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
   A  COUNT
0  b      3
1  a      2
2  c      2

If you want same output like using function groupby+ sizeadd Series.sort_index:

如果您想要与使用 function groupby+ sizeadd相同的输出Series.sort_index

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
   A  COUNT
0  a      2
1  b      3
2  c      2

Question 5

问题 5

How to create new column filled by aggregated values?

如何创建由聚合值填充的新列?

Method GroupBy.transformreturns an object that is indexed the same (same size) as the one being grouped

方法GroupBy.transform返回一个与被分组的对象索引相同(相同大小)的对象

Pandas documentationfor more information.

Pandas文档以获取更多信息。

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                    'B' : ['one', 'two', 'three','two', 'two', 'one'],
                    'C' : np.random.randint(5, size=6),
                    'D' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D
0  foo    one  2  3
1  foo    two  4  1
2  bar  three  2  1
3  foo    two  1  0
4  bar    two  3  1
5  foo    one  2  1


df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')


df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum')

print (df)

     A      B  C  D  C1  C2  C3  D3  C4  D4
0  foo    one  2  3   9   4   9   5   4   4
1  foo    two  4  1   9   5   9   5   5   1
2  bar  three  2  1   5   2   5   2   2   1
3  foo    two  1  0   9   5   9   5   5   1
4  bar    two  3  1   5   3   5   2   3   1
5  foo    one  2  1   9   4   9   5   4   4

回答by Abhishek R

If you are coming from an R or SQL background here are 3 examples that will teach you everything you need to do aggregation the way you are already familiar with:

如果您有 R 或 SQL 背景,这里有 3 个示例,它们将教您以您已经熟悉的方式进行聚合所需的一切:

Let us first create a Pandas dataframe

让我们首先创建一个 Pandas 数据框

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'key2' : ['a','a','b','b','c'],
                   'value1' : [1,2,2,3,3],
                   'value2' : [9,8,7,6,5]})

df.head(5)

Here is how the table we created looks like:

这是我们创建的表的样子:

|----------------|-------------|------------|------------|
|      key1      |     key2    |    value1  |    value2  |
|----------------|-------------|------------|------------|
|       a        |       a     |      1     |       9    |
|       a        |       a     |      2     |       8    |
|       a        |       b     |      2     |       7    |
|       b        |       b     |      3     |       6    |
|       a        |       c     |      3     |       5    |
|----------------|-------------|------------|------------|

1. Aggregating With Row Reduction Similar to SQL Group By

1. 类似于SQL的行缩减聚合 Group By

df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'), 
                                         sum_of_value_2=('value2', 'sum'),
                                         count_of_value1=('value1','size')
                                         ).reset_index()


df_agg.head(5)

The resulting data table will look like this:

生成的数据表将如下所示:

|----------------|-------------|--------------------|-------------------|---------------------|
|      key1      |     key2    |    mean_of_value1  |    sum_of_value2  |    count_of_value1  |
|----------------|-------------|--------------------|-------------------|---------------------|
|       a        |      a1     |         1.5        |        17         |           2         |
|       a        |      b1     |         2.0        |         7         |           1         |   
|       a        |      c1     |         3.0        |         5         |           1         |        
|       b        |      b1     |         3.0        |         6         |           1         |     
|----------------|-------------|--------------------|-------------------|---------------------|

The SQL Equivalent of this is:

等效的 SQL 是:

SELECT
      key1
     ,key2
     ,AVG(value1) AS mean_of_value_1
     ,SUM(value2) AS sum_of_value_2
     ,COUNT(*) AS count_of_value1
FROM
    df
GROUP BY
     key1
    ,key2

2. Create Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)

2. 创建列而不减少行 ( EXCEL - SUMIF, COUNTIF)

If you want to do a SUMIF, COUNTIF etc like how you would do in Excel where there is no reduction in rows then you need to do this instead.

如果你想做一个 SUMIF、COUNTIF 等,就像你在 Excel 中所做的那样,没有减少行数,那么你需要这样做。

df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')

df.head(5)

The resulting data frame will look like this with the same number of rows as the original:

生成的数据框将如下所示,其行数与原始数据相同:

|----------------|-------------|------------|------------|-------------------------|
|      key1      |     key2    |    value1  |    value2  | Total_of_value1_by_key1 |
|----------------|-------------|------------|------------|-------------------------|
|       a        |       a     |      1     |       9    |            8            |
|       a        |       a     |      2     |       8    |            8            |
|       a        |       b     |      2     |       7    |            8            |
|       b        |       b     |      3     |       6    |            3            |
|       a        |       c     |      3     |       5    |            8            |
|----------------|-------------|------------|------------|-------------------------|

3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)

3. 创建 RANK 列 ROW_NUMBER() OVER (PARTITION BY ORDER BY)

Finally, there might be cases where you want to create a Rank column which is the SQL Equivalent of ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 DESC, Data2 ASC)

最后,在某些情况下,您可能希望创建一个 Rank 列,该列是 SQL 等价的 ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 DESC, Data2 ASC)

Here is how you do that.

这是你如何做到的。

 df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
              .groupby(['key1']) \
              .cumcount() + 1

 df.head(5) 

Note: we make the code multi-line by adding \in the end of each line.

注意:我们通过\在每行的末尾添加来使代码多行。

Here is how the resulting data frame looks like:

生成的数据框如下所示:

|----------------|-------------|------------|------------|------------|
|      key1      |     key2    |    value1  |    value2  |     RN     |
|----------------|-------------|------------|------------|------------|
|       a        |       a     |      1     |       9    |      4     |
|       a        |       a     |      2     |       8    |      3     |
|       a        |       b     |      2     |       7    |      2     |
|       b        |       b     |      3     |       6    |      1     |
|       a        |       c     |      3     |       5    |      1     |
|----------------|-------------|------------|------------|------------|

In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.

在上面的所有示例中,最终数据表将具有表结构,并且不会具有您可能在其他语法中获得的数据透视结构。

Other aggregating operators:

其他聚合运算符:

mean()Compute mean of groups

mean()计算组的平均值

sum()Compute sum of group values

sum()计算组值的总和

size()Compute group sizes

size()计算组大小

count()Compute count of group

count()计算组数

std()Standard deviation of groups

std()组的标准差

var()Compute variance of groups

var()计算组的方差

sem()Standard error of the mean of groups

sem()组均值的标准误

describe()Generates descriptive statistics

describe()生成描述性统计

first()Compute first of group values

first()计算第一个组值

last()Compute last of group values

last()计算组值的最后一个

nth()Take nth value, or a subset if n is a list

nth()如果 n 是列表,则取第 n 个值或子集

min()Compute min of group values

min()计算组值的最小值

max()Compute max of group values

max()计算组值的最大值

Hope this helps.

希望这可以帮助。