pandas 熊猫中的聚合
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/53781634/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Aggregation in pandas
提问by jezrael
- How to perform aggregation with pandas?
- No DataFrame after aggregation! What happened?
- How to aggregate mainly strings columns (to
list
s,tuple
s,strings with separator
)? - How to aggregate counts?
- How to create new column filled by aggregated values?
- 如何使用Pandas进行聚合?
- 聚合后没有 DataFrame!发生了什么?
- 如何主要聚合字符串列(到
list
s,tuple
s,strings with separator
)? - 如何聚合计数?
- 如何创建由聚合值填充的新列?
I've seen these recurring questions asking about various faces of the pandas aggregate functionality. Most of the information regarding aggregation and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
我已经看到这些反复出现的问题询问Pandas聚合功能的各个方面。今天关于聚合及其各种用例的大部分信息都分散在数十个措辞恶劣、无法搜索的帖子中。这里的目的是为后代整理一些更重要的观点。
This Q/A is meant to be the next instalment in a series of helpful user-guides:
本问答是一系列有用的用户指南的下一部分:
- How to pivot a dataframe,
- Pandas concat
- How do I operate on a DataFrame with a Series for every column
- Pandas Merging 101
Please note that this post is not meant to be a replacement for the documentation about aggregationand about groupby, so please read that as well!
回答by jezrael
Question 1
问题 1
How to perform aggregation with pandas ?
如何使用 Pandas 进行聚合?
Expanded aggregation documentation.
扩展的聚合文档。
Aggregating functions are the ones that reduce the dimension of the returned objects. It means output Series/DataFrame have less or same rows like original. Some common aggregating functions are tabulated below:
聚合函数是减少返回对象维度的函数。这意味着输出 Series/DataFrame 与原始行有更少或相同的行。下表列出了一些常见的聚合函数:
Function Description mean() Compute mean of groups sum() Compute sum of group values size() Compute group sizes count() Compute count of group std() Standard deviation of groups var() Compute variance of groups sem() Standard error of the mean of groups describe() Generates descriptive statistics first() Compute first of group values last() Compute last of group values nth() Take nth value, or a subset if n is a list min() Compute min of group values max() Compute max of group values
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6),
'E' : np.random.randint(5, size=6)})
print (df)
A B C D E
0 foo one 2 3 0
1 foo two 4 1 0
2 bar three 2 1 1
3 foo two 1 0 3
4 bar two 3 1 4
5 foo one 2 1 0
Aggregation by filtered columns and cython implemented functions:
按过滤列和cython 实现的函数聚合:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
Aaggregate function is using for all columns without specified in groupby
function, here A, B
columns:
Aaggregate 函数用于所有没有在groupby
函数中指定的A, B
列,这里的列:
df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
You can also specified only some columns used for aggregation in list after groupby
function:
您还可以在groupby
函数后的列表中仅指定用于聚合的某些列:
df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum()
print (df3)
A B C D
0 bar three 2 1
1 bar two 3 1
2 foo one 4 4
3 foo two 5 1
Same results by using function DataFrameGroupBy.agg
:
使用功能相同的结果DataFrameGroupBy.agg
:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
For multiplied functions applied for one column use list of tuple
s - names of new columns and aggregted functions:
对于应用于一列的乘法函数,使用tuple
s列表- 新列和聚合函数的名称:
df4 = (df.groupby(['A', 'B'])['C']
.agg([('average','mean'),('total','sum')])
.reset_index())
print (df4)
A B average total
0 bar three 2.0 2
1 bar two 3.0 3
2 foo one 2.0 4
3 foo two 2.5 5
If want to pass multiple functions is possible pass list
of tuple
s:
如果想通过多种功能,可以通list
的tuple
S:
df5 = (df.groupby(['A', 'B'])
.agg([('average','mean'),('total','sum')]))
print (df5)
C D E
average total average total average total
A B
bar three 2.0 2 1.0 1 1.0 1
two 3.0 3 1.0 1 4.0 4
foo one 2.0 4 2.0 4 0.0 0
two 2.5 5 0.5 1 1.5 3
Then get MultiIndex
in columns:
然后进入MultiIndex
列:
print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
And for converting to columns, flattening MultiIndex
use map
with join
:
并转换成列,压扁MultiIndex
使用map
具有join
:
df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
Another solution is pass list of aggregate functions, then flatten MultiIndex
and for another columns names use str.replace
:
另一个解决方案是传递聚合函数列表,然后展平MultiIndex
并为其他列名称使用str.replace
:
df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
df5.columns = (df5.columns.map('_'.join)
.str.replace('sum','total')
.str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
If want specified each column with aggregated function separately pass dictionary
:
如果要分别指定带有聚合函数的每一列,请传递dictionary
:
df6 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D':'mean'})
.rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
A B C_total D_average
0 bar three 2 1.0
1 bar two 3 1.0
2 foo one 4 2.0
3 foo two 5 0.5
You can pass custom function too:
您也可以传递自定义函数:
def func(x):
return x.iat[0] + x.iat[-1]
df7 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D': func})
.rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
A B C_total D_sum_first_and_last
0 bar three 2 2
1 bar two 3 2
2 foo one 4 4
3 foo two 5 1
Question 2
问题2
No DataFrame after aggregation! What happened?
聚合后没有 DataFrame!发生了什么?
Aggregation by 2 or more columns:
按 2 列或更多列聚合:
df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A B
bar three 2
two 3
foo one 4
two 5
Name: C, dtype: int32
First check Index
and type
of pandas object:
首先检查Index
和type
Pandas对象:
print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
names=['A', 'B'])
print (type(df1))
<class 'pandas.core.series.Series'>
There are 2 solutions how get MultiIndex Series
to columns:
有两种解决方案如何获取MultiIndex Series
列:
- add parameter
as_index=False
- 添加参数
as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
- use
Series.reset_index
:
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
If group by one column:
如果按一列分组:
df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar 5
foo 9
Name: C, dtype: int32
... get Series
with Index
:
...获取Series
与Index
:
print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')
print (type(df2))
<class 'pandas.core.series.Series'>
And solution is same like in MultiIndex Series
:
和解决方案是一样的MultiIndex Series
:
df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
A C
0 bar 5
1 foo 9
df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
A C
0 bar 5
1 foo 9
Question 3
问题 3
How to aggregate mainly strings columns (to list
s, tuple
s, strings with separator
)?
如何主要聚合字符串列(到list
s, tuple
s, strings with separator
)?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
'D' : [1,2,3,2,3,1,2]})
print (df)
A B C D
0 a one three 1
1 c two one 2
2 b three two 3
3 b two two 2
4 a two three 3
5 c one two 1
6 b three one 2
Instead of aggregeta function is possible pass list
, tuple
, set
for converting column:
代替 agregeta 函数是可能的 pass list
, tuple
,set
用于转换列:
df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
Alternative is use GroupBy.apply
:
替代方法是使用GroupBy.apply
:
df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
For converting to strings with separator use .join
only if string column:
.join
仅当字符串列使用分隔符转换为字符串时:
df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
A B
0 a one,two
1 b three,two,three
2 c two,one
If numeric column use lambda function with astype
for converting to string
s:
如果数字列使用 lambda 函数 withastype
转换为string
s:
df3 = (df.groupby('A')['D']
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
Another solution is converting to strings before groupby
:
另一个解决方案是在之前转换为字符串groupby
:
df3 = (df.assign(D = df['D'].astype(str))
.groupby('A')['D']
.agg(','.join).reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
For converting all columns pass no list of column(s) after groupby
.
There is no column D
because automatic exclusion of 'nuisance' columns,
it means all numeric columns are excluded.
为了转换所有列,在groupby
. 没有列,D
因为自动排除 'nuisance' 列,这意味着排除所有数字列。
df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
A B C
0 a one,two three,three
1 b three,two,three two,two,one
2 c two,one one,two
So it's necessary to convert all columns into strings, then get all columns:
所以有必要将所有列转换为字符串,然后获取所有列:
df5 = (df.groupby('A')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df5)
A B C D
0 a one,two three,three 1,3
1 b three,two,three two,two,one 3,2,2
2 c two,one one,two 2,1
Question 4
问题 4
How to aggregate counts?
如何聚合计数?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
A B C D
0 a one three NaN
1 c two NaN 2.0
2 b three NaN 3.0
3 b two two 2.0
4 a two three 3.0
5 c one two NaN
6 b three one 2.0
Function GroupBy.size
for size
of each group:
功能GroupBy.size
为size
各组:
df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
A COUNT
0 a 2
1 b 3
2 c 2
Function GroupBy.count
exclude missing values:
函数GroupBy.count
排除缺失值:
df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
A COUNT
0 a 2
1 b 2
2 c 1
Function should be used fo multiple columns for count non missing values:
函数应该用于多列计数非缺失值:
df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
A B_COUNT C_COUNT D_COUNT
0 a 2 2 1
1 b 3 2 3
2 c 2 1 1
Related function Series.value_counts
return size object containing counts of unique values in descending order so that the first element is the most frequently-occurring element. Excludes NaN
s values by default.
相关函数Series.value_counts
返回包含按降序排列的唯一值计数的大小对象,以便第一个元素是最常出现的元素。NaN
默认情况下排除s 值。
df4 = (df['A'].value_counts()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df4)
A COUNT
0 b 3
1 a 2
2 c 2
If you want same output like using function groupby
+ size
add Series.sort_index
:
如果您想要与使用 function groupby
+ size
add相同的输出Series.sort_index
:
df5 = (df['A'].value_counts()
.sort_index()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df5)
A COUNT
0 a 2
1 b 3
2 c 2
Question 5
问题 5
How to create new column filled by aggregated values?
如何创建由聚合值填充的新列?
Method GroupBy.transform
returns an object that is indexed the same (same size) as the one being grouped
方法GroupBy.transform
返回一个与被分组的对象索引相同(相同大小)的对象
Pandas documentationfor more information.
Pandas文档以获取更多信息。
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6)})
print (df)
A B C D
0 foo one 2 3
1 foo two 4 1
2 bar three 2 1
3 foo two 1 0
4 bar two 3 1
5 foo one 2 1
df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum')
print (df)
A B C D C1 C2 C3 D3 C4 D4
0 foo one 2 3 9 4 9 5 4 4
1 foo two 4 1 9 5 9 5 5 1
2 bar three 2 1 5 2 5 2 2 1
3 foo two 1 0 9 5 9 5 5 1
4 bar two 3 1 5 3 5 2 3 1
5 foo one 2 1 9 4 9 5 4 4
回答by Abhishek R
If you are coming from an R or SQL background here are 3 examples that will teach you everything you need to do aggregation the way you are already familiar with:
如果您有 R 或 SQL 背景,这里有 3 个示例,它们将教您以您已经熟悉的方式进行聚合所需的一切:
Let us first create a Pandas dataframe
让我们首先创建一个 Pandas 数据框
import pandas as pd
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'key2' : ['a','a','b','b','c'],
'value1' : [1,2,2,3,3],
'value2' : [9,8,7,6,5]})
df.head(5)
Here is how the table we created looks like:
这是我们创建的表的样子:
|----------------|-------------|------------|------------|
| key1 | key2 | value1 | value2 |
|----------------|-------------|------------|------------|
| a | a | 1 | 9 |
| a | a | 2 | 8 |
| a | b | 2 | 7 |
| b | b | 3 | 6 |
| a | c | 3 | 5 |
|----------------|-------------|------------|------------|
1. Aggregating With Row Reduction Similar to SQL Group By
1. 类似于SQL的行缩减聚合 Group By
df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
sum_of_value_2=('value2', 'sum'),
count_of_value1=('value1','size')
).reset_index()
df_agg.head(5)
The resulting data table will look like this:
生成的数据表将如下所示:
|----------------|-------------|--------------------|-------------------|---------------------|
| key1 | key2 | mean_of_value1 | sum_of_value2 | count_of_value1 |
|----------------|-------------|--------------------|-------------------|---------------------|
| a | a1 | 1.5 | 17 | 2 |
| a | b1 | 2.0 | 7 | 1 |
| a | c1 | 3.0 | 5 | 1 |
| b | b1 | 3.0 | 6 | 1 |
|----------------|-------------|--------------------|-------------------|---------------------|
The SQL Equivalent of this is:
等效的 SQL 是:
SELECT
key1
,key2
,AVG(value1) AS mean_of_value_1
,SUM(value2) AS sum_of_value_2
,COUNT(*) AS count_of_value1
FROM
df
GROUP BY
key1
,key2
2. Create Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF
)
2. 创建列而不减少行 ( EXCEL - SUMIF, COUNTIF
)
If you want to do a SUMIF, COUNTIF etc like how you would do in Excel where there is no reduction in rows then you need to do this instead.
如果你想做一个 SUMIF、COUNTIF 等,就像你在 Excel 中所做的那样,没有减少行数,那么你需要这样做。
df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
df.head(5)
The resulting data frame will look like this with the same number of rows as the original:
生成的数据框将如下所示,其行数与原始数据相同:
|----------------|-------------|------------|------------|-------------------------|
| key1 | key2 | value1 | value2 | Total_of_value1_by_key1 |
|----------------|-------------|------------|------------|-------------------------|
| a | a | 1 | 9 | 8 |
| a | a | 2 | 8 | 8 |
| a | b | 2 | 7 | 8 |
| b | b | 3 | 6 | 3 |
| a | c | 3 | 5 | 8 |
|----------------|-------------|------------|------------|-------------------------|
3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)
3. 创建 RANK 列 ROW_NUMBER() OVER (PARTITION BY ORDER BY)
Finally, there might be cases where you want to create a Rank column which is the SQL Equivalent of ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 DESC, Data2 ASC)
最后,在某些情况下,您可能希望创建一个 Rank 列,该列是 SQL 等价的 ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 DESC, Data2 ASC)
Here is how you do that.
这是你如何做到的。
df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
.groupby(['key1']) \
.cumcount() + 1
df.head(5)
Note: we make the code multi-line by adding \
in the end of each line.
注意:我们通过\
在每行的末尾添加来使代码多行。
Here is how the resulting data frame looks like:
生成的数据框如下所示:
|----------------|-------------|------------|------------|------------|
| key1 | key2 | value1 | value2 | RN |
|----------------|-------------|------------|------------|------------|
| a | a | 1 | 9 | 4 |
| a | a | 2 | 8 | 3 |
| a | b | 2 | 7 | 2 |
| b | b | 3 | 6 | 1 |
| a | c | 3 | 5 | 1 |
|----------------|-------------|------------|------------|------------|
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
在上面的所有示例中,最终数据表将具有表结构,并且不会具有您可能在其他语法中获得的数据透视结构。
Other aggregating operators:
其他聚合运算符:
mean()
Compute mean of groups
mean()
计算组的平均值
sum()
Compute sum of group values
sum()
计算组值的总和
size()
Compute group sizes
size()
计算组大小
count()
Compute count of group
count()
计算组数
std()
Standard deviation of groups
std()
组的标准差
var()
Compute variance of groups
var()
计算组的方差
sem()
Standard error of the mean of groups
sem()
组均值的标准误
describe()
Generates descriptive statistics
describe()
生成描述性统计
first()
Compute first of group values
first()
计算第一个组值
last()
Compute last of group values
last()
计算组值的最后一个
nth()
Take nth value, or a subset if n is a list
nth()
如果 n 是列表,则取第 n 个值或子集
min()
Compute min of group values
min()
计算组值的最小值
max()
Compute max of group values
max()
计算组值的最大值
Hope this helps.
希望这可以帮助。