Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
Note: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19384532/
Asked by Roman
I have a data frame df and I use several columns from it to groupby:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
This way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to compute each mean. For example, the first group has 8 values, the second has 10, and so on.
In short: How do I get group-wise statistics for a dataframe?
Accepted answer by Boud
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
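As a minimal sketch of what the list form of agg returns, here is the same call on a small toy frame. The data values are made up for illustration and are not from the question; only the column names follow it.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

# Apply both aggregations to every non-key column at once.
stats = df.groupby(['col1', 'col2']).agg(['mean', 'count'])

# The result has two-level columns: (original column, statistic).
print(stats[('col3', 'mean')])
print(stats[('col3', 'count')])
```

Note that count here is computed per column and counts only non-null values, so it can differ from the number of rows in the group when a column contains NaN.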
Answer by Pedro M Duarte
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series), so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per-column basis.
To gain more control over the output, I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
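An alternative that is not part of the original answer: assuming pandas 0.25 or later, named aggregation can produce the same flat column labels in a single agg call. The sketch below uses a small made-up frame, and the output column names (counts, col3_mean, and so on) are simply chosen to mirror the join example.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

# Each keyword is an output column name; each value is (input column, function).
flat = (df
        .groupby(['col1', 'col2'])
        .agg(counts=('col3', 'size'),          # rows per group, NaN included
             col3_mean=('col3', 'mean'),
             col4_median=('col4', 'median'),
             col4_min=('col4', 'min'))
        .reset_index())

print(flat)
```

The join approach remains useful when the individual aggregations come from different sources or need separate post-processing.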
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas drops NaN entries in the mean calculation without telling you about it.
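To make the disclaimer concrete, here is a small sketch on made-up data (the frame below is hypothetical, not the test data from this answer) showing how the group size and the per-column count diverge once a NaN is present:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'A'.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C'],
    'col3': [1.0, np.nan, 3.0, 5.0],
})

gb = df.groupby('col1')
n_rows = gb.size()             # counts every row in the group
n_valid = gb['col3'].count()   # counts only non-null col3 values
mean = gb['col3'].mean()       # NaN entries are skipped

print(n_rows['A'], n_valid['A'], mean['A'])
```

For group 'A' the mean is based on two values even though the group has three rows, which is why a per-column count is the safer number to report next to a mean.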
Answer by Nimesh
We can easily do it by using groupby and count. However, we should remember to call reset_index().
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
Answer by cs95
One Function to Rule Them All: GroupBy.describe
Returns count, mean, std, and other useful statistics per-group.
df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()
# Setup
import numpy as np
import pandas as pd
from IPython.display import display

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
# Use the full option name; the short form 'precision' is ambiguous in recent pandas
with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())
count mean std min 25% 50% 75% max
A B
bar one 1.0 0.40 NaN 0.40 0.40 0.40 0.40 0.40
three 1.0 2.24 NaN 2.24 2.24 2.24 2.24 2.24
two 1.0 -0.98 NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one 2.0 1.36 0.58 0.95 1.15 1.36 1.56 1.76
three 1.0 -0.15 NaN -0.15 -0.15 -0.15 -0.15 -0.15
two 2.0 1.42 0.63 0.98 1.20 1.42 1.65 1.87
To get specific statistics, just select them:
df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]
count mean
A B
bar one 1.0 0.400157
three 1.0 2.240893
two 1.0 -0.977278
foo one 2.0 1.357070
three 1.0 -0.151357
two 2.0 1.423148
describe works for multiple columns (change ['C'] to ['C', 'D'], or remove it altogether, and see what happens; the result is a DataFrame with MultiIndex columns).
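As a rough sketch of the multi-column case, the example below describes two columns at once on a small made-up frame. The final flattening step is an addition for illustration, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'one'],
    'C': rng.standard_normal(4),
    'D': rng.standard_normal(4),
})

desc = df.groupby(['A', 'B'])[['C', 'D']].describe()

# Columns are two-level: (original column, statistic).
n_levels = desc.columns.nlevels
print(desc[('D', 'count')])

# Optional: flatten the labels to 'C_count', 'C_mean', ..., 'D_max'.
desc.columns = ['_'.join(col) for col in desc.columns]
print(desc.columns.tolist())
```

Flattening trades the convenient desc['C'] sub-frame selection for simpler single-level column names, so whether to do it depends on how the result is used downstream.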
You also get different statistics for string data. Here's an example:
df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
with pd.option_context('display.precision', 2):
display(df2.groupby(['A', 'B'])
.describe(include='all')
.dropna(how='all', axis=1))
C D
count mean std min 25% 50% 75% max count unique top freq
A B
bar one 14.0 0.40 5.76e-17 0.40 0.40 0.40 0.40 0.40 14 1 a 14
three 14.0 2.24 4.61e-16 2.24 2.24 2.24 2.24 2.24 14 1 b 14
two 9.0 -0.98 0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98 9 1 c 9
foo one 22.0 1.43 4.10e-01 0.95 0.95 1.76 1.76 1.76 22 2 a 13
three 15.0 -0.15 0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15 15 1 c 15
two 26.0 1.49 4.48e-01 0.98 0.98 1.87 1.87 1.87 26 2 b 15
For more information, see the documentation.
Answer by Mahendra
Create a groupby object and call methods on it, as in the example below:
grp = df.groupby(['col1', 'col2', 'col3'])
grp.max()
grp.mean()
grp.describe()
Answer by Jake Drew
Answer by Ichsan
Please try this code:
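# Assigning the multi-column result of .count() to a single column fails, so use
# transform('count') to broadcast each group's count back onto its rows.
df['count_it'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
df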
I think this code will add a column called 'count_it' that holds the count for each group.