Original URL: http://stackoverflow.com/questions/19078325/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Naming returned columns in Pandas aggregate function?
Asked by David Chouinard
I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't seem to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.
This comes very close, but the data structure returned has nested column headings:
data.groupby("Country").agg(
    {"column1": {"foo": sum}, "column2": {"mean": np.mean, "std": np.std}})
(i.e., I want to take the mean and std of column2, but return those columns as "mean" and "std")
What am I missing?
Accepted answer by unutbu
This will drop the outermost level from the hierarchical column index:
df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)
If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:
df.columns = ["_".join(x) for x in df.columns.ravel()]
For example:
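import pandas as pd
# Note: pandas.rpy was removed in later pandas releases; on a current install
# the Loblolly data has to be loaded another way (for example via rpy2).
import pandas.rpy.common as com
import numpy as np
data = com.load_data('Loblolly')
print(data.head())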
# height age Seed
# 1 4.51 3 301
# 15 10.89 5 301
# 29 28.72 10 301
# 43 41.74 15 301
# 57 52.70 20 301
df = data.groupby('Seed').agg(
    {'age': ['sum'],
     'height': ['mean', 'std']})
print(df.head())
# age height
# sum std mean
# Seed
# 301 78 22.638417 33.246667
# 303 78 23.499706 34.106667
# 305 78 23.927090 35.115000
# 307 78 22.222266 31.328333
# 309 78 23.132574 33.781667
df.columns = df.columns.droplevel(0)
print(df.head())
yields
sum std mean
Seed
301 78 22.638417 33.246667
303 78 23.499706 34.106667
305 78 23.927090 35.115000
307 78 22.222266 31.328333
309 78 23.132574 33.781667
Alternatively, to keep the first level of the index:
df = data.groupby('Seed').agg(
    {'age': ['sum'],
     'height': ['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]
yields
age_sum height_std height_mean
Seed
301 78 22.638417 33.246667
303 78 23.499706 34.106667
305 78 23.927090 35.115000
307 78 22.222266 31.328333
309 78 23.132574 33.781667
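The flattening step can be tried without the R dataset. The following is a minimal sketch, assuming a small made-up frame (the values are hypothetical and only stand in for Loblolly). It iterates over the column MultiIndex directly, which yields one tuple per column, instead of calling ravel():

```python
import pandas as pd

# Hypothetical toy data standing in for the Loblolly dataset used above.
data = pd.DataFrame({
    "Seed": [301, 301, 303, 303],
    "age": [3, 5, 3, 5],
    "height": [4.0, 10.0, 5.0, 11.0],
})

df = data.groupby("Seed").agg({"age": ["sum"], "height": ["mean", "std"]})

# Iterating over a MultiIndex yields tuples such as ("height", "mean"),
# so each pair can be joined into a single flat label.
df.columns = ["_".join(col) for col in df.columns]

print(df.columns.tolist())
```

Joining both levels keeps the labels unique even when the same function is applied to several columns, which droplevel(0) alone would not guarantee.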
Answer by Gadi Oron
If you want behavior similar to JMP, with column titles that keep all the information from the multi-index, you can use:
newidx = []
for (n1, n2) in df.columns.ravel():
    newidx.append("%s-%s" % (n1, n2))
df.columns = newidx
It will change your dataframe from:
I V
mean std first
V
4200.0 25.499536 31.557133 4200.0
4300.0 25.605662 31.678046 4300.0
4400.0 26.679005 32.919996 4400.0
4500.0 26.786458 32.811633 4500.0
to
I-mean I-std V-first
V
4200.0 25.499536 31.557133 4200.0
4300.0 25.605662 31.678046 4300.0
4400.0 26.679005 32.919996 4400.0
4500.0 26.786458 32.811633 4500.0
Answer by joelostblom
For pandas >= 0.25
The ability to name the returned aggregate columns was reintroduced in pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func')). Detailed example from the corresponding PR:
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [3]: df
Out[3]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [4]: df.groupby('kind').agg(min_height=('height', 'min'),
max_weight=('weight', 'max'))
Out[4]:
min_height max_weight
kind
cat 9.1 9.9
dog 6.0 198.0
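Applied to the shape of the original question, the same syntax could look like the sketch below. The frame and its values are made up for illustration; only the column names Country, column1 and column2 come from the question. pd.NamedAgg is the explicit spelling of the (column, aggfunc) tuple.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
data = pd.DataFrame({
    "Country": ["CA", "CA", "US", "US"],
    "column1": [1, 2, 3, 4],
    "column2": [10.0, 20.0, 30.0, 50.0],
})

result = data.groupby("Country").agg(
    foo=("column1", "sum"),    # keyword = output name, tuple = (column, function)
    mean=("column2", "mean"),
    std=pd.NamedAgg(column="column2", aggfunc="std"),  # equivalent explicit form
)

print(result)
```

The result has flat columns foo, mean and std, so no droplevel or rename step is needed afterwards.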
Multiple lambda expressions can also be used, both with this syntax and with the two-step rename syntax I suggested earlier (below), as per a later PR. Again, copying from the example in the PR:
In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})
In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})
Out[3]:
B
<lambda> <lambda 1>
A
a 0 1
and then .rename(), or in one go:
In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))
Out[4]:
b c
A
a 0 1
For pandas < 0.25
The currently accepted answer by unutbu describes a great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.
Series:
FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
DataFrames:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.
# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': range(5),
'C': range(5)})
# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
.agg({'B': 'sum', 'C': 'min'})
.rename(columns={'B': 'foo', 'C': 'bar'})
)
Please see the 0.20 changelog for additional details.
Update 2017-01-03 in response to @JunkMechanic's comment.
With the old-style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:
>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})
B
max min
A
1 2 0
2 4 3
Multiple functions can also be passed to a single column as a list:
>>> df.groupby('A').agg({'B': [np.min, np.max]})
B
amin amax
A
1 0 2
2 3 4
However, this does not work with lambda functions, since they are anonymous and all report the name <lambda>, which causes a name collision:
>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max()]})
SpecificationError: Function names must be unique, found multiple named <lambda>
To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:
>>> def my_min(x):
...     return x.min()
>>> def my_max(x):
...     return x.max()
>>> df.groupby('A').agg({'B': [my_min, my_max]})
B
my_min my_max
A
1 0 2
2 3 4
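If defining a full function for each aggregation feels heavy, one possible variation is to attach a name to each lambda. This is a sketch built on the assumption that pandas takes the column label from the callable's __name__ attribute, which matches the my_min / my_max labels shown above; the helper named() is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": range(5)})

def named(name, func):
    """Hypothetical helper: give func a custom __name__ and return it."""
    func.__name__ = name
    return func

result = df.groupby("A").agg(
    {"B": [named("my_min", lambda x: x.min()),
           named("my_max", lambda x: x.max())]}
)

print(result.columns.tolist())
```

Because the two callables no longer share the name <lambda>, they should not collide, and the second column level carries the chosen labels.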
Answer by udothemath1984
Inspired by @Joel Ostblom
For those who already have a working dictionary for the old-style aggregation, you can use or modify the following code to adapt it to newer versions, separating the aggregation and renaming parts. Be careful with nested dictionaries that contain more than one item.
def agg_translate_agg_rename(input_agg_dict):
    agg_dict = {}
    rename_dict = {}
    for k, v in input_agg_dict.items():
        if len(v) == 1:
            # Single entry: keep the column name as the key
            agg_dict[k] = list(v.values())[0]
            rename_dict[k] = list(v.keys())[0]
        else:
            # Several entries: number the keys to keep them unique
            updated_index = 1
            for nested_dict_k, nested_dict_v in v.items():
                modified_key = k + "_" + str(updated_index)
                agg_dict[modified_key] = nested_dict_v
                rename_dict[modified_key] = nested_dict_k
                updated_index += 1
    return agg_dict, rename_dict

one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}}
agg, rename = agg_translate_agg_rename(one_dict)
We get
agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'}
rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}
Please let me know if there is a smarter way to do it. Thanks.
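One possible alternative, on pandas >= 0.25, is to translate the old nested dictionary straight into keyword arguments for named aggregation. The sketch below is not from the original answer: the helper nested_to_named is hypothetical, and the sample data is made up to match the column names in one_dict.

```python
import pandas as pd

def nested_to_named(nested):
    """Hypothetical helper: {"col": {"new": func}} -> {"new": ("col", func)}."""
    named = {}
    for column, renames in nested.items():
        for new_name, func in renames.items():
            if new_name in named:
                # Output names become column labels, so they must be unique.
                raise ValueError(f"duplicate output name: {new_name!r}")
            named[new_name] = (column, func)
    return named

one_dict = {"column1": {"foo": "sum"}, "column2": {"mean": "mean", "std": "std"}}
named = nested_to_named(one_dict)

# Made-up data with the columns referenced by one_dict.
data = pd.DataFrame({
    "Country": ["CA", "CA", "US", "US"],
    "column1": [1, 2, 3, 4],
    "column2": [10.0, 20.0, 30.0, 50.0],
})

result = data.groupby("Country").agg(**named)
print(result)
```

This avoids the intermediate column2_1 / column2_2 keys, since each output name maps directly to its source column and function.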
Answer by user3780389
I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place (e.g. as is done with tidyverse's summarize in R), but a work-around in pandas for now is to create the new columns with the desired names via assign before doing the aggregation:
data.assign(
f=data['column1'],
mean=data['column2'],
std=data['column2']
).groupby('Country').agg(dict(f=sum, mean=np.mean, std=np.std)).reset_index()
(Using reset_index turns 'Country', 'f', 'mean', and 'std' all into regular columns with a separate integer index.)
Answer by saneryee
For a dataframe like this one, where the column names have two levels:
shop_id item_id date_block_num item_cnt_day
target
0 0 30 1 31
we can use this code:
df.columns = [col[0] if col[-1]=='' else col[-1] for col in df.columns.values]
The result is:
shop_id item_id date_block_num target
0 0 30 1 31
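A frame of this shape typically appears after aggregating with a list of functions and then calling reset_index, which leaves an empty string in the second level for the former index columns. The following sketch reproduces that situation with made-up sales data (the values and the variable name sales are assumptions) and applies the same idea of picking the non-empty label:

```python
import pandas as pd

# Hypothetical sales data; only the column names follow the example above.
sales = pd.DataFrame({
    "shop_id": [0, 0, 1],
    "item_id": [30, 30, 31],
    "item_cnt_day": [1, 2, 5],
})

df = (sales.groupby(["shop_id", "item_id"])
           .agg({"item_cnt_day": ["sum"]})
           .reset_index())
# Columns are now ("shop_id", ""), ("item_id", ""), ("item_cnt_day", "sum").

# Keep the top label when the bottom one is empty, otherwise the bottom one.
df.columns = [top if bottom == "" else bottom for top, bottom in df.columns]

print(df.columns.tolist())
```

Note that keeping only the bottom label can produce duplicate names if the same function is applied to more than one column.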