Python: Naming returned columns in Pandas aggregate function?

Question

Asked by David Chouinard

I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't seem to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.

This comes very close, but the data structure returned has nested column headings:

data.groupby("Country").agg(
        {"column1": {"foo": sum}, "column2": {"mean": np.mean, "std": np.std}})

(i.e., I want to take the mean and std of column2, but return those columns as "mean" and "std")

What am I missing?

Answer 1

Accepted answer by unutbu

This will drop the outermost level from the hierarchical column index:

df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

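import pandas as pd
# Note: pandas.rpy was removed in pandas 0.20; on newer versions the
# Loblolly dataset has to be loaded another way (e.g. via the rpy2 project).
import pandas.rpy.common as com
import numpy as np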

data = com.load_data('Loblolly')
print(data.head())
#     height  age Seed
# 1     4.51    3  301
# 15   10.89    5  301
# 29   28.72   10  301
# 43   41.74   15  301
# 57   52.70   20  301

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
print(df.head())
#       age     height           
#       sum        std       mean
# Seed                           
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667

df.columns = df.columns.droplevel(0)
print(df.head())

yields

      sum        std       mean
Seed                           
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

      age_sum   height_std  height_mean
Seed                           
301        78    22.638417    33.246667
303        78    23.499706    34.106667
305        78    23.927090    35.115000
307        78    22.222266    31.328333
309        78    23.132574    33.781667
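
Both options can be tried on a small frame. The following is only a sketch: the toy data is made up to stand in for the Loblolly dataset, and it uses columns.to_flat_index() (available from pandas 0.24) in place of ravel() to get the (outer, inner) tuples.

```python
import pandas as pd

# Hypothetical toy data standing in for the Loblolly dataset.
data = pd.DataFrame({
    "Seed": [301, 301, 303, 303],
    "age": [3, 5, 3, 5],
    "height": [4.0, 10.0, 5.0, 11.0],
})

df = data.groupby("Seed").agg({"age": ["sum"], "height": ["mean", "std"]})

# Option 1: keep only the innermost level (the function names).
inner_only = df.copy()
inner_only.columns = inner_only.columns.droplevel(0)

# Option 2: join both levels into a single label per column.
joined = df.copy()
joined.columns = ["_".join(pair) for pair in joined.columns.to_flat_index()]

print(list(inner_only.columns))  # ['sum', 'mean', 'std']
print(list(joined.columns))      # ['age_sum', 'height_mean', 'height_std']
```

Option 1 can produce duplicate labels when two columns share a function name, so option 2 is the safer choice when several columns are aggregated the same way.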

Answer 2

Answered by Gadi Oron

If you want behavior similar to JMP, with column titles that keep all the information from the multi-index, you can use:

newidx = []
for (n1,n2) in df.columns.ravel():
    newidx.append("%s-%s" % (n1,n2))
df.columns=newidx

It will change your dataframe from:

    I                       V
    mean        std         first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

to

    I-mean      I-std       V-first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

Answer 3

Answered by joelostblom

For pandas >= 0.25

The functionality to name returned aggregate columns has been reintroduced in the master branch and is targeted for pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func')). Detailed example from the PR linked above:

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})
   ...:

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), 
                               max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
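
Keyword arguments only allow output names that are valid Python identifiers. As a sketch of one way around that, assuming pandas 0.25 or later, the same specification can be built as a dict and unpacked; pd.NamedAgg is the explicit form of the (column, function) tuple. The output names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"kind": ["cat", "dog", "cat", "dog"],
                   "height": [9.1, 6.0, 9.5, 34.0],
                   "weight": [7.9, 7.5, 9.9, 198.0]})

# Map each output name to (source column, aggregation).
# Names with spaces work because the dict is unpacked with **.
spec = {
    "min height": ("height", "min"),
    "mean weight": pd.NamedAgg(column="weight", aggfunc="mean"),
}

result = df.groupby("kind").agg(**spec)
print(result)
```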

It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier (below) as per this PR. Again, copying from the example in the PR:

In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})
Out[3]:
           B
  <lambda_0> <lambda_1>
A
a          0          1

and then .rename(), or in one go:

In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))
Out[4]:
   b  c
A
a  0  1

For pandas < 0.25

The currently accepted answer by unutbu describes a great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

Series:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
    .agg({'B': 'sum', 'C': 'min'})
    .rename(columns={'B': 'foo', 'C': 'bar'})
)
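
For comparison, here is a sketch that runs the two-step rename on the sample frame and sets it beside the keyword form available from pandas 0.25. The variable names renamed and named are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": range(5),
                   "C": range(5)})

# Two-step form: aggregate first, then rename the result columns.
renamed = (
    df.groupby("A")
      .agg({"B": "sum", "C": "min"})
      .rename(columns={"B": "foo", "C": "bar"})
)

# Keyword form (pandas >= 0.25): name and define each output in one place.
named = df.groupby("A").agg(foo=("B", "sum"), bar=("C", "min"))

print(renamed)
print(named)
```

Both frames have the columns foo and bar, indexed by A.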

Please see the 0.20 changelog for additional details.

Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})

    B    
  max min
A        
1   2   0
2   4   3

Multiple functions can also be passed to a single column as a list:

>>> df.groupby('A').agg({'B': [np.min, np.max]})

     B     
  amin amax
A          
1    0    2
2    3    4

However, this does not work with lambda functions, since they are anonymous and all return <lambda>, which causes a name collision:

>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max()]})
SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined beforehand instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

>>> def my_min(x):
...     return x.min()

>>> def my_max(x):
...     return x.max()

>>> df.groupby('A').agg({'B': [my_min, my_max]})

       B       
  my_min my_max
A              
1      0      2
2      3      4

Answer 4

Answered by udothemath1984

Inspired by @Joel Ostblom.

If you already have a working nested dictionary from the old aggregation syntax, you can use or modify the following code to split it into separate aggregation and renaming dictionaries for newer pandas versions. Note how the nested dictionary is handled when a column has more than one item.

def agg_translate_agg_rename(input_agg_dict):
    agg_dict = {}
    rename_dict = {}
    for k, v in input_agg_dict.items():
        if len(v) == 1:
            agg_dict[k] = list(v.values())[0]
            rename_dict[k] = list(v.keys())[0]
        else:
            updated_index = 1
            for nested_dict_k, nested_dict_v in v.items():
                modified_key = k + "_" + str(updated_index)
                agg_dict[modified_key] = nested_dict_v
                rename_dict[modified_key] = nested_dict_k
                updated_index += 1
    return agg_dict, rename_dict

one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}}
agg, rename = agg_translate_agg_rename(one_dict)

We get

agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'}
rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}
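
On pandas 0.25 or later, one alternative is to translate the old nested dictionary straight into the (column, function) pairs that named aggregation expects. This is a minimal sketch, not part of the answer above; nested_to_named is a hypothetical helper name.

```python
def nested_to_named(nested_spec):
    """Convert {column: {new_name: func}} into {new_name: (column, func)}."""
    named = {}
    for column, renames in nested_spec.items():
        for new_name, func in renames.items():
            # Output names must be unique across all source columns.
            if new_name in named:
                raise ValueError(f"duplicate output name: {new_name!r}")
            named[new_name] = (column, func)
    return named


one_dict = {"column1": {"foo": "sum"}, "column2": {"mean": "mean", "std": "std"}}
named_spec = nested_to_named(one_dict)
print(named_spec)
# {'foo': ('column1', 'sum'), 'mean': ('column2', 'mean'), 'std': ('column2', 'std')}
```

The result can then be unpacked into the aggregation call, for example data.groupby("Country").agg(**named_spec), assuming data has the columns named in the dictionary.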

Please let me know if there is a smarter way to do it. Thanks.

Answer 5

Answered by user3780389

I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place (e.g. as is done with tidyverse's summarize in R), but a work-around in pandas for now is to create the new columns with desired names via assign before doing the aggregation:

data.assign(
    f=data['column1'],
    mean=data['column2'],
    std=data['column2']
).groupby('Country').agg(dict(f=sum, mean=np.mean, std=np.std)).reset_index()

(Using reset_index turns 'Country', 'f', 'mean', and 'std' all into regular columns with a separate integer index.)
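
A runnable sketch of the same idea on made-up data follows. The frame contents are hypothetical, and the aggregations are given as strings rather than sum, np.mean and np.std, which is an assumption made to keep the example free of warnings on recent pandas versions.

```python
import pandas as pd

# Hypothetical data with the column names used in the question.
data = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "column1": [1, 2, 3, 4],
    "column2": [10.0, 20.0, 30.0, 50.0],
})

result = (
    # Copy each source column under the name wanted in the output ...
    data.assign(f=data["column1"], mean=data["column2"], std=data["column2"])
        .groupby("Country")
        # ... then aggregate each copy with a single function.
        .agg({"f": "sum", "mean": "mean", "std": "std"})
        .reset_index()
)
print(result)
```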

Answer 6

Answered by saneryee

For a dataframe like this one, where the column names have two levels:

 shop_id  item_id   date_block_num item_cnt_day       
                                  target              
0   0       30          1            31

We can use this code:

df.columns = [col[0] if col[-1]=='' else col[-1] for col in df.columns.values]

The result is:

 shop_id  item_id   date_block_num target              
0   0       30          1            31
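
To try this without the original data, the two-level header can be built by hand. This sketch assumes the empty second-level labels are empty strings, as they are after reset_index on a frame with multi-level columns.

```python
import pandas as pd

# Rebuild the two-level header shown above; '' marks an empty lower level.
columns = pd.MultiIndex.from_tuples([
    ("shop_id", ""),
    ("item_id", ""),
    ("date_block_num", ""),
    ("item_cnt_day", "target"),
])
df = pd.DataFrame([[0, 30, 1, 31]], columns=columns)

# Keep the lower label when there is one, otherwise fall back to the upper one.
df.columns = [top if bottom == "" else bottom for top, bottom in df.columns]

print(list(df.columns))  # ['shop_id', 'item_id', 'date_block_num', 'target']
```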
