Python: Renaming Column Names in Pandas Groupby Function

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19523277/

Time: 2020-08-19 13:57:43  Source: igfitidea

Renaming Column Names in Pandas Groupby function

python pandas group-by pandas-groupby rename

Asked by Baktaawar

Q1) I want to do a groupby, SQL-style aggregation and rename the output column:

Example dataset:

>>> df
    ID     Region  count
0  100       Asia      2
1  101     Europe      3
2  102         US      1
3  103     Africa      5
4  100     Russia      5
5  101  Australia      7
6  102         US      8
7  104       Asia     10
8  105     Europe     11
9  110     Africa     23

I want to group the observations of this dataset by ID and Region and sum the count for each group. So I used something like this...

>>> print(df.groupby(['ID', 'Region'], as_index=False)['count'].sum())

    ID     Region  count
0  100       Asia      2
1  100     Russia      5
2  101  Australia      7
3  101     Europe      3
4  102         US      9
5  103     Africa      5
6  104       Asia     10
7  105     Europe     11
8  110     Africa     23

By using as_index=False I am able to get "SQL-like" output. My problem is that I am unable to rename the aggregate variable count here. In SQL, if I wanted to do the above, I would write something like this:

select ID, Region, sum(count) as Total_Numbers
from df
group by ID, Region
order by ID, Region

As we can see, it's very easy to rename the aggregate variable count to Total_Numbers in SQL. I wanted to do the same thing in Pandas but was unable to find such an option in the group-by function. Can somebody help?

The second question (more of an observation) is whether...

Q2) Is it possible to directly use column names in Pandas dataframe functions without enclosing them in quotes?

I understand that column names are strings and so have to be inside quotes, but I see that when they are used outside a DataFrame function, as an attribute, they don't need quotes, as in df.ID.sum(). It's only when we use them in a DataFrame function like df.sort() or df.groupby that we have to quote them. This is actually a bit of a pain, as in SQL, SAS, or other languages we simply use the variable name without quoting it. Any suggestions on this?

Kindly reply to both questions (Q1 is the main, Q2 more of an opinion).

Accepted answer by Roman Pekar

For the first question, I think the answer would be:

<your DataFrame>.rename(columns={'count':'Total_Numbers'})

or

<your DataFrame>.columns = ['ID', 'Region', 'Total_Numbers']
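Putting the two steps together, a minimal sketch of the pandas counterpart to the SQL query in the question might look like this. It rebuilds the question's sample data; the variable names df and totals are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': [100, 101, 102, 103, 100, 101, 102, 104, 105, 110],
    'Region': ['Asia', 'Europe', 'US', 'Africa', 'Russia',
               'Australia', 'US', 'Asia', 'Europe', 'Africa'],
    'count': [2, 3, 1, 5, 5, 7, 8, 10, 11, 23],
})

# Group, sum the 'count' column, then rename it in the same chain.
totals = (
    df.groupby(['ID', 'Region'], as_index=False)['count']
      .sum()
      .rename(columns={'count': 'Total_Numbers'})
)
print(totals)
```

Because rename returns a new DataFrame, the original df keeps its count column unchanged.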

As for the second one, I'd say the answer is no. Writing df.ID works because of the Python data model:

Attribute references are translated to lookups in this dictionary, e.g., m.x is equivalent to m.__dict__["x"]
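The quoted rule can be checked on a plain object, and attribute-style access can be compared with bracket access on a DataFrame. This is only an illustrative sketch: the class Point and the small frame are made up, and pandas resolves df.ID through its own attribute hook rather than a literal __dict__ entry, so the quote describes the general mechanism rather than pandas internals.

```python
import pandas as pd


class Point:
    """Hypothetical class used only to illustrate attribute lookup."""

    def __init__(self, x):
        self.x = x


m = Point(3)
# For an ordinary instance, m.x is found in the instance dictionary.
same_lookup = m.x == m.__dict__['x']

df = pd.DataFrame({'ID': [100, 101, 102]})
# Attribute access and bracket access return the same column.
same_column = df.ID.equals(df['ID'])

print(same_lookup, same_column)
```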

Answer by Ted Petrou

The current (as of version 0.20) method for changing column names after a groupby operation is to chain the rename method. See the deprecation note in the documentation for more detail.
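As a rough sketch of that chaining pattern, the snippet below aggregates one column with two functions and renames the resulting columns afterwards. The data and the names C_mean and C_median are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('wwxx'),
    'C': [1.0, 3.0, 2.0, 6.0],
})

# Aggregate first, then chain rename() to give the columns their final names.
result = (
    df.groupby('A')['C']
      .agg(['mean', 'median'])
      .rename(columns={'mean': 'C_mean', 'median': 'C_median'})
)
print(result)
```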

Deprecated Answer as of pandas version 0.20

This is the first result on Google and, although the top answer works, it does not really answer the question. There is a better answer here and a long discussion on GitHub about the full functionality of passing dictionaries to the agg method.

These answers unfortunately do not exist in the documentation, but the general format for grouping, aggregating and then renaming columns uses a dictionary of dictionaries. The keys of the outer dictionary are the names of the columns to be aggregated. The inner dictionaries have the new column names as keys and the aggregating functions as values.

Before we get there, let's create a four-column DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('wwwwxxxx'),
                   'B': list('yyzzyyzz'),
                   'C': np.random.rand(8),
                   'D': np.random.rand(8)})

   A  B         C         D
0  w  y  0.643784  0.828486
1  w  y  0.308682  0.994078
2  w  z  0.518000  0.725663
3  w  z  0.486656  0.259547
4  x  y  0.089913  0.238452
5  x  y  0.688177  0.753107
6  x  z  0.955035  0.462677
7  x  z  0.892066  0.368850

Let's say we want to group by columns A and B, aggregate column C with mean and median, and aggregate column D with max. The following code would do this.

df.groupby(['A', 'B']).agg({'C':['mean', 'median'], 'D':'max'})

            D         C          
          max      mean    median
A B                              
w y  0.994078  0.476233  0.476233
  z  0.725663  0.502328  0.502328
x y  0.753107  0.389045  0.389045
  z  0.462677  0.923551  0.923551

This returns a DataFrame with a hierarchical index. The original question asked about renaming the columns in the same step. This is possible using a dictionary of dictionaries:

df.groupby(['A', 'B']).agg({'C':{'C_mean': 'mean', 'C_median': 'median'}, 
                            'D':{'D_max': 'max'}})

            D         C          
        D_max    C_mean  C_median
A B                              
w y  0.994078  0.476233  0.476233
  z  0.725663  0.502328  0.502328
x y  0.753107  0.389045  0.389045
  z  0.462677  0.923551  0.923551

This renames the columns all in one go but still leaves the hierarchical index, whose top level can be dropped with df.columns = df.columns.droplevel(0).
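On newer pandas releases (0.25 and later), named aggregation is one alternative to the nested-dictionary form that produces flat, renamed columns in a single call. The sketch below assumes such a version is installed and uses made-up numbers in place of the random data above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('wwwwxxxx'),
    'B': list('yyzzyyzz'),
    'C': [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 8.0],
    'D': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Each keyword is the new column name; each value is (source column, function).
result = df.groupby(['A', 'B']).agg(
    C_mean=('C', 'mean'),
    C_median=('C', 'median'),
    D_max=('D', 'max'),
)
print(result)
```

With this form the result has a single level of column names, so no droplevel step is needed.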