Python: Renaming Column Names in Pandas Groupby function
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19523277/
Asked by Baktaawar
Q1) I want to do a SQL-style groupby aggregation and rename the output column:
Example dataset:
>>> df
ID Region count
0 100 Asia 2
1 101 Europe 3
2 102 US 1
3 103 Africa 5
4 100 Russia 5
5 101 Australia 7
6 102 US 8
7 104 Asia 10
8 105 Europe 11
9 110 Africa 23
I want to group the observations of this dataset by ID and Region and sum the count for each group. So I used something like this:
>>> print(df.groupby(['ID','Region'],as_index=False)['count'].sum())
ID Region count
0 100 Asia 2
1 100 Russia 5
2 101 Australia 7
3 101 Europe 3
4 102 US 9
5 103 Africa 5
6 104 Asia 10
7 105 Europe 11
8 110 Africa 23
On using as_index=False I am able to get "SQL-like" output. My problem is that I am unable to rename the aggregate variable count here. In SQL, if I wanted to do the above, I would write something like this:
select ID, Region, sum(count) as Total_Numbers
from df
group by ID, Region
order by ID, Region
As we see, it's very easy for me to rename the aggregate variable count to Total_Numbers in SQL. I wanted to do the same thing in Pandas but was unable to find such an option in the groupby function. Can somebody help?
The second question (more of an observation) is whether...
Q2) Is it possible to directly use column names in Pandas dataframe functions without enclosing them in quotes?
I understand that the variable names are strings, so they have to be inside quotes, but I see that if we use them outside a DataFrame function, as an attribute, we don't need the quotes, as in df.ID.sum(). It's only when we use them in a DataFrame function like df.sort() or df.groupby that we have to put them inside quotes. This is actually a bit of a pain, as in SQL, SAS, or other languages we simply use the variable name without quoting it. Any suggestion on this?
Kindly reply to both questions (Q1 is the main one, Q2 is more of an opinion).
Accepted answer by Roman Pekar
For the first question, I think the answer would be:
<your DataFrame>.rename(columns={'count':'Total_Numbers'})
or
<your DataFrame>.columns = ['ID', 'Region', 'Total_Numbers']
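As a minimal sketch of how the first suggestion fits the question's data, the grouped sum can be chained directly into rename. The frame below is rebuilt from the example dataset in the question; the variable name totals is only illustrative.

```python
import pandas as pd

# Example dataset from the question
df = pd.DataFrame({
    "ID": [100, 101, 102, 103, 100, 101, 102, 104, 105, 110],
    "Region": ["Asia", "Europe", "US", "Africa", "Russia",
               "Australia", "US", "Asia", "Europe", "Africa"],
    "count": [2, 3, 1, 5, 5, 7, 8, 10, 11, 23],
})

# Group by ID and Region, sum the count column, then rename the result column
totals = (
    df.groupby(["ID", "Region"], as_index=False)["count"]
      .sum()
      .rename(columns={"count": "Total_Numbers"})
)
print(totals)
```

Here rename returns a new DataFrame, so the original df is left unchanged.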
As for the second one, I'd say the answer would be no. It's possible to use it like 'df.ID' because of the Python data model:
Attribute references are translated to lookups in this dictionary, e.g., m.x is equivalent to m.__dict__["x"]
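A small sketch of what attribute access does and does not give you; the two-row frame is made up for illustration. Note that a column named count, as in the question, cannot be reached as an attribute, because df.count is already a DataFrame method.

```python
import pandas as pd

# Tiny illustrative frame (not the full dataset from the question)
df = pd.DataFrame({"ID": [100, 101], "count": [2, 3]})

# Attribute access and quoted access return the same column
same = df.ID.equals(df["ID"])

# df.count is the DataFrame.count method, not the 'count' column
count_attr_is_method = callable(df.count)

# The quoted form always refers to the column
count_total = df["count"].sum()

print(same, count_attr_is_method, count_total)
```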
Answer by Ted Petrou
The current (as of version 0.20) method for changing column names after a groupby operation is to chain the rename method. See this deprecation note in the documentation for more detail.
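A minimal sketch of chaining rename after a groupby aggregation, assuming one aggregating function per column. The frame uses the same A, B, C, D layout as the example further down; the seeded generator is an assumption added here so the values are reproducible.

```python
import numpy as np
import pandas as pd

# Seeded random values (the seed is an illustrative choice)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": list("wwwwxxxx"),
    "B": list("yyzzyyzz"),
    "C": rng.random(8),
    "D": rng.random(8),
})

# Aggregate one function per column, then rename the resulting columns
result = (
    df.groupby(["A", "B"])
      .agg({"C": "mean", "D": "max"})
      .rename(columns={"C": "C_mean", "D": "D_max"})
)
print(result)
```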
Deprecated Answer as of pandas version 0.20
This is the first result in Google, and although the top answer works, it does not really answer the question. There is a better answer here and a long discussion on GitHub about the full functionality of passing dictionaries to the agg method.
Unfortunately, these answers do not exist in the documentation, but the general format for grouping, aggregating, and then renaming columns uses a dictionary of dictionaries. The keys of the outer dictionary are the names of the columns to be aggregated. The inner dictionaries have the new column names as keys and the aggregating functions as values.
Before we get there, let's create a four-column DataFrame.
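import numpy as np
import pandas as pd

# Two grouping keys (A, B) and two columns of random values (C, D)
df = pd.DataFrame({'A': list('wwwwxxxx'),
                   'B': list('yyzzyyzz'),
                   'C': np.random.rand(8),
                   'D': np.random.rand(8)})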
A B C D
0 w y 0.643784 0.828486
1 w y 0.308682 0.994078
2 w z 0.518000 0.725663
3 w z 0.486656 0.259547
4 x y 0.089913 0.238452
5 x y 0.688177 0.753107
6 x z 0.955035 0.462677
7 x z 0.892066 0.368850
Let's say we want to group by columns A, B and aggregate column C with mean and median and aggregate column D with max. The following code would do this.
df.groupby(['A', 'B']).agg({'C':['mean', 'median'], 'D':'max'})
D C
max mean median
A B
w y 0.994078 0.476233 0.476233
z 0.725663 0.502328 0.502328
x y 0.753107 0.389045 0.389045
z 0.462677 0.923551 0.923551
This returns a DataFrame with a hierarchical index. The original question asked about renaming the columns in the same step. This is possible using a dictionary of dictionaries:
df.groupby(['A', 'B']).agg({'C':{'C_mean': 'mean', 'C_median': 'median'},
'D':{'D_max': 'max'}})
D C
D_max C_mean C_median
A B
w y 0.994078 0.476233 0.476233
z 0.725663 0.502328 0.502328
x y 0.753107 0.389045 0.389045
z 0.462677 0.923551 0.923551
This renames the columns all in one go but still leaves the hierarchical index, whose top level can be dropped with df.columns = df.columns.droplevel(0).
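Since the dictionary-of-dictionaries form is the deprecated one, here is a hedged sketch of two ways to reach the same flat column names without it. Neither comes from the answer above: the first joins the two column levels after a list-based agg, and the second uses named aggregation, which pandas added in version 0.25. The seeded data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Seeded random values (the seed is an illustrative choice)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": list("wwwwxxxx"),
    "B": list("yyzzyyzz"),
    "C": rng.random(8),
    "D": rng.random(8),
})

# Option 1: aggregate with lists of functions, then join the two column levels
stacked = df.groupby(["A", "B"]).agg({"C": ["mean", "median"], "D": ["max"]})
flat = stacked.copy()
flat.columns = ["_".join(col) for col in stacked.columns]

# Option 2: named aggregation (pandas 0.25+), new_name=(column, function)
named = df.groupby(["A", "B"]).agg(
    C_mean=("C", "mean"),
    C_median=("C", "median"),
    D_max=("D", "max"),
)

print(flat)
print(named)
```

Both results carry the columns C_mean, C_median and D_max on a single level, so no droplevel step is needed afterwards.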