Original question: http://stackoverflow.com/questions/39635993/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to convert pandas dataframe rows into columns, based on category?
Asked by Nandhini Anand
I have a pandas data frame with a category variable and some number variables. Something like this:
import pandas as pd

ls = [{'count': 5, 'module': 'payroll', 'id': 2}, {'count': 53, 'module': 'general', 'id': 2}, {'id': 5, 'count': 35, 'module': 'tax'}]
df = pd.DataFrame(ls)  # build the frame from a list of row dicts
The df looks like this:
df
Out[15]:
   count  id   module
0      5   2  payroll
1     53   2  general
2     35   5      tax
I want to convert (is "transpose" the right word?) the module values into columns and group by the id. So something like:
   general_count  id  payroll_count  tax_count
0           53.0   2            5.0        NaN
1            NaN   5            NaN       35.0
One approach to this would be to use apply:
df['payroll_count'] = df.id.apply(lambda x: df[df.id==x][df.module=='payroll'])
However, this suffers from multiple drawbacks:
Costly, and takes too much time
Creates artifacts and empty dataframes that need to be cleaned up.
I sense there's a better way to achieve this with pandas groupby, but I can't find a way to do this same operation more efficiently. Please help.
Answered by jezrael
You can use groupby on two columns: the first (id) becomes the new index and the last (module) becomes the new columns. You then need to aggregate in some way (I use mean), convert the one-column DataFrame to a Series with DataFrame.squeeze (so there is no need to remove the top level of a MultiIndex in the columns afterwards), and reshape with unstack. Finally, apply add_suffix to the column names:
df = df.groupby(['id','module']).mean().squeeze().unstack().add_suffix('_count')
print(df)
module  general_count  payroll_count  tax_count
id
2                53.0            5.0        NaN
5                 NaN            NaN       35.0
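The chained one-liner above can be easier to follow when split into steps. The following is a minimal sketch on the question's sample data, not the answerer's exact code: the intermediate names (rows, grouped, counts, wide) are illustrative, and selecting the "count" column by name is swapped in for squeeze(), since squeeze() returns a scalar rather than a Series when the grouped frame has exactly one row.

```python
import pandas as pd

# Sample data from the question.
rows = [
    {"count": 5, "module": "payroll", "id": 2},
    {"count": 53, "module": "general", "id": 2},
    {"count": 35, "module": "tax", "id": 5},
]
df = pd.DataFrame(rows)

# Step 1: one row per (id, module) pair; mean() collapses any duplicates.
grouped = df.groupby(["id", "module"]).mean()

# Step 2: turn the one-column frame into a Series.
# Assumption: selecting by name replaces squeeze() from the answer.
counts = grouped["count"]

# Step 3: move the inner index level (module) into the columns.
wide = counts.unstack()

# Step 4: label the new columns.
wide = wide.add_suffix("_count")
print(wide)
```

Pairs of id and module that do not occur in the data end up as NaN, which is why the counts become floats.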
Another solution uses pivot; you then need to flatten the MultiIndex in the columns with a list comprehension:
df = df.pivot(index='id', columns='module')
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
print(df)
    general_count  payroll_count  tax_count
id
2            53.0            5.0        NaN
5             NaN            NaN       35.0
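One difference between the two solutions is worth illustrating: pivot does not aggregate, so it raises a ValueError if the same (id, module) pair appears more than once, whereas the groupby version averages the duplicates. The sketch below assumes a hypothetical fourth row that is not in the question's data, and uses pivot_table with aggfunc="mean" as one possible way to get the same averaging behaviour.

```python
import pandas as pd

rows = [
    {"count": 5, "module": "payroll", "id": 2},
    {"count": 53, "module": "general", "id": 2},
    {"count": 35, "module": "tax", "id": 5},
    {"count": 7, "module": "tax", "id": 5},  # hypothetical duplicate (id, module) pair
]
df = pd.DataFrame(rows)

# pivot cannot reshape when an (index, columns) pair is duplicated.
try:
    df.pivot(index="id", columns="module", values="count")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates duplicates; passing values="count" as a single
# name keeps the columns flat, so no MultiIndex clean-up is needed.
wide = df.pivot_table(index="id", columns="module", values="count", aggfunc="mean")
wide = wide.add_suffix("_count")
print(wide)
```

Here the two tax rows for id 5 are averaged to 21.0. If duplicates should be treated as an error instead, plain pivot is the stricter choice.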
Answered by Zero
You could use set_index and unstack:
In [2]: df.set_index(['id','module'])['count'].unstack().add_suffix('_count').reset_index()
Out[2]:
module  id  general_count  payroll_count  tax_count
0        2           53.0            5.0        NaN
1        5            NaN            NaN       35.0
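In the output above, the leftover "module" label in the header row is the name of the columns axis, not a real column. As a small optional sketch on the question's sample data, the same chain can be followed by rename_axis(None, axis=1) to drop that label; this extra step is an addition and not part of the original answer.

```python
import pandas as pd

rows = [
    {"count": 5, "module": "payroll", "id": 2},
    {"count": 53, "module": "general", "id": 2},
    {"count": 35, "module": "tax", "id": 5},
]
df = pd.DataFrame(rows)

wide = (
    df.set_index(["id", "module"])["count"]  # Series indexed by (id, module)
    .unstack()                               # module values become columns
    .add_suffix("_count")                    # general -> general_count, etc.
    .reset_index()                           # turn id back into a regular column
    .rename_axis(None, axis=1)               # drop the "module" columns-axis name
)
print(wide)
```

Unlike the groupby version, this chain does not aggregate, so it assumes every (id, module) pair is unique.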