Pandas groupby result into multiple columns

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/35024023/

Time: 2020-09-14 00:34:15  Source: igfitidea

Tags: python, pandas

Asked by AJG519

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.

For example: say I have the following dataframe:

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
  Group  ID  Value
0     A   1     66
1     C   2      2
2     B   3     98
3     A   4     90
4     C   5     85
5     C   6     38
>>> 

I want to group by the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds one of the ID values of the group.

Currently I am able to do this as follows, but I am looking for a cleaner methodology:

First, I create a dataframe with a list of the IDs in each group.

>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group                  
A         [1, 4]     98
B            [3]     76
C      [2, 5, 6]    204
>>> 

And then I use pd.Series to split those up into columns, rename them, and then join it back.

>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>> 
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     98
B        3  NaN  NaN     76
C        2    5    6    204
>>> 

Is there a way to do this without first having to create the list of values?

Answered by unutbu

You could use

id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()

to create id_df without the intermediate result DataFrame.

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)

yields

       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     77
B        3  NaN  NaN     84
C        2    5    6     86
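
To see why the apply/unstack step works, it may help to look at the intermediate object. The following is only a sketch on the same Group/ID data (the name stacked is mine, not from the answer): because the groups have different sizes, apply returns a Series with a two-level index (group, position within the group), and unstack then moves the position level into the columns.

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6]})

# Each group returns a Series indexed 0..n-1, so the combined result
# carries a two-level index: (Group, position within the group).
stacked = df.groupby('Group')['ID'].apply(lambda x: pd.Series(x.values))
print(stacked)

# unstack() moves the inner level (the position) into the columns;
# groups with fewer IDs are padded with NaN.
id_df = stacked.unstack()
print(id_df)
```

The NaN padding is also why the ID columns can come out as floats rather than integers.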

Answered by BMW

Using get_dummies and MultiLabelBinarizer (scikit-learn):

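import pandas as pd
import numpy as np
from sklearn import preprocessing

df = pd.DataFrame()
df['Group'] = ['A', 'C', 'B', 'A', 'C', 'C']
df['ID'] = [1, 2, 3, 4, 5, 6]
df['Value'] = np.random.randint(1, 100, 6)

# `classes` was not defined in the snippet as posted; assumed here to be the unique IDs
classes = df['ID'].unique()
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])

# One indicator column per ID, summed within each group
df2 = pd.get_dummies(df, prefix='', prefix_sep='', columns=['ID']).groupby(by='Group').sum()
# get_dummies labels the indicator columns with strings, so select them by str(ID)
id_cols = [str(c) for c in classes]
df3 = pd.DataFrame(mlb.inverse_transform(df2[id_cols].values), index=df2.index)
# Number the new columns by column count (shape[1]), not row count
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[1])]
pd.concat([df3, df2['Value']], axis=1)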


       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     63
B        3  NaN  NaN     59
C        2    5    6    230
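
For readers without scikit-learn installed, the idea behind this answer can be sketched in plain pandas. This is only an illustration under my own assumptions: the Value numbers are fixed placeholders, and ids_per_group is a hypothetical stand-in for what inverse_transform recovers, namely the labels whose indicator is set in each group's row.

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': [10, 20, 30, 40, 50, 60]})  # illustrative values, not from the post

# One 0/1 indicator column per ID, summed within each group.
df2 = pd.get_dummies(df, prefix='', prefix_sep='', columns=['ID']).groupby('Group').sum()
id_cols = [c for c in df2.columns if c != 'Value']  # the indicator columns

# Plain-pandas stand-in for inverse_transform: keep the labels whose indicator is set.
ids_per_group = df2[id_cols].apply(
    lambda row: [int(c) for c in id_cols if row[c] > 0], axis=1)
print(ids_per_group)
```

Each entry of ids_per_group is the list of IDs for one group, which is the same information the answer then spreads across the ID1, ID2, ID3 columns.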

Answered by Scott Boston

Another way of doing this is to first add a "helper" column to your data, then pivot your dataframe using that column ("ID_Count" in the example below):

Using @unutbu setup:

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')

#Create helper column 
df['ID_Count'] = grp['ID'].cumcount() + 1

#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot(index='Group', columns='ID_Count', values='ID').add_prefix('ID').assign(Value=grp['Value'].sum())

Output:

ID_Count  ID1  ID2  ID3  Value
Group                         
A         1.0  4.0  NaN     77
B         3.0  NaN  NaN     84
C         2.0  5.0  6.0     86
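
The ID columns above come out as floats because the missing slots are filled with NaN. If integer IDs matter, one option is pandas' nullable integer dtype. The snippet below is a minimal sketch of the same cumcount/pivot approach, with fixed placeholder numbers in Value (my assumption, chosen so the sums are easy to check) and an extra astype('Int64') step that is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': [10, 20, 30, 40, 50, 60]})  # illustrative values, not from the post

# Helper column: running position of each row within its group (1, 2, 3, ...).
df['ID_Count'] = df.groupby('Group').cumcount() + 1

# Pivot the IDs into columns, then switch to the nullable integer dtype
# so that empty slots show as <NA> instead of forcing floats.
ids = (df.pivot(index='Group', columns='ID_Count', values='ID')
         .add_prefix('ID')
         .astype('Int64'))

df_out = ids.assign(Value=df.groupby('Group')['Value'].sum())
print(df_out)
```

The conversion is applied before Value is attached, so only the ID columns change dtype.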