pandas: groupby results to multiple columns

Question

Asked by AJG519

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.

For example: say I have the following dataframe:

>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
  Group  ID  Value
0     A   1     66
1     C   2      2
2     B   3     98
3     A   4     90
4     C   5     85
5     C   6     38
>>>

I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.

Currently I am able to do this as follows, but I am looking for a cleaner methodology:

First, I create a dataframe with a list of the IDs in each group.

>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
              ID  Value
Group                  
A         [1, 4]     98
B            [3]     76
C      [2, 5, 6]    204
>>>

And then I use pd.Series to split those up into columns, rename them, and then join it back.

>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>> 
>>> result.join(id_df)[id_cols+['Value']]
       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     98
B        3  NaN  NaN     76
C        2    5    6    204
>>>

Is there a way to do this without first having to create the list of values?

Answer 1

Answered by unutbu

You could use

id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()

to create id_df without the intermediate result DataFrame.

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})

grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)

yields

       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     77
B        3  NaN  NaN     84
C        2    5    6     86
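
To see why the unstack step works, it can help to look at the intermediate Series. The following is a minimal sketch, assuming fixed (hypothetical) values in place of the random ones so the result is reproducible; the variable name stacked is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical fixed values (not from the original post) so no random seed is needed.
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': [10, 20, 30, 40, 50, 60]})

grouped = df.groupby('Group')

# Each group's IDs become a small Series indexed 0, 1, 2, ...;
# apply concatenates them into one Series with a (Group, position) MultiIndex.
stacked = grouped['ID'].apply(lambda x: pd.Series(x.values))
print(stacked)

# unstack moves the inner "position" level into the columns.
# Groups with fewer IDs are padded with NaN, which turns the columns into floats.
id_df = stacked.unstack()
print(id_df)
```

The column labels at this point are the positions 0, 1, 2, which is why the answer renames them to ID1, ID2, ID3 afterwards.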

Answer 2

Answered by BMW

Using get_dummies and MultiLabelBinarizer (scikit-learn):

import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.DataFrame()
df['Group']=['A','C','B','A','C','C']
df['ID']=[1,2,3,4,5,6]
df['Value']=np.random.randint(1,100,6)

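# 'classes' was undefined in the original snippet; the unique IDs are assumed here.
classes = df['ID'].unique()
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])

# One indicator column per ID, summed per group (get_dummies labels these columns as strings).
df2 = pd.get_dummies(df, prefix='', prefix_sep='', columns=['ID']).groupby(by='Group').sum()
id_cols = [str(c) for c in classes]
# Turn each group's 0/1 indicator row back into the tuple of IDs it contains.
df3 = pd.DataFrame(mlb.inverse_transform(df2[id_cols].values.astype(int)), index=df2.index)
# Name the columns by column count (shape[1]), not row count.
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[1])]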
pd.concat([df3, df2['Value']], axis=1)


       ID1  ID2  ID3  Value
Group                      
A        1    4  NaN     63
B        3  NaN  NaN     59
C        2    5    6    230

Answer 3

Answered by Scott Boston

Another way of doing this is to first add a "helper" column to your data, then pivot the dataframe on that helper column (here, "ID_Count"):

Using @unutbu's setup:

import pandas as pd
import numpy as np
np.random.seed(2016)

df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': np.random.randint(1, 100, 6)})
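# Create the groupby object
grp = df.groupby('Group')

# Create a helper column that numbers the rows within each group (1, 2, 3, ...)
df['ID_Count'] = grp['ID'].cumcount() + 1

# Pivot on the helper column and add the summed 'Value' column to the pivoted output.
# Keyword arguments are used because pandas 2.0+ no longer accepts positional ones for pivot.
df_out = df.pivot(index='Group', columns='ID_Count', values='ID').add_prefix('ID').assign(Value=grp['Value'].sum())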

Output:

ID_Count  ID1  ID2  ID3  Value
Group                         
A         1.0  4.0  NaN     77
B         3.0  NaN  NaN     84
C         2.0  5.0  6.0     86
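
The IDs show up as floats (1.0, 4.0) because the NaN padding forces a float dtype. If integer IDs are preferred, one possible variation is to cast the pivoted frame to pandas' nullable Int64 dtype. This is a sketch, assuming that dtype is acceptable downstream; the fixed values and the name wide are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical fixed values (not from the original post) so the result is reproducible.
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                   'ID': [1, 2, 3, 4, 5, 6],
                   'Value': [10, 20, 30, 40, 50, 60]})

grp = df.groupby('Group')

# Helper column: position of each row within its group, starting at 1.
df['ID_Count'] = grp.cumcount() + 1

wide = (df.pivot(index='Group', columns='ID_Count', values='ID')
          .astype('Int64')      # nullable integers: missing slots become <NA> instead of NaN
          .add_prefix('ID')     # column labels 1, 2, 3 -> ID1, ID2, ID3
          .assign(Value=grp['Value'].sum()))
print(wide)
```

The cast happens before the Value column is attached, so only the ID columns are affected.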
