pandas: group a DataFrame by multiple columns and attach the result to the DataFrame

Question

Asked by user3591836

This is similar to "Attach a calculated column to an existing dataframe"; however, that solution doesn't work when grouping by more than one column in pandas v0.14.

For example:

import pandas as pd

df = pd.DataFrame([
    [1, 1, 1],
    [1, 2, 1],
    [1, 2, 2],
    [1, 3, 1],
    [2, 1, 1]],
    columns=['id', 'country', 'source'])

The following calculation works:

df.groupby(['id','country'])['source'].apply(lambda x: x.unique().tolist())


0       [1]
1    [1, 2]
2    [1, 2]
3       [1]
4       [1]
Name: source, dtype: object

But assigning the output to a new column results in an error:

df['source_list'] = df.groupby(['id','country'])['source'].apply(
                               lambda x: x.unique().tolist())

TypeError: incompatible index of inserted column with frame index
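
One way to see where the mismatch comes from, sketched under the assumption of a recent pandas version (the question targets v0.14, whose behaviour may differ): the grouped result has one entry per (id, country) group and is indexed by those keys, while df has one row per original record. The name `grouped` is illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

# One list of unique sources per (id, country) group
grouped = df.groupby(['id', 'country'])['source'].apply(
    lambda x: x.unique().tolist())

# The result is indexed by the group keys, not by df's row labels
print(list(grouped.index.names))
# Number of groups versus number of rows in the original frame
print(len(grouped), len(df))
```

Because the two indexes do not line up, the result has to be broadcast back to the rows of each group before it can be assigned as a column.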

Answer 1

Answered by Roman Pekar

Merge grouped result with the initial DataFrame:

>>> df1 = df.groupby(['id','country'])['source'].apply(
             lambda x: x.tolist()).reset_index()

>>> df1
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        3       [1.0]
3  2        1       [1.0]

>>> df2 = df[['id', 'country']]
>>> df2
  id  country
1  1        1
2  1        2
3  1        2
4  1        3
5  2        1

>>> pd.merge(df1, df2, on=['id', 'country'])
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        2  [1.0, 2.0]
3  1        3       [1.0]
4  2        1       [1.0]
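
A variation on the merge above, as a minimal sketch: it keeps the original source column and attaches the lists under a new name by left-merging onto the full frame. The names `source_lists` and `result` are illustrative, and the use of unique() follows the question rather than this answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

# One row per (id, country) group, with the unique sources as a list
source_lists = (
    df.groupby(['id', 'country'])['source']
      .apply(lambda x: x.unique().tolist())
      .rename('source_list')
      .reset_index()
)

# A left merge keeps every original row and repeats the list within each group
result = df.merge(source_lists, on=['id', 'country'], how='left')
print(result)
```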

Answer 2

Answered by David O'Neill

This can be achieved without the merge by reassigning the result of the groupby.apply to the original dataframe.

# group_keys=False keeps the original row index instead of prepending the group keys
df = df.groupby(['id', 'country'], group_keys=False).apply(_add_sourcelist_col)

with your _add_sourcelist_col function being:

def _add_sourcelist_col(group):
    # Repeat the list of unique sources once per row of the group
    group['source_list'] = [list(set(group['source'].tolist()))] * len(group)
    return group

Note that additional columns can also be added in your defined function. Simply add them to each group dataframe, and be sure to return the group at the end of the function.
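
As a rough sketch of adding more than one column per group: the function name `add_group_columns` and the `n_sources` column are hypothetical, chosen only for illustration. It assumes a pandas version where `group_keys=False` is available; newer releases may warn about, or exclude, the grouping columns inside apply, so only the `source` column is read here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

def add_group_columns(group):
    # Work on a copy so the caller's data is not modified in place
    group = group.copy()
    unique_sources = sorted(set(group['source'].tolist()))
    # Index-aligned Series so every row receives the whole list
    group['source_list'] = pd.Series(
        [unique_sources] * len(group), index=group.index)
    # A scalar is broadcast to every row of the group
    group['n_sources'] = len(unique_sources)
    return group

out = (
    df.groupby(['id', 'country'], group_keys=False)
      .apply(add_group_columns)
      .sort_index()
)
print(out)
```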

Edit: I'll leave the info above as it might still be useful, but I misinterpreted part of the original question. What the OP was trying to accomplish can be done using:

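def addsource(x):
    x = x.copy()
    # Wrap the list so every row in the group receives the whole list of unique sources
    x['source_list'] = [list(set(x['source'].tolist()))] * len(x)
    return x

df = df.groupby(['id', 'country'], group_keys=False).apply(addsource)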

Answer 3

Answered by saladi

An alternative method that avoids the post-facto merge is providing the index in the function applied to each group, e.g.

def calculate_on_group(x):
    fill_val = x.unique().tolist()
    return pd.Series([fill_val] * x.size, index=x.index)

df['source_list'] = df.groupby(['id','country'])['source'].apply(calculate_on_group)
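
In newer pandas releases, apply may prepend the group keys to the index of the result even when the function returns a like-indexed Series, which would break the assignment. Passing `group_keys=False` is one way to keep the original row index; the sketch below rests on that assumption and otherwise reuses the function from this answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

def calculate_on_group(x):
    # x is the 'source' Series of one (id, country) group
    fill_val = x.unique().tolist()
    # Repeat the list for each row and keep the group's original row labels
    return pd.Series([fill_val] * x.size, index=x.index)

# group_keys=False leaves the result indexed like df, so assignment aligns
df['source_list'] = (
    df.groupby(['id', 'country'], group_keys=False)['source']
      .apply(calculate_on_group)
)
print(df)
```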
