Note: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27192072/

Date: 2020-09-13 22:43:39  Source: igfitidea

Group dataframe by multiple columns and append the result to the dataframe

Tags: pandas, pandas-groupby

Asked by user3591836

This is similar to "Attach a calculated column to an existing dataframe"; however, that solution doesn't work when grouping by more than one column in pandas v0.14.

For example:

import pandas as pd

df = pd.DataFrame([
    [1, 1, 1],
    [1, 2, 1],
    [1, 2, 2],
    [1, 3, 1],
    [2, 1, 1]],
    columns=['id', 'country', 'source'])

The following calculation works:

df.groupby(['id','country'])['source'].apply(lambda x: x.unique().tolist())

0       [1]
1    [1, 2]
2    [1, 2]
3       [1]
4       [1]
Name: source, dtype: object

But assigning the output to a new column results in an error:

df['source_list'] = df.groupby(['id','country'])['source'].apply(
                               lambda x: x.unique().tolist())

TypeError: incompatible index of inserted column with frame index
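
One likely cause, at least on recent pandas releases (an assumption; the question targets v0.14, where the printed output differs), is that the grouped result is labelled by the group keys rather than by the frame's row labels, so pandas cannot align it with the frame on assignment. A minimal sketch that inspects the two indexes; the variable name `grouped` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

# One list of unique sources per (id, country) group.
grouped = df.groupby(['id', 'country'])['source'].apply(
    lambda x: x.unique().tolist())

# The result has one entry per group, labelled by the group keys...
print(list(grouped.index.names))
print(len(grouped))

# ...whereas the frame has one entry per row, with its own row labels.
print(len(df))
```

Because the two indexes have different labels and lengths, the grouped values have to be mapped back onto the rows explicitly, which is what the answers below do in different ways.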

Answered by Roman Pekar

Merge grouped result with the initial DataFrame:

>>> df1 = df.groupby(['id','country'])['source'].apply(
             lambda x: x.tolist()).reset_index()

>>> df1
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        3       [1.0]
3  2        1       [1.0]

>>> df2 = df[['id', 'country']]
>>> df2
  id  country
1  1        1
2  1        2
3  1        2
4  1        3
5  2        1

>>> pd.merge(df1, df2, on=['id', 'country'])
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        2  [1.0, 2.0]
3  1        3       [1.0]
4  2        1       [1.0]
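
To attach the lists to the original frame itself, the same idea can be written as a left merge, which keeps every original row, its `source` value, and the original row order. This is a sketch rather than part of the answer above; the names `keys`, `source_lists`, `source_list`, and `result` are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

keys = ['id', 'country']

# One row per group, with the list stored in a new 'source_list' column.
source_lists = (
    df.groupby(keys)['source']
      .apply(lambda x: x.unique().tolist())
      .reset_index(name='source_list'))

# Left merge: every row of df is kept and receives its group's list.
result = df.merge(source_lists, on=keys, how='left')
print(result)
```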

Answered by David O'Neill

This can be achieved without the merge by reassigning the result of groupby.apply to the original dataframe.

df = df.groupby(['id', 'country']).apply(lambda group: _add_sourcelist_col(group))

with your _add_sourcelist_col function being:

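def _add_sourcelist_col(group):
    # A DataFrame has no tolist(); read the 'source' column instead, and
    # repeat the list once per row so every row holds the whole list
    group['source_list'] = [list(set(group['source'].tolist()))] * len(group)
    return group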

Note that additional columns can also be added in your defined function. Simply add them to each group dataframe, and be sure to return the group at the end of the function.
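
As a sketch of adding more than one column per group, the helper below attaches both the list of unique sources and a count. The names `add_source_columns`, `n_sources`, `extra`, and `result` are hypothetical, and selecting `[['source']]` before `apply` plus joining back on the row index are choices made here to keep the example independent of how a given pandas version treats the grouping columns:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])


def add_source_columns(group):
    # Hypothetical helper: `group` holds the 'source' column of one
    # (id, country) group. assign() returns a new frame, so the group
    # passed in by pandas is not mutated.
    unique_sources = group['source'].unique().tolist()
    return group.assign(
        source_list=[unique_sources] * len(group),  # same list on every row
        n_sources=len(unique_sources))              # scalar, broadcast to rows


# group_keys=False keeps the original row labels on the result.
extra = (
    df.groupby(['id', 'country'], group_keys=False)[['source']]
      .apply(add_source_columns))

# Align the new columns with the original frame on the row index.
result = df.join(extra[['source_list', 'n_sources']])
print(result)
```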

Edit: I'll leave the info above as it might still be useful, but I misinterpreted part of the original question. What the OP was trying to accomplish can be done using:

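def addsource(x):
    # Repeat the list once per row so every row in the group holds the full list
    x['source_list'] = [list(set(x.source.tolist()))] * len(x)
    return x

# group_keys=False keeps the original row index instead of prepending the group keys
df = df.groupby(['id', 'country'], group_keys=False).apply(addsource)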

Answered by saladi

An alternative method that avoids the post-facto merge is to provide the index in the function applied to each group, e.g.

def calculate_on_group(x):
    fill_val = x.unique().tolist()
    return pd.Series([fill_val] * x.size, index=x.index)

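# group_keys=False keeps the row labels returned by calculate_on_group, so the
# result aligns with df on assignment
df['source_list'] = df.groupby(['id','country'], group_keys=False)['source'].apply(calculate_on_group)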
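
A related option, not taken from the answers above, is to keep the grouped result indexed by the group keys and let `DataFrame.join` look the lists up through the key columns. A minimal sketch, with `keys`, `source_lists`, and `result` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

keys = ['id', 'country']

# Series of lists indexed by (id, country); the name becomes the new column name.
source_lists = (
    df.groupby(keys)['source']
      .apply(lambda x: x.unique().tolist())
      .rename('source_list'))

# Join the key columns of df against the (id, country) index of source_lists.
result = df.join(source_lists, on=keys)
print(result)
```

The join is a left join by default, so the rows of `df` keep their original labels and order.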