Python Pandas:Groupby Sum 和连接字符串

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/47600818/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:51:04  来源:igfitidea点击:

Python Pandas: Groupby Sum AND Concatenate Strings

pythonpandas

提问by kevintrankt

Sample Pandas Dataframe:

示例 Pandas 数据框:

ID Name COMMENT1 COMMENT2 NUM
1  dan  hi       hello    1
1  dan  you      friend   2
3  jon  yeah     nope     3
2  jon  dog      cat      .5
3  jon  yes      no       .1

I am trying to create a dataframe that groups by ID and NAME that concatenates COMMENT1 and COMMENT2 that also sums NUM.

我正在尝试创建一个按 ID 和 NAME 分组的数据框,该数据框将 COMMENT1 和 COMMENT2 连接起来,它们的总和也为 NUM。

This is what I'm looking for:

这就是我要找的:

ID Name COMMENT1     COMMENT2        NUM
1  dan  hi you       hello friend    3
3  jon  yeah yes     nope no         3.1
2  jon  dog          cat             .5

I tried using this:

我尝试使用这个:

input_df = input_df.groupby(['ID', 'NAME', 'COMMENT1', 'COMMENT2']).sum().reset_index()

But it doesn't work.

但它不起作用。

If I use this:

如果我使用这个:

input_df = input_df.groupby(['ID']).sum().reset_index()

It sums the NUM column but leaves out all other columns.

它对 NUM 列求和,但忽略了所有其他列。

回答by YOBEN_S

Let us make it into one line

让我们把它变成一行

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]: 
   ID Name  COMMENT1      COMMENT2  NUM
0   1  dan    hi you  hello friend  3.0
1   2  jon       dog           cat  0.5
2   3  jon  yeah yes       nope no  3.1

回答by hamx0r

You can also just tell .agg()which aggregator functions to use for each column, and for the string columns, pass ' '.join(notice there're no parenthesis since you don't want to call .joinbut rather pass it as the argument itself):

您还可以告诉.agg()每个列使用哪些聚合器函数,对于字符串列,传递' '.join(注意没有括号,因为您不想调用.join而是将其作为参数本身传递):

df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})

df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})

回答by Thom Ives

Converting your data example into a csv file, we can do the following:

将您的数据示例转换为 csv 文件,我们可以执行以下操作:

import pandas as pd

def grouping_Cols_by_Cols(DF, grouping_Columns, num_Columns):
    # numerical columns can mess us up ...
    column_Names = DF.columns.tolist()
    # so, convert all columns' values to strings
    for column_Name in column_Names:
        DF[column_Name] = DF[column_Name].map(str) + ' '
    DF = DF.groupby(by=grouping_Columns).sum()

    # NOW, convert the numerical string columns to an expression ...
    for num_Col in num_Columns:
        column_Names = DF.columns.tolist()
        num_Col_i = column_Names.index(num_Col)
        for i in range(len(DF)):
            String = DF[num_Col].iloc[i] 
            value = eval(String.rstrip(' ').replace(' ','+'))
            DF.iat[i,num_Col_i] = value

    return DF

###############################################################
### Operations Section
###############################################################

df = pd.read_csv("UnCombinedData.csv")

grouping_Columns = ['ID','Name']
num_Columns = ['NUM']
df = grouping_Cols_by_Cols(df,grouping_Columns, num_Columns)

print df

With a little more work, the defined function could auto detect, which columns have numbers in them and add them to a numerical columns list.

多做一点工作,定义的函数可以自动检测哪些列中有数字并将它们添加到数字列列表中。

I think this is similar, but not exact, to problems and challenges encountered in this post.

我想这是类似的,但不完全一样,在遇到问题和挑战这个职位