Python Pandas:Groupby Sum 和连接字符串
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/47600818/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python Pandas: Groupby Sum AND Concatenate Strings
提问by kevintrankt
Sample Pandas Dataframe:
示例 Pandas 数据框:
ID Name COMMENT1 COMMENT2 NUM 1 dan hi hello 1 1 dan you friend 2 3 jon yeah nope 3 2 jon dog cat .5 3 jon yes no .1
I am trying to create a dataframe that groups by ID and NAME that concatenates COMMENT1 and COMMENT2 that also sums NUM.
我正在尝试创建一个按 ID 和 NAME 分组的数据框,该数据框将 COMMENT1 和 COMMENT2 连接起来,它们的总和也为 NUM。
This is what I'm looking for:
这就是我要找的:
ID Name COMMENT1 COMMENT2 NUM 1 dan hi you hello friend 3 3 jon yeah yes nope no 3.1 2 jon dog cat .5
I tried using this:
我尝试使用这个:
input_df = input_df.groupby(['ID', 'NAME', 'COMMENT1', 'COMMENT2']).sum().reset_index()
But it doesn't work.
但它不起作用。
If I use this:
如果我使用这个:
input_df = input_df.groupby(['ID']).sum().reset_index()
It sums the NUM column but leaves out all other columns.
它对 NUM 列求和,但忽略了所有其他列。
回答by YOBEN_S
Let us make it into one line
让我们把它变成一行
df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]:
ID Name COMMENT1 COMMENT2 NUM
0 1 dan hi you hello friend 3.0
1 2 jon dog cat 0.5
2 3 jon yeah yes nope no 3.1
回答by hamx0r
You can also just tell .agg()
which aggregator functions to use for each column, and for the string columns, pass ' '.join
(notice there're no parenthesis since you don't want to call .join
but rather pass it as the argument itself):
您还可以告诉.agg()
每个列使用哪些聚合器函数,对于字符串列,传递' '.join
(注意没有括号,因为您不想调用.join
而是将其作为参数本身传递):
df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})
df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})
回答by Thom Ives
Converting your data example into a csv file, we can do the following:
将您的数据示例转换为 csv 文件,我们可以执行以下操作:
import pandas as pd
def grouping_Cols_by_Cols(DF, grouping_Columns, num_Columns):
# numerical columns can mess us up ...
column_Names = DF.columns.tolist()
# so, convert all columns' values to strings
for column_Name in column_Names:
DF[column_Name] = DF[column_Name].map(str) + ' '
DF = DF.groupby(by=grouping_Columns).sum()
# NOW, convert the numerical string columns to an expression ...
for num_Col in num_Columns:
column_Names = DF.columns.tolist()
num_Col_i = column_Names.index(num_Col)
for i in range(len(DF)):
String = DF[num_Col].iloc[i]
value = eval(String.rstrip(' ').replace(' ','+'))
DF.iat[i,num_Col_i] = value
return DF
###############################################################
### Operations Section
###############################################################
df = pd.read_csv("UnCombinedData.csv")
grouping_Columns = ['ID','Name']
num_Columns = ['NUM']
df = grouping_Cols_by_Cols(df,grouping_Columns, num_Columns)
print df
With a little more work, the defined function could auto detect, which columns have numbers in them and add them to a numerical columns list.
多做一点工作,定义的函数可以自动检测哪些列中有数字并将它们添加到数字列列表中。
I think this is similar, but not exact, to problems and challenges encountered in this post.
我想这是类似的,但不完全一样,在遇到问题和挑战这个职位。