Pandas sum by groupby, but exclude certain columns
Original question: http://stackoverflow.com/questions/32751229/
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under that same license, but you must attribute it to the original authors (not me): StackOverflow
Asked by user308827
What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code  Country      Item_Code  Item   Ele_Code  Unit  Y1961  Y1962  Y1963
2     Afghanistan  15         Wheat  5312      Ha    10     20     30
2     Afghanistan  25         Maize  5312      Ha    10     20     30
4     Angola       15         Wheat  7312      Ha    30     40     50
4     Angola       25         Maize  7312      Ha    30     40     50
I want to group by the columns Country and Item_Code and compute the sum only over the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code  Country      Item_Code  Item  Ele_Code  Unit  Y1961  Y1962  Y1963
2     Afghanistan  15         C3    5312      Ha    20     40     60
4     Angola       25         C4    7312      Ha    60     80     100
Right now I am doing this:
df.groupby('Country').sum()
However, this also adds up the values in the Item_Code column. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
Accepted answer by Andy Hayden
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50
Note that the list passed must be a subset of the dataframe's columns; otherwise you'll see a KeyError.
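For a self-contained check of the selection above, the sketch below rebuilds the question's sample rows as a DataFrame and applies the same column selection. The construction is an assumption, since the original shows only the printed table. It also shows the KeyError raised when a requested column does not exist; `Y1999` is a made-up name used only to trigger it.

```python
import pandas as pd

# Sample data re-typed from the question's table (assumed construction)
df = pd.DataFrame({
    "Code": [2, 2, 4, 4],
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Ele_Code": [5312, 5312, 7312, 7312],
    "Unit": ["Ha", "Ha", "Ha", "Ha"],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

year_cols = ["Y1961", "Y1962", "Y1963"]

# Group by both keys; only the selected year columns are summed
by_country_item = df.groupby(["Country", "Item_Code"])[year_cols].sum()

# Group by country alone; Item_Code is no longer summed because it is not selected
by_country = df.groupby("Country")[year_cols].sum()

# Selecting a column that is not in the frame raises KeyError
try:
    df.groupby("Country")[["Y1961", "Y1999"]].sum()
    missing_column_error = None
except KeyError as exc:
    missing_column_error = exc

print(by_country_item)
print(by_country)
print(repr(missing_column_error))
```

Grouping by Country alone gives the per-country totals the question's expected table shows in its year columns, while grouping by both keys keeps one row per Country and Item_Code pair.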
Answer by leroyJr
The agg function will do this for you. Pass a dict that maps each column to the function, or list of functions, to apply to it:
import numpy as np

# 'Y1962' gets two aggregations, producing two output columns from a single input column
df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]})
The result contains only the group-by columns and the aggregated columns you specified. In this example, two aggregation functions are applied to 'Y1962'.
To get exactly the output you hoped for, include the other columns in the group-by and sum the Y columns:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
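As a rough, runnable version of the two agg calls above, the sketch below uses the string names "sum" and "mean" in place of np.sum and np.mean (a substitution on my part; pandas accepts both forms) and adds reset_index() so the group keys come back as ordinary columns. The sample frame is re-typed from the question and the variable names are only illustrative.

```python
import pandas as pd

# Sample data re-typed from the question's table (assumed construction)
df = pd.DataFrame({
    "Code": [2, 2, 4, 4],
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Ele_Code": [5312, 5312, 7312, 7312],
    "Unit": ["Ha", "Ha", "Ha", "Ha"],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

id_cols = ["Code", "Country", "Item_Code", "Item", "Ele_Code", "Unit"]
year_cols = ["Y1961", "Y1962", "Y1963"]

# One function per column: build the dict instead of typing each entry
flat = (
    df.groupby(id_cols)
      .agg({col: "sum" for col in year_cols})
      .reset_index()  # turn the group keys back into regular columns
)

# A list of functions for one column yields two-level column labels
stats = df.groupby("Country").agg({"Y1961": "sum", "Y1962": ["sum", "mean"]})

print(flat)
print(stats)
```

Because every combination of the six identifier columns is unique in this sample, `flat` has the same four rows as the input; the sums only collapse rows when identifiers repeat.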
Answer by Superstar
If you are looking for a more general way to apply this to many columns, build a list of column names and use it to index the grouped dataframe. In your case, for example:
columns = ['Y' + str(year) for year in range(1967, 2011)]
df.groupby('Country')[columns].agg('sum')
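One possible variation, sketched here under the assumption that every year column starts with "Y" and no other column does, is to derive the list from the frame itself so it cannot name columns that are missing. The same result can also be reached from the other direction by dropping the unwanted columns before grouping. The small frame and the `excluded` list are illustrative only.

```python
import pandas as pd

# Reduced sample with the same year columns as the question (assumed construction)
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

# Inclusion: pick the year columns by name pattern
year_cols = [col for col in df.columns if col.startswith("Y")]
totals = df.groupby("Country")[year_cols].agg("sum")

# Exclusion: drop the columns that should not be summed, then group
excluded = ["Item_Code"]
totals_by_exclusion = df.drop(columns=excluded).groupby("Country").sum()

print(totals)
print(totals.equals(totals_by_exclusion))
```

The inclusion form is safer when the frame also holds text columns, since only the listed columns ever reach the sum.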