Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/32751229/

Time: 2020-08-19 12:10:47  Source: igfitidea

Pandas sum by groupby, but exclude certain columns

python pandas group-by aggregate

Asked by user308827

What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          Wheat   5312        Ha      10       20      30
2      Afghanistan  25          Maize   5312        Ha      10       20      30
4      Angola       15          Wheat   7312        Ha      30       40      50
4      Angola       25          Maize   7312        Ha      30       40      50

I want to group by the columns Country and Item_Code and compute the sum only over the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:

Code   Country      Item_Code   Item    Ele_Code    Unit    Y1961    Y1962   Y1963
2      Afghanistan  15          C3      5312        Ha      20       40      60
4      Angola       25          C4      7312        Ha      60       80      100

Right now I am doing this:

df.groupby('Country').sum()

However, this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?

Accepted answer by Andy Hayden

You can select the columns of a groupby:

In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50

Note that the list passed must be a subset of the frame's columns; otherwise you'll see a KeyError.
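
As a rough, self-contained sketch of the column selection described above: the frame below is rebuilt by hand from the table in the question, and the names `year_cols`, `by_country`, `flat` and `missing_raises` are illustrative, not part of the original answer. Grouping by Country alone appears to reproduce the totals shown in the question's expected output.

```python
import pandas as pd

# Sample frame copied by hand from the table in the question.
df = pd.DataFrame({
    "Code": [2, 2, 4, 4],
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Ele_Code": [5312, 5312, 7312, 7312],
    "Unit": ["Ha", "Ha", "Ha", "Ha"],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

year_cols = ["Y1961", "Y1962", "Y1963"]  # illustrative name

# Group by Country only and sum just the year columns; Item_Code is never summed.
by_country = df.groupby("Country")[year_cols].sum()

# as_index=False keeps the grouping key as an ordinary column instead of the index.
flat = df.groupby("Country", as_index=False)[year_cols].sum()

# Selecting a column that is not in the frame raises KeyError.
try:
    df.groupby("Country")[["Y1961", "Y1999"]].sum()
    missing_raises = False
except KeyError:
    missing_raises = True
```

Whether to group by Country alone or by Country and Item_Code depends on the level of detail wanted; the selection syntax is the same in both cases.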

Answer by leroyJr

The agg function will do this for you. Pass a dict that maps each column to the function (or list of functions) to apply to it:

df.groupby(['Country', 'Item_Code']).agg({'Y1961': 'sum', 'Y1962': ['sum', 'mean']})  # Y1962 yields two output columns: its sum and its mean

This will display only the group by columns, and the specified aggregate columns. In this example I included two agg functions applied to 'Y1962'.
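
A minimal sketch of what the dict form returns, assuming a cut-down copy of the question's frame. When any column gets a list of functions, the result carries two-level column labels. Named aggregation (available since pandas 0.25) is shown next to it as one possible way to get flat names; it is an alternative, not what the answer itself uses, and the output names `Y1961_sum`, `Y1962_sum` and `Y1962_mean` are made up for illustration.

```python
import pandas as pd

# Cut-down copy of the question's frame (only the columns used here).
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
})

grouped = df.groupby(["Country", "Item_Code"])

# Dict form: a list of functions produces two-level (column, function) labels.
nested = grouped.agg({"Y1961": "sum", "Y1962": ["sum", "mean"]})

# Named aggregation: output_name=(input_column, function) gives flat labels.
flat = grouped.agg(
    Y1961_sum=("Y1961", "sum"),
    Y1962_sum=("Y1962", "sum"),
    Y1962_mean=("Y1962", "mean"),
)
```

Columns that are not mentioned in either form (for example Y1963 in the full frame) are simply left out of the result.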

To get exactly what you hoped to see, include the other columns in the group by, and apply sums to the Y variables in the frame:

df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': 'sum', 'Y1962': 'sum', 'Y1963': 'sum'})
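
Listing every grouping column by hand can get tedious on wider frames. One possible sketch, assuming the year columns are known up front, derives the grouping keys as "everything except the columns to sum". The names `year_cols`, `key_cols` and `result` are illustrative only.

```python
import pandas as pd

# Sample frame copied by hand from the table in the question.
df = pd.DataFrame({
    "Code": [2, 2, 4, 4],
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Ele_Code": [5312, 5312, 7312, 7312],
    "Unit": ["Ha", "Ha", "Ha", "Ha"],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

year_cols = ["Y1961", "Y1962", "Y1963"]

# Every column that is not summed becomes a grouping key.
key_cols = [c for c in df.columns if c not in year_cols]

# as_index=False returns the keys as regular columns, in the original layout.
result = df.groupby(key_cols, as_index=False)[year_cols].sum()
```

With this sample data every row has a distinct key combination (Item differs within each country), so no rows are collapsed; the approach only merges rows that agree on all key columns.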

Answer by Superstar

If you are looking for a more general approach that scales to many columns, you can build a list of column names and use it to index the grouped dataframe. In your case, for example:

# Adjust the range to the year columns actually present in df.
columns = ['Y' + str(year) for year in range(1967, 2011)]

df.groupby('Country')[columns].agg('sum')
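
If the exact year range is not known, the list can also be derived from the frame itself. The following is only a sketch, under the assumption that every year column is named "Y" followed by four digits; the regular expression and the name `totals` are illustrative, and the frame is a cut-down copy of the question's sample.

```python
import pandas as pd

# Cut-down copy of the question's frame.
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Angola", "Angola"],
    "Item_Code": [15, 25, 15, 25],
    "Y1961": [10, 10, 30, 30],
    "Y1962": [20, 20, 40, 40],
    "Y1963": [30, 30, 50, 50],
})

# Keep only the labels that look like "Y" followed by four digits.
columns = df.filter(regex=r"^Y\d{4}$").columns.tolist()

# Same pattern as above: index the grouped frame with the list, then aggregate.
totals = df.groupby("Country")[columns].agg("sum")
```

Deriving the list this way avoids the KeyError that a hard-coded range raises when it names a year the frame does not contain.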