Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/47551251/

Date: 2020-09-14 04:50:17  Source: igfitidea

Python Pandas groupby apply lambda arguments

Tags: python, pandas, lambda, pandas-groupby

Asked by g_uint

In a Coursera video about Python Pandas groupby (in the Introduction to Data Science in Python course), the following example is given:

df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity')

Where df is a DataFrame, and the lambda is applied to calculate the sum of the element-wise product of two columns. If I understand correctly, the groupby object (returned by groupby) that the apply function is called on is a series of tuples, each consisting of the index value that was grouped by and the part of the DataFrame that belongs to that specific grouping.

What I don't understand is the way that the lambda is used:

There are three arguments specified (lambda df,a,b), but only two are explicitly passed ('Weight (oz.)' and 'Quantity'). How does the interpreter know that arguments 'a' and 'b' are the ones specified as arguments and df is used 'as-is'?

I have looked at the docs but could not find a definitive answer for such a specific example. I think this has something to do with df being in scope, but I cannot find information to support and detail that thought.

Accepted answer by RSHAP

The apply method itself passes each "group" of the groupby object as the first argument to the function. The extra arguments 'Weight (oz.)' and 'Quantity' are then bound to a and b by position (i.e. they are the 2nd and 3rd arguments if you count the "group" as the first).

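import numpy as np
import pandas as pd
# random integers in [0, 10], so the exact values differ on each run
df = pd.DataFrame(np.random.randint(0,11,(10,3)), columns = ['num1','num2','num3'])
df['category'] = ['a','a','a','b','b','b','b','c','c','c']
df = df[['category','num1','num2','num3']]
df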

  category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4
3        b    10     9     1
4        b     4     7     6
5        b     0     5     2
6        b     7     7     5
7        c     2     2     1
8        c     4     3     2
9        c     1     4     6

gb = df.groupby('category')

The implicit argument is each "group", or in this case each category:

gb.apply(lambda grp: grp.sum()) 

"grp" is the first argument to the lambda function. Notice that I don't have to pass anything for it: apply automatically fills it with each group of the groupby object.

         category  num1  num2  num3
category                           
a             aaa    14    13     8
b            bbbb    21    28    14
c             ccc     7     9     9

So apply goes through each of these groups and performs a sum operation:

print(gb.groups)
{'a': Int64Index([0, 1, 2], dtype='int64'), 'b': Int64Index([3, 4, 5, 6], dtype='int64'), 'c': Int64Index([7, 8, 9], dtype='int64')}

print('1st GROUP:\n', df.loc[gb.groups['a']])
1st GROUP:
  category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4    


print('SUM of 1st group:\n', df.loc[gb.groups['a']].sum())

SUM of 1st group:
category    aaa
num1         14
num2         13
num3          8
dtype: object

Notice how this is the same as the first row of the previous apply result.

So apply is implicitly passing each group to the function as its first argument.
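The forwarding idea can be sketched in plain Python. Note that `apply_like` and the `groups` dict below are hypothetical stand-ins for illustration only, not pandas' actual implementation:

```python
def apply_like(groups, func, *args, **kwargs):
    # Hypothetical stand-in for GroupBy.apply: the group always goes first,
    # then whatever extra arguments the caller supplied.
    return {name: func(group, *args, **kwargs) for name, group in groups.items()}

# Toy "groups": group name -> values belonging to that group (made-up data)
groups = {"a": [2, 5, 7], "b": [10, 4, 0, 7]}

# The lambda declares two parameters, but the caller passes only `factor`;
# `grp` is filled in by apply_like itself.
scaled_totals = apply_like(groups, lambda grp, factor: sum(grp) * factor, 2)
print(scaled_totals)
```

The caller never names the group; it only supplies the arguments that come after it.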

From the docs:

GroupBy.apply(func, *args, **kwargs)

args, kwargs : tuple and dict

Optional positional and keyword arguments to pass to func

Additional arguments passed in "*args" get passed after the implicit group argument.
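Since the documented signature also accepts **kwargs, the extra arguments can be given by name instead of by position. A minimal sketch, assuming a small made-up frame and a hypothetical helper `weighted_sum` (the columns are selected explicitly so that each group contains only num1 and num2):

```python
import pandas as pd

# Small made-up frame; the column names mirror the example in this answer
df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "num1": [1, 2, 3, 4],
    "num2": [10, 20, 30, 40],
})
gb = df.groupby("category")[["num1", "num2"]]

def weighted_sum(group, a, b):
    # `group` is supplied by apply; `a` and `b` come from the caller
    return (group[a] * group[b]).sum()

by_position = gb.apply(weighted_sum, "num1", "num2")      # via *args
by_keyword = gb.apply(weighted_sum, a="num1", b="num2")   # via **kwargs
print(by_position)
```

Both calls should give the same per-category result; the keyword form just makes it explicit which parameter each column name is bound to.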

So, using your code:

gb.apply(lambda df,a,b: sum(df[a] * df[b]), 'num1', 'num2')

category
a     56
b    167
c     20
dtype: int64

Here 'num1' and 'num2' are being passed as additional arguments to each call of the lambda function.

So apply goes through each of these groups and performs your lambda operation:

# copy and paste your lambda function
fun = lambda df,a,b: sum(df[a] * df[b])

print(gb.groups)
{'a': Int64Index([0, 1, 2], dtype='int64'), 'b': Int64Index([3, 4, 5, 6], dtype='int64'), 'c': Int64Index([7, 8, 9], dtype='int64')}

print('1st GROUP:\n', df.loc[gb.groups['a']])

1st GROUP:
   category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4

print('Output of 1st group for function "fun":\n', 
fun(df.loc[gb.groups['a']], 'num1','num2'))

Output of 1st group for function "fun":
56
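The same check can be repeated for every group with an explicit loop. This is a rough sketch of what apply does conceptually, not its real implementation; it rebuilds the sample values shown in the table above, and the names `manual` and `applied` are introduced here for illustration:

```python
import pandas as pd

# The sample values printed earlier in this answer
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"],
    "num1": [2, 5, 7, 10, 4, 0, 7, 2, 4, 1],
    "num2": [5, 5, 3, 9, 7, 5, 7, 2, 3, 4],
})
fun = lambda df, a, b: sum(df[a] * df[b])

gb = df.groupby("category")[["num1", "num2"]]

# Iterating a groupby yields (group name, sub-DataFrame) pairs
manual = {}
for name, grp in gb:
    manual[name] = fun(grp, "num1", "num2")  # the group goes first, then the extras

applied = gb.apply(fun, "num1", "num2")
print(manual)
print(applied)
```

The loop produces 56, 167 and 20 for groups a, b and c, matching the apply output shown earlier.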