Python: apply vs transform on a group object

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/27517425/

Time: 2020-08-19 01:54:40  Source: igfitidea

Apply vs transform on a group object

python pandas

Asked by Amelio Vazquez-Reina

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

Accepted answer by Ted Petrou

Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
    • apply implicitly passes all the columns for each group as a DataFrame to the custom function,
    • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
    • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
    • The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
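
To make the input/output contrast above concrete, here is a minimal sketch. The small State/a/b frame and the names per_group and per_row are assumptions chosen for illustration, and the value columns are selected explicitly so the grouping column is not passed to the function:

```python
import pandas as pd

# Illustrative sample data (an assumption, not part of the question).
df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# Select the value columns explicitly so only 'a' and 'b' reach the function.
grouped = df.groupby('State')[['a', 'b']]

# apply: the function receives a whole DataFrame per group, so it can combine
# columns and return a single scalar; the result has one entry per group.
per_group = grouped.apply(lambda g: (g['a'] - g['b']).mean())

# transform: each column is handled on its own, and the result keeps the
# original row index and shape.
per_row = grouped.transform(lambda s: s - s.mean())

print(per_group)
print(per_row)
```

Here per_group is indexed by State (two entries), while per_row has the same four rows as df.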

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution can be stopped.

def inspect(x):
    print(type(x))
    raise  # bare raise with no active exception -> RuntimeError, which stops execution

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. pandas runs the first group twice (at least in the version used here). It does this to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform:

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series - a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It is not possible for it to act on two columns at the same time. So, if we try to subtract column b from column a inside our custom function, we get an error with transform. See below:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError because pandas is attempting to find the Series index a, which does not exist. You can complete this operation with apply, as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
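
If what you are after is a per-row value that depends on both columns, one option (a sketch, not the only way) is to combine the columns first and only then group. The name group_mean_diff is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# Combine the two columns first; the result is an ordinary Series on the original index.
diff = df['a'] - df['b']

# Group that single Series by the State column and broadcast each group's mean
# back onto its rows. transform is fine here because it only sees one Series.
group_mean_diff = diff.groupby(df['State']).transform('mean')

print(group_mean_diff)
```

The result has one value per original row (-3.5 for the Texas rows, -5.0 for the Florida rows), so it can be assigned straight back to df as a new column.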



Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']




Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, then an error is raised:

import numpy as np

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208
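
Since rand_group_len produces different numbers on every run, a deterministic variant may be easier to check by hand. The sketch below assumes the same State/a/b frame and uses a running total within each group; the name running is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# cumsum returns a Series with exactly as many rows as the group it receives,
# which is the shape transform expects.
running = df.groupby('State')[['a', 'b']].transform(lambda s: s.cumsum())

print(running)
```

Each output row lines up with the input row of the same index: the Texas rows give 4, 9 for a and 6, 16 for b, and the Florida rows give 1, 4 for a and 3, 14 for b.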


Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
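
For common reductions, transform also accepts the name of the aggregation as a string, which should give the same values as the custom function. A small sketch, where the names by_function, by_name and the a_total/b_total columns are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

grouped = df.groupby('State')[['a', 'b']]

# A custom function returning one scalar per column per group ...
by_function = grouped.transform(lambda s: s.sum())

# ... and the built-in aggregation referred to by name.
by_name = grouped.transform('sum')

# Because the result is aligned with df's index, it can be attached as new columns.
df_with_totals = df.assign(a_total=by_name['a'], b_total=by_name['b'])
print(df_with_totals)
```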

Answer by Primer

As I felt similarly confused by the .transform operation vs. .apply, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.

My takeaway so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns, and it actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return a series (or group of series) made of scalars which are repeated len(input_column) times.

So this scalar, which .transform uses to make the Series, is the result of some reduction function applied to an input Series (and only to ONE series/column at a time).

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

Which is exactly the same as if you used it on only one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)
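
One way to get .apply working on the dataframe is to hand it only the numeric columns. This is a sketch under assumptions: the fixed random seed and the names via_apply and via_transform are mine, the non-numeric columns may not be the only cause of the error above, and the exact error depends on the pandas version:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed here only for reproducibility
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': rng.standard_normal(8), 'D': rng.standard_normal(8)})

zscore = lambda x: (x - x.mean()) / x.std()

# Restrict each group to the numeric columns before calling apply;
# group_keys=False keeps the group label out of the result index.
via_apply = df.groupby('A', group_keys=False)[['C', 'D']].apply(zscore)

# The transform version, which standardises each column separately.
via_transform = df.groupby('A')[['C', 'D']].transform(zscore)

# Compare on a sorted index so row order does not matter.
print(np.allclose(via_apply.sort_index(), via_transform.sort_index()))
```

With the columns restricted like this, both calls compute the same per-group z-scores.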

So where else is .transform useful? The simplest case is trying to assign the results of a reduction function back to the original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform('sum')
df.sort_values('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply would return a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar    3.973
foo    4.373

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179
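
Here is the same filtering idea as a reproducible sketch. The values in D are made up (the original data is random), and groupby(...).filter is shown only as a possible alternative that should keep the same rows:

```python
import pandas as pd

# Hand-picked values for D (an assumption) so the group sums are easy to verify:
# one -> 0.5, two -> 0.75, three -> -1.75
df = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'D': [0.5, 0.25, 1.0, -0.75, -0.5, 0.25, -0.25, -1.0]})

# transform broadcasts each group's sum to its rows, giving a row-aligned boolean mask.
mask = df.groupby('B')['D'].transform('sum') < -1
kept_by_mask = df[mask]

# filter keeps or drops whole groups based on a per-group condition.
kept_by_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

print(kept_by_mask)
```

Only the two rows of group 'three' (index 3 and 7) survive in both versions.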

I hope this adds a bit more clarity.

Answer by Cheng

I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']

The DataFrame looks like this:

    id  price   
0   1   1   
1   2   2   
2   3   3   
3   1   2   
4   2   3   
5   3   1   
6   1   3   
7   2   1   
8   3   2   

There are 3 customer IDs in this table; each customer made three transactions and paid 1, 2, and 3 dollars, one amount per transaction.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

  1. Using apply:

    grouping.min()

The return looks like this:

id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3

  2. Using transform:

    grouping.transform(min)

The return looks like this:

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9    

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable one to choose.

If you want to answer "What is the difference between the amount paid for each transaction and the minimum payment?", then you want to use transform, because:

test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row

apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot easily integrate it back into the original df.
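
Putting the two results side by side as a runnable sketch (the names reduced, broadcast and above_minimum are mine, and 'min' is passed as a string rather than the built-in function):

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

reduced = grouping.min()               # one row per customer id
broadcast = grouping.transform('min')  # one row per original transaction

# Because broadcast shares test's index, row-wise arithmetic just works.
above_minimum = test['price'] - broadcast

print(len(reduced), len(broadcast))
print(above_minimum.tolist())
```

The reduced Series has 3 entries and the broadcast one has 9, which is why only the latter can be subtracted from test['price'] row by row.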

Answer by shui

tmp = df.groupby(['A'])['c'].transform('mean')

is like

tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])

or

tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)
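
A quick way to convince yourself of this equivalence is to run both on a tiny frame. The frame and the names via_transform, group_means and via_map are assumptions made for this sketch:

```python
import pandas as pd

# Small made-up frame with a grouping column 'A' and a value column 'c'.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'c': [1.0, 2.0, 3.0, 6.0]})

# Route 1: transform broadcasts each group's mean back to its rows.
via_transform = df.groupby('A')['c'].transform('mean')

# Route 2: aggregate first, then look up each row's group mean with map.
group_means = df.groupby('A')['c'].mean()
via_map = df['A'].map(group_means)

print(via_transform.tolist())
print(via_map.tolist())
```

Both routes give 2.0 for the foo rows and 4.0 for the bar rows. The two Series differ only in their name: 'c' for the transform result and 'A' for the mapped one.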