Pandas GroupBy.apply method duplicates first group

Question

Asked by NC maize breeding Jim

My first SO question: I am confused by the behavior of the apply method of groupby in pandas (0.12.0-4). It appears to apply the function TWICE to the first row of a data frame. For example:

>>> from pandas import Series, DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2

I first check that the groupby function works ok, and it seems to be fine:

>>> for group in df.groupby('class', group_keys=True):
...     print(group)
...
('A',   class  count
0     A      1)
('B',   class  count
1     B      0)
('C',   class  count
2     C      2)

Then I try to do something similar using apply on the groupby object and I get the first row output twice:

>>> def checkit(group):
...     print(group)
...
>>> df.groupby('class', group_keys=True).apply(checkit)
  class  count
0     A      1
  class  count
0     A      1
  class  count
1     B      0
  class  count
2     C      2

Any help would be appreciated! Thanks.

Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example to show that despite the double printout of the first group in the example above, the apply method operates only once on the first group and does not mutate the original data frame:

>>> def addone(group):
...     group['count'] += 1
...     return group
...
>>> df.groupby('class', group_keys=True).apply(addone)
>>> print(df)

  class  count
0     A      1
1     B      0
2     C      2

But by assigning the return of the method to a new object, we see that it works as expected:

>>> df2 = df.groupby('class', group_keys=True).apply(addone)
>>> print(df2)

  class  count
0     A      2
1     B      1
2     C      3

Answer 1

Accepted answer by Zero

This is by design, as described here and here

The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. To do this, it calls the function (checkit in your case) twice on the first group.

Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
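
As a rough sketch of what those three alternatives look like, here is each one applied to a small frame modelled on the one in the question. The extra row and the particular functions (a sum, a demeaning lambda, a sum threshold) are assumptions made up for illustration, not something from the original post:

```python
import pandas as pd

# Frame modelled on the question, with one extra "A" row (an assumption)
# so that at least one group has more than one member.
df = pd.DataFrame({"class": ["A", "B", "C", "A"], "count": [1, 0, 2, 3]})
grouped = df.groupby("class")["count"]

# aggregate: reduces each group to a single value
totals = grouped.aggregate("sum")

# transform: returns a result with the same length and index as the input
shifted = grouped.transform(lambda s: s - s.mean())

# filter: keeps the rows of every group for which the predicate is True
kept = df.groupby("class").filter(lambda g: g["count"].sum() > 1)

print(totals)
print(shifted)
print(kept)
```

Which of the three fits depends on the shape you want back: one row per group, one row per input row, or a subset of the input rows.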

However, if the function you are calling has no side effects, it most likely does not matter that it is called twice on the first group.
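
To make the side-effect point concrete, here is a minimal sketch. The calls list and the total function are hypothetical names introduced only for this illustration. The returned result is the same however many times the function runs, while the log records every invocation:

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "B", "C"], "count": [1, 0, 2]})

calls = []  # hypothetical log, here only to make the side effect visible

def total(s):
    calls.append(int(s.sum()))  # side effect: happens on every invocation
    return s.sum()              # return value: unaffected by repeat calls

result = df.groupby("class")["count"].apply(total)

print(result.to_dict())
# One entry per group on pandas >= 0.25; older versions may log an extra one.
print(len(calls))
```

If the function only computes and returns a value, the extra call costs a little time and nothing else. It starts to matter when the function prints, appends to a list, writes a file, or mutates shared state.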

Answer 2

Answer by cs95

This "issue" has now been fixed: Upgrade to 0.25+

Starting from v0.25, GroupBy.apply() will only evaluate the first group once. See GH24748.
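
If the same code has to run on both sides of that version boundary, one option is a small version check. This is only a sketch: the helper name evaluates_first_group_once is made up here and is not part of pandas, and it assumes the version string starts with "major.minor":

```python
import re

import pandas as pd

def evaluates_first_group_once(version=pd.__version__):
    """Return True if this pandas version is 0.25 or later (hypothetical helper)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 25)

print(evaluates_first_group_once())
```

On older versions you could then fall back to a function without side effects, or to iterating over the groups yourself.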

Relevant example from documentation:

import pandas as pd

pd.__version__
# '0.25.0.dev0+590.g44d5498d8'

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

def func(group):
    print(group.name)
    return group

New behaviour (>=v0.25):

df.groupby('a').apply(func)                                                                                            
x
y

   a  b
0  x  1
1  y  2

Old behaviour (<=v0.24.x):

df.groupby('a').apply(func)
x
x
y

   a  b
0  x  1
1  y  2

Pandas still uses the first group to determine whether apply can take a fast path or not. But at least it no longer has to evaluate the first group twice. Nice work, devs!

Answer 3

Answer by geosmart

You can use a for loop over the groups to avoid groupby.apply processing the first group twice:

log_sample.csv

guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null

My code snippet:

import pandas as pd

df = pd.read_csv("log_sample.csv")
grouped = df.groupby("guestid")

# Iterate over the groups directly; each group is visited exactly once.
for guestid, df_group in grouped:
    print(list(df_group['guestid']))

df.head(100)

Output:

[1]
[2, 2]
[3, 3, 3]
[4, 4, 4, 4]
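
The loop above only prints. If you also need a combined result, as apply would give you, a rough sketch is to collect one piece per group and concatenate them. The visits column and the inline CSV text are assumptions added for illustration, with the CSV being a shortened copy of log_sample.csv:

```python
import io

import pandas as pd

# Shortened inline stand-in for log_sample.csv (an assumption, so the
# example runs without a file on disk).
csv_text = "guestid,keyword\n1,null\n2,null\n2,null\n3,null\n"
df = pd.read_csv(io.StringIO(csv_text))

visited = []  # records each group key as it is processed
pieces = []
for guestid, df_group in df.groupby("guestid"):
    visited.append(guestid)
    # Hypothetical per-group work: add the group's row count as a column.
    pieces.append(df_group.assign(visits=len(df_group)))

result = pd.concat(pieces)

print(visited)
print(result)
```

Each group key shows up exactly once in visited, whatever the pandas version, because the loop body is the only place the per-group work runs.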
