Attribution: this page reproduces a popular Stack Overflow question and its answers, licensed under CC BY-SA 4.0. You are free to use and share it, but you must keep the same CC BY-SA license and attribute it to the original authors (not me), citing the original URL: http://stackoverflow.com/questions/41476436/

Pandas transform() vs apply()

Tags: python, pandas, transform, apply

Asked by 3novak

I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})

Let's identify those ids which have a nonzero entry in the cat column.

>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1     True
2     True
3    False
4     True
Name: cat, dtype: bool

Great. If we wanted to create an indicator column, however, we could do the following.


>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.
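
If the goal is simply a boolean indicator column, regardless of how a given pandas version casts the result, one option is to do the comparison before grouping, so that the grouped column is already boolean. This is a minimal sketch of a workaround, not an explanation of the casting itself; the column name has_cat is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Compare first, so the grouped Series is boolean before transform runs.
is_one = df['cat'].eq(1)

# 'any' broadcasts each group's result back onto the original rows.
# 'has_cat' is a hypothetical column name used only for this example.
df['has_cat'] = is_one.groupby(df['id']).transform('any')
```

Because the input to transform is boolean here, the result stays boolean whether or not transform tries to match the input dtype.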

When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object, apparently to match the dtype of the original mixed-type column of integers and booleans.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,0,0,True,0,0,0,0,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: object

However, when I use all booleans, the transform function returns a boolean column.


df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,False,False,True,False,False,False,False,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: bool

Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.

Accepted answer by MaxU

It looks like SeriesGroupBy.transform() tries to cast the result dtype to that of the original column, but DataFrameGroupBy.transform() doesn't seem to do that:

In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

#                         v       v
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
     cat
0   True
1   True
2   True
3   True
4   True
5   True
6   True
7  False
8  False
9   True

In [141]: df.dtypes
Out[141]:
cat    int64
id     int64
dtype: object
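
Whether the two code paths still differ may depend on the installed pandas version, so it can be worth checking directly. The sketch below runs both variants on the question's data and prints the resulting dtypes without assuming what they will be; has_one is a helper name introduced here for illustration. An explicit astype(bool) is shown as one way to get a boolean indicator either way.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

def has_one(x):
    # Hypothetical helper: True if any value in x equals 1.
    return (x == 1).any()

# SeriesGroupBy: single brackets select a Series.
series_result = df.groupby('id')['cat'].transform(has_one)

# DataFrameGroupBy: double brackets select a one-column DataFrame.
frame_result = df.groupby('id')[['cat']].transform(has_one)

# The two dtypes may or may not match, depending on the pandas version.
print(pd.__version__)
print(series_result.dtype, frame_result['cat'].dtype)

# An explicit cast makes the indicator boolean in either case.
indicator = series_result.astype(bool)
```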

Answer by ClementWalter

Just adding another illustrative example with sum as I find it more explicit:


import numpy as np
import pandas as pd

df = (
    pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
        .assign(a=lambda df: df.a > 0.5)
)

Out[70]: 
       a         b         c
0  False  0.126448  0.487302
1  False  0.615451  0.735246
2  False  0.314604  0.585689
3  False  0.442784  0.626908
4  False  0.706729  0.508398
5  False  0.847688  0.300392
6  False  0.596089  0.414652
7  False  0.039695  0.965996
8   True  0.489024  0.161974
9  False  0.928978  0.332414

df.groupby('a').apply(sum)  # drop rows

         a         b         c
a                             
False  0.0  4.618465  4.956997
True   1.0  0.489024  0.161974


df.groupby('a').transform(sum)  # keep dims

          b         c
0  4.618465  4.956997
1  4.618465  4.956997
2  4.618465  4.956997
3  4.618465  4.956997
4  4.618465  4.956997
5  4.618465  4.956997
6  4.618465  4.956997
7  4.618465  4.956997
8  0.489024  0.161974
9  4.618465  4.956997
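
The same contrast can be reproduced with fixed numbers, which makes the shapes easier to check by hand. This is a small sketch with made-up data; it uses the built-in groupby methods .sum() and .transform('sum') rather than passing Python's sum, which is a substitution on my part and not what the example above does.

```python
import pandas as pd

df = pd.DataFrame({'a': [False, False, True, False],
                   'b': [1.0, 2.0, 3.0, 4.0],
                   'c': [10.0, 20.0, 30.0, 40.0]})

# Aggregation: one row per group, indexed by the group key.
collapsed = df.groupby('a').sum()

# Transformation: one row per original row, original index preserved.
broadcast = df.groupby('a').transform('sum')

print(collapsed.shape)  # (2, 2)
print(broadcast.shape)  # (4, 2)
```

Each row of broadcast holds the sum for the group that row belongs to, so it can be assigned straight back onto df as new columns.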

However, when applied to a pd.DataFrame and not a pd.GroupBy object, I was not able to see any difference.
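
One place where a difference on a plain DataFrame can show up is with reducing functions. The sketch below, on made-up data, suggests that apply and transform agree for an element-wise function, while transform rejects a function that collapses each column to a single value. This reflects the pandas versions I am aware of, so treat it as something to verify locally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [1.0, 4.0, 9.0], 'c': [16.0, 25.0, 36.0]})

# Element-wise function: apply and transform give the same frame.
same = df.apply(np.sqrt).equals(df.transform(np.sqrt))

# Reducing function: apply returns one value per column...
col_sums = df.apply(lambda col: col.sum())

# ...while transform raises, because its output must keep the input's index.
try:
    df.transform(lambda col: col.sum())
    transform_raised = False
except ValueError:
    transform_raised = True
```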