pandas transform() vs. apply()

Question

Asked by 3novak

I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})

Let's identify those ids which have a nonzero entry in the cat column.

>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1     True
2     True
3    False
4     True
Name: cat, dtype: bool

Great. If we wanted to create an indicator column, however, we could do the following.

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.

When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object, apparently to match the dtype of the original mixed-type column of integers and booleans.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,0,0,True,0,0,0,0,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: object

However, when I use all booleans, the transform function returns a boolean column.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,False,False,True,False,False,False,False,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: bool

Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.

Answer 1

Accepted answer by MaxU

It looks like SeriesGroupBy.transform() tries to cast the result dtype to that of the original column, but DataFrameGroupBy.transform() doesn't seem to do that:

In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

#                         v       v
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
     cat
0   True
1   True
2   True
3   True
4   True
5   True
6   True
7  False
8  False
9   True

In [141]: df.dtypes
Out[141]:
cat    int64
id     int64
dtype: object
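
If a boolean indicator is the goal, one way to sidestep the cast is to make the grouped column boolean before calling transform, or to cast the result explicitly afterwards. This is only a sketch; the names has_cat and has_cat_cast are made up for illustration, and the exact casting behaviour of transform may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Build the boolean mask first, then group it by id. The grouped column is
# already bool, so the result stays bool even if transform casts back to
# the dtype of its input.
has_cat = df['cat'].eq(1).groupby(df['id']).transform('any')

# Alternative: keep the original call and cast the result explicitly.
has_cat_cast = (
    df.groupby('id')['cat']
      .transform(lambda x: (x == 1).any())
      .astype(bool)
)

print(has_cat.dtype, has_cat_cast.dtype)
```

Both Series keep the original row index, so either can be assigned straight back as a new column of df.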

Answer 2

Answer by ClementWalter

Here is another illustrative example, using sum, which I find more explicit:

import numpy as np
import pandas as pd

df = (
    pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
        .assign(a=lambda df: df.a > 0.5)
)

Out[70]: 
       a         b         c
0  False  0.126448  0.487302
1  False  0.615451  0.735246
2  False  0.314604  0.585689
3  False  0.442784  0.626908
4  False  0.706729  0.508398
5  False  0.847688  0.300392
6  False  0.596089  0.414652
7  False  0.039695  0.965996
8   True  0.489024  0.161974
9  False  0.928978  0.332414

df.groupby('a').apply(sum)  # collapses to one row per group

         a         b         c
a                             
False  0.0  4.618465  4.956997
True   1.0  0.489024  0.161974


df.groupby('a').transform(sum)  # keeps one row per original row

          b         c
0  4.618465  4.956997
1  4.618465  4.956997
2  4.618465  4.956997
3  4.618465  4.956997
4  4.618465  4.956997
5  4.618465  4.956997
6  4.618465  4.956997
7  4.618465  4.956997
8  0.489024  0.161974
9  4.618465  4.956997
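
The same contrast can be reproduced with fixed data, which makes the relationship between the two results easier to check. In this rough sketch (the names per_group, broadcast and manual are hypothetical), the grouped sum has one entry per id, while transform repeats each group's sum on every row of that group, which is the same as mapping the per-group result back onto the id column.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Reduction: one value per group, indexed by the group key.
per_group = df.groupby('id')['cat'].sum()

# Transform: the group's value repeated for each row, indexed like df.
broadcast = df.groupby('id')['cat'].transform('sum')

# Mapping the per-group result back onto the key column gives the same values.
manual = df['id'].map(per_group)

print(len(per_group), len(broadcast))
print(broadcast.tolist() == manual.tolist())
```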

However, when applied to a pd.DataFrame and not a pd.GroupBy object, I was not able to see any difference.
