Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41476436/
Pandas transform() vs apply()
Asked by 3novak
I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})
Let's identify those ids which have a nonzero entry in the cat column.
>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1 True
2 True
3 False
4 True
Name: cat, dtype: bool
Great. If we wanted to create an indicator column, however, we could do the following.
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.
When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but the column is listed as object, apparently to match the dtype of the original mixed-type column of integers and booleans.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
'cat': [True,True,0,0,True,0,0,0,0,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: object
However, when I use all booleans, the transform function returns a boolean column.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
'cat': [True,True,False,False,True,False,False,False,False,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: bool
Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.
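For readers who just need a boolean indicator column, here is a minimal sketch that sidesteps the question of what dtype transform hands back. It reuses the data frame from the question; the names has_one, per_group and has_one_via_map are made up for illustration. The first approach casts the transform result explicitly, the second maps the per-group apply result back onto the id column.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Approach 1: broadcast with transform, then force the dtype explicitly,
# so the result does not depend on whether transform casts back to int64.
has_one = (
    df.groupby('id')['cat']
      .transform(lambda x: (x == 1).any())
      .astype(bool)
)

# Approach 2: compute one value per group with apply, then look it up per row.
per_group = df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
has_one_via_map = df['id'].map(per_group)

df['has_one'] = has_one
print(df)
```

Both approaches should give the same row-wise indicator; the explicit astype(bool) is the part that makes the dtype independent of the pandas version in use.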
Accepted answer by MaxU
It looks like SeriesGroupBy.transform() tries to cast the result back to the dtype of the original column, but DataFrameGroupBy.transform() doesn't seem to do that:
In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
# note the double brackets below: [['cat']] selects a DataFrame, not a Series
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
cat
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
In [141]: df.dtypes
Out[141]:
cat int64
id int64
dtype: object
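To check this on your own installation, a small sketch like the following runs both variants side by side and prints the resulting dtypes. The helper name has_one is made up for illustration. The dtypes are deliberately not hard-coded here, since the cast-back behaviour described above may differ between pandas versions; only the values are expected to agree.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

def has_one(x):
    # True if any entry in the group equals 1
    return (x == 1).any()

# Single brackets: SeriesGroupBy.transform
series_result = df.groupby('id')['cat'].transform(has_one)

# Double brackets: DataFrameGroupBy.transform
frame_result = df.groupby('id')[['cat']].transform(has_one)

print('pandas version:         ', pd.__version__)
print('SeriesGroupBy dtype:    ', series_result.dtype)
print('DataFrameGroupBy dtype: ', frame_result['cat'].dtype)
```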
Answer by ClementWalter
Just adding another illustrative example with sum as I find it more explicit:
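import numpy as np
import pandas as pd
# pd.np is no longer available in current pandas, so NumPy is imported directly
df = (
    pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
    .assign(a=lambda df: df.a > 0.5)
)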
Out[70]:
a b c
0 False 0.126448 0.487302
1 False 0.615451 0.735246
2 False 0.314604 0.585689
3 False 0.442784 0.626908
4 False 0.706729 0.508398
5 False 0.847688 0.300392
6 False 0.596089 0.414652
7 False 0.039695 0.965996
8 True 0.489024 0.161974
9 False 0.928978 0.332414
df.groupby('a').apply(sum) # collapses each group to a single row
a b c
a
False 0.0 4.618465 4.956997
True 1.0 0.489024 0.161974
df.groupby('a').transform(sum) # keeps the original shape and index
b c
0 4.618465 4.956997
1 4.618465 4.956997
2 4.618465 4.956997
3 4.618465 4.956997
4 4.618465 4.956997
5 4.618465 4.956997
6 4.618465 4.956997
7 4.618465 4.956997
8 0.489024 0.161974
9 4.618465 4.956997
However, when applied to a pd.DataFrame rather than a pd.GroupBy object, I was not able to see any difference.
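A rough sketch of that last observation, with made-up data and illustrative variable names: for an element-wise function such as np.sqrt, DataFrame.apply and DataFrame.transform should give the same frame. They can differ for reducing functions, because DataFrame.transform is documented to require a result with the same axis length as the input, whereas apply will happily return one value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [1.0, 4.0, 9.0], 'c': [16.0, 25.0, 36.0]})

# Element-wise function: both calls return a frame of the same shape.
applied = df.apply(np.sqrt)
transformed = df.transform(np.sqrt)
print(applied.equals(transformed))

# Reducing function: apply returns one value per column ...
reduced = df.apply('sum')
print(reduced)

# ... while transform is expected to reject a result of a different length.
try:
    df.transform('sum')
except ValueError as exc:
    print('transform refused to aggregate:', exc)
```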