Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13854476/

Time: 2020-09-13 20:32:04  Source: igfitidea

pandas' transform doesn't work sorting groupby output

python, aggregate, pandas

Asked by Robert Smith

Another pandas question.

Reading Wes McKinney's excellent book about data analysis and pandas, I encountered the following, which I thought should work:

Suppose I have some info about tips.

In [119]: tips.head()
Out[119]:
   total_bill   tip     sex  smoker  day    time  size   tip_pct
0       16.99  1.01  Female   False  Sun  Dinner     2  0.059447
1       10.34  1.66    Male   False  Sun  Dinner     3  0.160542
2       21.01  3.50    Male   False  Sun  Dinner     3  0.166587
3       23.68  3.31    Male   False  Sun  Dinner     2  0.139780
4       24.59  3.61  Female   False  Sun  Dinner     4  0.146808

I want to know the five largest tips relative to the total bill, that is, the five largest values of tip_pct, for smokers and non-smokers separately. This works:

def top(df, n=5, column='tip_pct'):
    # sort_values(by=...) is the current spelling of the older sort_index(by=...)
    return df.sort_values(by=column)[-n:]

In [101]: tips.groupby('smoker').apply(top)
Out[101]:
           total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker
False 88        24.71  5.85    Male   False  Thur   Lunch     2  0.236746
      185       20.69  5.00    Male   False   Sun  Dinner     5  0.241663
      51        10.29  2.60  Female   False   Sun  Dinner     2  0.252672
      149        7.51  2.00    Male   False  Thur   Lunch     2  0.266312
      232       11.61  3.39    Male   False   Sat  Dinner     2  0.291990
True  109       14.31  4.00  Female    True   Sat  Dinner     2  0.279525
      183       23.17  6.50    Male    True   Sun  Dinner     4  0.280535
      67         3.07  1.00  Female    True   Sat  Dinner     1  0.325733
      178        9.60  4.00  Female    True   Sun  Dinner     2  0.416667
      172        7.25  5.15    Male    True   Sun  Dinner     2  0.710345
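
For readers without the book's dataset, the same pattern can be tried on a tiny table. This is only a sketch: the six rows below are invented, n is lowered to 2, and the column selection before apply is an assumption made so that each group arrives without the grouping column. The sort-then-tail variant is an alternative spelling, not something from the question.

```python
import pandas as pd

# Hypothetical miniature of the tips table; values are invented for illustration.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0, 25.0, 8.0],
    "tip": [1.0, 4.0, 6.0, 10.0, 2.5, 4.0],
    "smoker": [False, False, False, True, True, True],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

def top(df, n=2, column="tip_pct"):
    # Sort ascending and keep the last n rows, i.e. the n largest values.
    return df.sort_values(by=column)[-n:]

# Select the value columns explicitly so each group is passed to top()
# as a DataFrame that does not contain the grouping column.
value_cols = ["total_bill", "tip", "tip_pct"]
by_apply = tips.groupby("smoker")[value_cols].apply(top)

# Alternative without apply: sort once, then take the tail of each group.
by_tail = tips.sort_values("tip_pct").groupby("smoker").tail(2)

print(by_apply)
print(by_tail)
```

Both results contain the same rows; by_apply carries the smoker key as an extra index level, while by_tail keeps the original flat index.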

Good enough, but then I wanted to use pandas' transform to do the same thing, like this:

def top_all(df):
    return df.sort_values(by='tip_pct')

tips.groupby('smoker').transform(top_all)

but instead I get this:

TypeError: Transform function invalid for data types

Why? I know that transform must return an array with the same dimensions as its input, so I thought I'd be complying with that requirement by just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?

Answered by BrenBarn

transform is not that well documented, but it seems that the transform function is passed not the entire group as a DataFrame, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine.

So suppose you call tips.groupby('smoker').transform(func). There will be two groups; call them group1 and group2. transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']), and so on. Here's an example:

>>> print(d)
   A  B  C
0 -2  5  4
1  1 -1  2
2  0  2  1
3 -3  1  2
4  5  0  2
>>> def foo(df):
...     # despite its name, df is a single column of one group here
...     print(">>>")
...     print(df)
...     print("<<<")
...     return df
>>> print(d.groupby('C').transform(foo))
>>>
2    0
Name: A
<<<
>>>
2    2
Name: B
<<<
>>>
1    1
3   -3
4    5
Name: A
<<<
>>>
1   -1
3    1
4    0
Name: B
# etc.

You can see that foo is first called with just the A column of the C=1 group of the original data frame, then the B column of that group, then the A column of the C=2 group, etc.
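
The calling pattern described above can be mimicked by hand with an explicit loop over groups and columns. This is a sketch of the pattern only, not of pandas internals: the exact calls transform makes may differ between pandas versions, and the calls list is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# The same frame as in the session above.
d = pd.DataFrame({
    "A": [-2, 1, 0, -3, 5],
    "B": [5, -1, 2, 1, 0],
    "C": [4, 2, 1, 2, 2],
})

# Record (group key, column name, values) in the order a per-column
# transform function would see them.
calls = []
for key, group in d.groupby("C"):
    for col in ["A", "B"]:
        calls.append((key, col, group[col].tolist()))

for key, col, values in calls:
    print(f"C={key}, column {col}: {values}")
```

Each entry is a single column of a single group, which is why a function that expects a tip_pct column to sort by cannot work when it is handed one column at a time.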

This makes sense if you think about what transform is for. It's meant for applying transform functions on the groups. But in general, these functions won't make sense when applied to the entire group, only to a given column. For instance, the example in the pandas docs is about z-standardizing using transform. If you have a DataFrame with columns for age and weight, it wouldn't make sense to z-standardize with respect to the overall mean of both these variables. It doesn't even mean anything to take the overall mean of a bunch of numbers, some of which are ages and some of which are weights. You have to z-standardize the age with respect to the mean age and the weight with respect to the mean weight, which means you want to transform separately for each column.
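
A minimal sketch of the z-standardizing case, assuming a made-up table with a hypothetical group column alongside age and weight. The zscore helper is illustrative and is not taken from the pandas docs.

```python
import pandas as pd

# Hypothetical data: two groups of three people each.
people = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
})

def zscore(x):
    # Standardize against the mean and standard deviation of x itself,
    # so each column is scaled by its own statistics.
    return (x - x.mean()) / x.std()

# Each column is standardized within each group; the result has the
# same shape and index as the selected columns.
z = people.groupby("group")[["age", "weight"]].transform(zscore)
print(z)
```

Ages are compared only with the mean age of their group and weights only with the mean weight, which is the per-column behavior the paragraph above argues for.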

So basically, you don't need to use transform here. apply is the appropriate function, because apply really does operate on each group as a single DataFrame, while transform operates on each column of each group.
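
What top_all was meant to do, sorting every group by tip_pct, can be sketched with apply as follows. The five rows are invented, and selecting the value columns before apply is an assumption made to keep the grouping column out of each group. The two-key sort_values call is a different technique that gives the same row order without groupby.

```python
import pandas as pd

# Hypothetical data for illustration only.
tips = pd.DataFrame({
    "smoker": [False, True, False, True, False],
    "tip": [3.0, 1.0, 2.0, 4.0, 2.5],
    "tip_pct": [0.30, 0.10, 0.20, 0.40, 0.25],
})

def top_all(df):
    # Receives a whole group as a DataFrame, so tip_pct is available.
    return df.sort_values(by="tip_pct")

sorted_groups = tips.groupby("smoker")[["tip", "tip_pct"]].apply(top_all)

# Alternative without groupby: sort by the group key, then by tip_pct.
sorted_flat = tips.sort_values(["smoker", "tip_pct"])

print(sorted_groups)
print(sorted_flat)
```

Every row is kept in both results; only the order within each group changes.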