Original question: http://stackoverflow.com/questions/21828398/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
What is the difference between pandas agg and apply function?
Asked by David D
I can't figure out the difference between the pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function,
and then use either .agg or .apply.
As you can see, the print statements within my function produce the same output
with .agg and .apply. The returned result, on the other hand, is different. Why is that?
import pandas as pd

# Load the iris dataset and group the rows by species
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')

def f(x):
    # Show what the function receives, then return a constant
    print(type(x))
    print(x.head(3))
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
#50           7.0          3.2           4.7          1.4  versicolor
#51           6.4          3.2           4.5          1.5  versicolor
#52           6.9          3.1           4.9          1.5  versicolor
#<class 'pandas.core.frame.DataFrame'>
#     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
#100           6.3          3.3           6.0          2.5  virginica
#101           5.8          2.7           5.1          1.9  virginica
#102           7.1          3.0           5.9          2.1  virginica
#Out[33]: 
#Species
#setosa        1
#versicolor    1
#virginica     1
#dtype: int64
Using agg
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
#50           7.0          3.2           4.7          1.4  versicolor
#51           6.4          3.2           4.5          1.5  versicolor
#52           6.9          3.1           4.9          1.5  versicolor
#<class 'pandas.core.frame.DataFrame'>
#     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
#100           6.3          3.3           6.0          2.5  virginica
#101           5.8          2.7           5.1          1.9  virginica
#102           7.1          3.0           5.9          2.1  virginica
#Out[34]: 
#           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
#Species                                                         
#setosa                 1            1             1            1
#versicolor             1            1             1            1
#virginica              1            1             1            1
Accepted answer by TomAugspurger
apply applies the function to each group (your Species) as a whole. Your function returns 1, so you end up with one value for each of the 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
Do read the groupby docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
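To make the distinction concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative, not taken from the iris data). It assumes a reasonably recent pandas, where a callable passed to agg is evaluated on each column of each group:

```python
import pandas as pd

# Hypothetical toy data standing in for the iris measurements
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
    "width": [0.5, 1.5, 1.0, 2.0],
})
grouped = df.groupby("species")[["length", "width"]]

# apply: the function receives each group's DataFrame, so it can combine columns
per_group = grouped.apply(lambda g: g["length"].sum() / g["width"].sum())

# agg: the function receives one column of one group at a time, as a Series
per_column = grouped.agg(lambda s: s.max() - s.min())

print(per_group)
print(per_column)
```

Here per_group is a Series with one value per species, while per_column is a DataFrame with one value per column per species.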
Answer by Surya
(Note: These comparisons are relevant for DataFrameGroupBy objects.)
Some plausible advantages of using .agg() compared to .apply(), for DataFrameGroupBy objects, would be:
- .agg() gives the flexibility of applying multiple functions at once, or passing a list of functions to each column.
- It also allows applying different functions to different columns of the dataframe in a single call.
That means you have fine-grained control over which operation is applied to each column.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
However, the apply function could be limited to applying one function to each column of the dataframe at a time. So you might have to call apply repeatedly to perform different operations on the same column.
Here are some example comparisons of .apply() vs .agg() for DataFrameGroupBy objects:
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]: 
   name  score_1  score_2  score_3
0   Foo        5       10       10
1  Baar       10       15       20
2   Foo       15       10       30
3  Baar       10       25       40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]: 
name  score_1
Baar  10         40
Foo   5          10
      15         10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]: 
name  score_1
Baar  10         15
Foo   5          10
      15         10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]: 
name  score_1
Baar  10         20.0
Foo   5          10.0
      15         10.0
Name: score_2, dtype: float64
Now, look at the same operations done effortlessly with .agg():
In [275]: import numpy as np
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]: 
              score_2 score_3               
             <lambda>     sum amin mean amax
name score_1                                
Baar 10            20      60   20   30   40
Foo  5             10      10   10   10   10
     15            10      30   30   30   30
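For comparison, the same summary could also be written with named aggregation and string function names, which avoids the NumPy dependency and gives flat, readable column labels. This is only a sketch: the output column names (score_2_mean and so on) are my own choice, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

# Each keyword is an output column: (input column, aggregation name)
summary = df.groupby(["name", "score_1"]).agg(
    score_2_mean=("score_2", "mean"),
    score_3_sum=("score_3", "sum"),
    score_3_min=("score_3", "min"),
    score_3_max=("score_3", "max"),
)
print(summary)
```

The result has one row per (name, score_1) pair and one flat column per requested statistic.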
So .agg() can be really handy for handling DataFrameGroupBy objects, compared to .apply(). But if you are handling only plain DataFrame objects and not DataFrameGroupBy objects, then apply() can be very useful, because apply() can apply a function along either axis of the dataframe.
(For example, axis=0 implies a column-wise operation with .apply(), which is the default, and axis=1 implies a row-wise operation, when dealing with plain DataFrame objects.)
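A small sketch of the two axis settings on a plain DataFrame (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# axis=0 (the default): the function receives each column as a Series
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)  # indexed by column label
print(row_sums)  # indexed by row label
```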
Answer by Martin Alexandersson
When using apply on a groupby, I have found that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
"...Thus the grouped columns(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
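One way to check this on your own installation is to record which columns the function passed to apply actually sees, and compare them with the columns of an agg result. This is a sketch on made-up data; record_columns is a hypothetical helper, and whether apply includes the grouping column depends on the pandas version, so no output is shown for that part.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 3]})

seen_by_apply = []

def record_columns(group):
    # Remember the column labels of the group passed in by apply
    seen_by_apply.append(list(group.columns))
    return group["val"].sum()

df.groupby("key").apply(record_columns)

# With agg, the grouping column becomes the index, not a result column
agg_columns = list(df.groupby("key").agg("sum").columns)

print(seen_by_apply[0])  # may or may not contain "key", depending on the pandas version
print(agg_columns)
```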
Answer by Kunal
The main difference between apply and aggregate is:
apply() -
    cannot be given a dict of per-column functions for all groups at once
    With apply() we first have to select a single group with get_group()
    Raises an error:
        iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
    Works fine, because the functions are applied to a single data frame:
        iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
agg() -
    can be applied to all groups at once
    With agg() we do not have to call get_group(), although that works too
    iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
    iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
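Since iris.csv may not be at hand, here is a rough, self-contained version of the same comparison on made-up data (the column names and the spec dict are illustrative). It catches a broad Exception because the exact error raised by passing a dict to groupby apply can differ between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
    "width": [0.5, 1.5, 1.0, 2.0],
})
spec = {"length": ["min", "max"], "width": ["mean", "min"]}

# agg accepts the dict for all groups at once ...
all_groups = df.groupby("species").agg(spec)

# ... and also for a single group selected with get_group
one_group = df.groupby("species").get_group("a").agg(spec)

# groupby apply expects a callable, so the dict is rejected
try:
    df.groupby("species").apply(spec)
    apply_failed = False
except Exception:
    apply_failed = True

print(all_groups)
print(one_group)
print(apply_failed)
```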

