What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?

Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/38935541/

Time: 2020-09-14 01:49:06  Source: igfitidea

What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?

python r pandas pandas-groupby summarize

Asked by B.Mr.W.

I'm having issues transitioning to pandas from R, where the dplyr package can easily group by and perform multiple summarizations.

Please help improve my existing Python pandas code for multiple aggregations:

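import pandas as pd

# Sample data: col1 is the grouping key; col2 and col3 are the values to aggregate.
data = pd.DataFrame(
    {'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
     'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0]
    }
)
result = []
# Loop over the groups and compute each aggregate by hand.
for k, v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))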

Issues:

  • too verbose
  • can probably be optimized for efficiency. (When I rewrote a for-loop groupby implementation using groupby.agg, the performance gain was huge.)

In R the equivalent code would be:

data %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))


UPDATE: @ayhan solved my question. Here is a follow-up question that I will post here instead of as a comment:

Q2) What is the equivalent of group_by() %>% summarize(newcolumn=max(col2 * col3)), i.e. an aggregation/summarization where the function is a compound function of 2+ columns?

Answered by ayhan

The equivalent of

df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

is

df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})

which returns

      col2  col3
col1            
1        5    -5
2        9    -9

The returned object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two ways to keep col1 as a column.

  • Pass as_index=False:

    df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
    
  • Call reset_index:

    df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
    

Both yield

   col1  col2  col3
0     1     5    -5
1     2     9    -9

You can also pass multiple functions to groupby.agg.

agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'], 
                                 'col3': ['size', 'std', 'mean', 'max']})

This also returns a DataFrame but now it has a MultiIndex for columns.

     col2               col3                   
      max min       std size       std mean max
col1                                           
1       5   1  1.581139    5  1.581139   -3  -1
2       9   0  3.535534    5  3.535534   -6   0

MultiIndex is very handy for selection and grouping. Here are some examples:

agg_df['col2']  # select all aggregates computed on col2
Out: 
      max  min       std
col1                    
1       5    1  1.581139
2       9    0  3.535534

agg_df[('col2', 'max')]  # select the maximum of the second column
Out: 
col1
1    5
2    9
Name: (col2, max), dtype: int64

agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
Out: 
      col2  col3
col1            
1        5    -1
2        9     0

Earlier (before version 0.20.0), it was possible to use dictionaries to rename the columns in the agg call. For example,

df.groupby('col1')['col2'].agg({'max_col2': 'max'})

would return the maximum of the second column as max_col2:

      max_col2
col1          
1            5
2            9

However, it was deprecated in favor of the rename method:

df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})

      col2_max
col1          
1            5
2            9

This can get verbose for a DataFrame like the agg_df defined above. In that case, you can flatten the column levels by joining them into a single name:

agg_df.columns = ['_'.join(col) for col in agg_df.columns]

      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1                                                                        
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0
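As an alternative to renaming after the fact, named aggregation may be the closest match to dplyr's summarize, because the output column names are chosen inside the agg call itself. The sketch below is not part of the original answer; it assumes pandas 0.25 or later and reuses the sample data from the question.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Named aggregation (assumes pandas >= 0.25):
# each keyword is output_name=(input_column, aggregation_function).
named = (
    df.groupby('col1')
      .agg(col2_agg=('col2', 'max'), col3_agg=('col3', 'min'))
      .reset_index()  # turn the col1 index back into a regular column
)
print(named)
```

The result has the columns col1, col2_agg and col3_agg, which mirrors the names used in the R call from the question.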

For operations like summarize(newcolumn=max(col2 * col3)) on grouped data, you can still use agg by first adding a new column with assign.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max') 

      col2  col3  new_col
col1                     
1        5    -1       -1
2        9     0        0

This returns the maximum of both the old and the new columns, but as always you can select just the column you need.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')

col1
1   -1
2    0
Name: new_col, dtype: int64

With groupby.apply this would be shorter:

df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())

col1
1   -1
2    0
dtype: int64

However, groupby.apply treats this as a custom function, so it is not vectorized. Up to now, the functions we passed to agg ('min', 'max', 'mean', 'size', etc.) are vectorized; the strings are aliases for those optimized functions. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.
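One way to keep the compound aggregation vectorized without adding a temporary column is to compute the product for the whole frame first and then group the resulting Series by the col1 values. This is a minimal sketch rather than part of the original answer; the variable name product_max is only illustrative.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Multiply the two columns once for the whole frame (vectorized), then group
# the resulting Series by the col1 values and take the per-group maximum.
product_max = (df['col2'] * df['col3']).groupby(df['col1']).max()
print(product_max)
```

The grouping key is passed as a Series, so the result is indexed by col1 just like the groupby.apply version, while the maximum is computed by the built-in groupby max.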

Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See the DataFrame.agg documentation for examples.
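A short illustration of calling agg on an ungrouped DataFrame, assuming pandas 0.20 or later and the sample data from the question. The variable names overall and summary are hypothetical.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# One function per column: the result is a Series indexed by column name.
overall = df.agg({'col2': 'max', 'col3': 'min'})
print(overall)

# Several functions for every selected column: the result is a DataFrame
# with one row per function and one column per input column.
summary = df[['col2', 'col3']].agg(['min', 'max'])
print(summary)
```

Here the aggregates are computed over all rows at once, so there is no col1 index in either result.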

Answered by Puriney

Check the side-by-side comparison given in the pandas documentation here: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing

R's dplyr

gdf <- group_by(df, col1)
summarise(gdf, avg=mean(col1, na.rm=TRUE))  

Pandas

gdf = df.groupby('col1')
df.groupby('col1').agg({'col1': 'mean'})