What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?

Question

Asked by B.Mr.W.

I'm having issues transitioning from R to pandas. In R, the dplyr package makes it easy to group by a column and perform multiple summarizations.

Please help improve my existing Python pandas code for multiple aggregations:

import pandas as pd

data = pd.DataFrame(
    {'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
     'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0]}
)

# Loop over the groups and build one result row per group by hand
result = []
for k, v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))

Issues:

Too verbose.
Probably can be optimized to be more efficient. (I rewrote a for-loop groupby implementation into groupby.agg and the performance improvement was huge.)

In R the equivalent code would be:

data %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

UPDATE: @ayhan solved my question. Here is a follow-up question that I will post here instead of as a comment:

Q2) What is the equivalent of groupby().summarize(newcolumn=max(col2 * col3)), i.e. an aggregation/summarization where the function is a compound function of 2+ columns?

Answer 1

Answered by ayhan

The equivalent of

df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))

is

df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})

which returns

      col2  col3
col1            
1        5    -5
2        9    -9
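
A self-contained sketch of the call above, using the data from the question. The variable name result and the final rename step are additions made here for illustration, only so that the output names match the dplyr call (col2_agg, col3_agg):

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# One aggregation function per column, keyed by column name
result = df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})

# Optional: rename the columns to the names used in the dplyr call
result = result.rename(columns={'col2': 'col2_agg', 'col3': 'col3_agg'})
print(result)
```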

The returned object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two alternatives to keep col1 as a column.

Pass as_index=False:

df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})

Call reset_index:

df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()

Both yield

col1  col2  col3
   1     5    -5
   2     9    -9
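
As a quick check, the sketch below runs both alternatives on the data from the question and compares their contents. The names with_flag and with_reset are made up for this example:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Alternative 1: keep the grouping column as a regular column
with_flag = df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})

# Alternative 2: aggregate first, then move the index back into a column
with_reset = df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()

# Compare column names and cell values of the two results
same_columns = list(with_flag.columns) == list(with_reset.columns)
same_values = with_flag.values.tolist() == with_reset.values.tolist()
print(same_columns, same_values)
```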

You can also pass multiple functions to groupby.agg.

agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'], 
                                 'col3': ['size', 'std', 'mean', 'max']})

This also returns a DataFrame but now it has a MultiIndex for columns.

     col2               col3                   
      max min       std size       std mean max
col1                                           
1       5   1  1.581139    5  1.581139   -3  -1
2       9   0  3.535534    5  3.535534   -6   0

MultiIndex is very handy for selection and grouping. Here are some examples:

agg_df['col2']  # select the second column
      max  min       std
col1                    
1       5    1  1.581139
2       9    0  3.535534

agg_df[('col2', 'max')]  # select the maximum of the second column
Out: 
col1
1    5
2    9
Name: (col2, max), dtype: int64

agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
Out: 
      col2  col3
col1            
1        5    -1
2        9     0

Earlier (before version 0.20.0) it was possible to use dictionaries for renaming the columns in the agg call. For example

df.groupby('col1')['col2'].agg({'max_col2': 'max'})

would return the maximum of the second column as max_col2:

      max_col2
col1          
1            5
2            9

However, it was deprecated in favor of the rename method:

df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})

      col2_max
col1          
1            5
2            9
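
As a hedged alternative to renaming afterwards: pandas 0.25 and later accept named aggregation, where each keyword argument to agg is the output column name and its value is a (column, function) pair. The sketch below assumes such a version is installed and reuses the data from the question. The output names col2_agg and col3_agg come from the dplyr call in the question, and the variable name named is made up here:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Named aggregation (assumes pandas >= 0.25):
# output_name=(input_column, aggregation_function)
named = df.groupby('col1').agg(
    col2_agg=('col2', 'max'),
    col3_agg=('col3', 'min'),
)
print(named)
```

This reads close to the dplyr summarize call, since the output name and the aggregation are declared together.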

Renaming can get verbose for a DataFrame like agg_df defined above. In that case, you can flatten the column levels by joining them into single names:

agg_df.columns = ['_'.join(col) for col in agg_df.columns]

      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1                                                                        
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0

For operations like groupby().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max') 

      col2  col3  new_col
col1                     
1        5    -1       -1
2        9     0        0

This returns maximum for old and new columns but as always you can slice that.

df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')

col1
1   -1
2    0
Name: new_col, dtype: int64

With groupby.apply this would be shorter:

df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())

col1
1   -1
2    0
dtype: int64

However, groupby.apply treats this as a custom function, so it is not vectorized. Up to now, the functions we passed to agg ('min', 'max', 'std', 'size', etc.) are vectorized, and these strings are aliases for those optimized functions. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.
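
A minimal sketch comparing the two routes for max(col2 * col3) on the data from the question. The names via_agg and via_apply are made up for illustration, and the apply call selects col2 and col3 explicitly, which is an assumption made here so that the lambda only ever sees those two columns. Both routes should give the same values; only the assign route goes through the built-in max:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Route 1: compute the product once for the whole frame,
# then use the built-in groupby max on the new column
via_agg = (
    df.assign(new_col=df['col2'] * df['col3'])
      .groupby('col1')['new_col']
      .max()
)

# Route 2: custom function, called once per group
via_apply = (
    df.groupby('col1')[['col2', 'col3']]
      .apply(lambda g: (g['col2'] * g['col3']).max())
)

print(via_agg.tolist() == via_apply.tolist())
```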

Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See the DataFrame.agg documentation for examples.
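
A short sketch of calling agg on a DataFrame without grouping, again on the data from the question. The names per_column and min_max are made up for this example:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# One function per column: the result is a Series indexed by column name
per_column = df.agg({'col2': 'max', 'col3': 'min'})

# Several functions for every selected column: the result is a DataFrame
# with one row per function
min_max = df[['col2', 'col3']].agg(['min', 'max'])

print(per_column)
print(min_max)
```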

Answer 2

Answered by Puriney

Check the side-by-side comparison given by Pandas documentation here: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing

R's dplyr

gdf <- group_by(df, col1)
summarise(gdf, avg=mean(col1, na.rm=TRUE))

Pandas

gdf = df.groupby('col1')
df.groupby('col1').agg({'col1': 'mean'})
