What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?
Original source: http://stackoverflow.com/questions/38935541/
Note: this page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by B.Mr.W.
I'm having trouble transitioning from R to pandas. In R, the dplyr package makes it easy to group by a column and perform multiple summarizations.
Please help improve my existing Python pandas code for multiple aggregations:
import pandas as pd

data = pd.DataFrame(
    {'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
     'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0]
    }
)

# Loop over the groups and collect one summary row per group
result = []
for k, v in data.groupby('col1'):
    result.append([k, max(v['col2']), min(v['col3'])])
print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))
Issues:
- too verbose
- can probably be optimized and made more efficient. (I rewrote a for-loop groupby implementation into groupby.agg and the performance improvement was huge.)
In R the equivalent code would be:
data %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
UPDATE: @ayhan solved my question. Here is a follow-up question that I will post here instead of as a comment:
Q2) What is the equivalent of group_by().summarize(newcolumn=max(col2 * col3)), i.e. an aggregation/summarization where the function is a compound function of 2+ columns?
Answered by ayhan
The equivalent of
df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
is
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})
which returns
      col2  col3
col1
1        5    -5
2        9    -9
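As a self-contained sketch of the call above, the following rebuilds the question's sample data (the names df and summary are chosen here only for illustration) and runs the dictionary form of agg on it:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# One aggregation per column: max of col2 and min of col3 within each col1 group
summary = df.groupby("col1").agg({"col2": "max", "col3": "min"})
print(summary)
```

The grouping column ends up as the index of summary, which is why col1 is printed on its own header row.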
The returned object is a pandas.DataFrame with an index called col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two ways to keep col1 as a column.
Pass as_index=False:
df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
Call reset_index:
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
Both yield
   col1  col2  col3
0     1     5    -5
1     2     9    -9
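A minimal sketch comparing the two alternatives on the question's sample data; the variable names via_flag and via_reset are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})
spec = {"col2": "max", "col3": "min"}

# Alternative 1: ask groupby not to move col1 into the index
via_flag = df.groupby("col1", as_index=False).agg(spec)

# Alternative 2: aggregate first, then move the index back into a column
via_reset = df.groupby("col1").agg(spec).reset_index()

print(via_flag)
print(via_reset)
```

Both frames should carry col1 as an ordinary column alongside col2 and col3.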
You can also pass multiple functions to groupby.agg.
agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'],
'col3': ['size', 'std', 'mean', 'max']})
This also returns a DataFrame but now it has a MultiIndex for columns.
     col2                  col3
      max min       std    size       std mean max
col1
1       5   1  1.581139       5  1.581139   -3  -1
2       9   0  3.535534       5  3.535534   -6   0
MultiIndex is very handy for selection and grouping. Here are some examples:
agg_df['col2']  # select the second column
Out:
      max  min       std
col1
1       5    1  1.581139
2       9    0  3.535534
agg_df[('col2', 'max')]  # select the maximum of the second column
Out:
col1
1    5
2    9
Name: (col2, max), dtype: int64
agg_df.xs('max', axis=1, level=1)  # select the maximum of all columns
Out:
      col2  col3
col1
1        5    -1
2        9     0
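The three selections can be tried end to end with a sketch like the one below. It uses a reduced set of aggregation functions to keep the output small; the names second, col2_max and all_max are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# A list of functions per column produces two-level (MultiIndex) columns
agg_df = df.groupby("col1").agg({"col2": ["max", "min"], "col3": ["mean", "max"]})

second = agg_df["col2"]                        # every aggregate of col2
col2_max = agg_df[("col2", "max")]             # a single (column, function) pair
all_max = agg_df.xs("max", axis=1, level=1)    # the 'max' aggregate of every column

print(second)
print(col2_max)
print(all_max)
```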
Earlier (before version 0.20.0) it was possible to use dictionaries to rename the columns in the agg call. For example,
df.groupby('col1')['col2'].agg({'max_col2': 'max'})
would return the maximum of the second column as max_col2:
max_col2
col1
1 5
2 9
However, it was deprecated in favor of the rename method:
df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})
col2_max
col1
1 5
2 9
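Beyond the rename approach described in this answer, later pandas releases offer named aggregation, which sets the output names inside the agg call itself. The sketch below assumes pandas 0.25 or later (the answer does not cover it), and reuses the output names col2_agg and col3_agg from the question:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Named aggregation (assumes pandas >= 0.25):
# output_name=(input_column, function)
named = df.groupby("col1").agg(
    col2_agg=("col2", "max"),
    col3_agg=("col3", "min"),
)
print(named)
```

This reads much like the dplyr summarize call in the question, since each keyword names one output column.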
Renaming this way can get verbose for a DataFrame like agg_df defined above. In that case, you can flatten the column levels by joining them:
agg_df.columns = ['_'.join(col) for col in agg_df.columns]
      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0
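A short runnable sketch of the flattening step, using fewer aggregation functions than the answer so the result is easy to check by eye:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

agg_df = df.groupby("col1").agg({"col2": ["max", "min"], "col3": ["max"]})

# Each column label is a tuple such as ('col2', 'max'); join its parts with '_'
agg_df.columns = ["_".join(col) for col in agg_df.columns]
print(agg_df)
```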
For operations like group_by().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max')
      col2  col3  new_col
col1
1        5    -1       -1
2        9     0        0
This returns maximum for old and new columns but as always you can slice that.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')
col1
1 -1
2 0
Name: new_col, dtype: int64
With groupby.apply this would be shorter:
df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())
col1
1 -1
2 0
dtype: int64
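The two routes can be compared side by side with the sketch below. As an assumption that differs slightly from the answer, the apply variant first selects col2 and col3 so that the grouping column is not passed to the lambda, which avoids a warning in some recent pandas releases. The names via_assign and via_apply are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# Route 1: compute the product column up front, then use the built-in max
via_assign = (
    df.assign(new_col=df["col2"] * df["col3"])
      .groupby("col1")["new_col"]
      .agg("max")
)

# Route 2: compute the product inside a custom function, once per group
via_apply = (
    df.groupby("col1")[["col2", "col3"]]
      .apply(lambda g: (g["col2"] * g["col3"]).max())
)

print(via_assign)
print(via_apply)
```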
However, groupby.apply treats this as a custom function, so it is not vectorized. The functions we have passed to agg so far ('min', 'max', 'size', etc.) are vectorized; these strings are aliases for optimized implementations. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.
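As a small check of the alias claim, this sketch compares the string alias with the dedicated method and with a custom lambda. It only confirms that the results agree; it does not measure speed. The variable names are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

grouped = df.groupby("col1")

by_alias = grouped.agg("min")    # string alias for the built-in aggregation
by_method = grouped.min()        # dedicated groupby method

# A custom function gives the same values but is called once per group and column
by_lambda = grouped.agg(lambda s: s.min())

print(by_alias)
```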
Lastly, as of version 0.20, agg can be used on DataFrames directly, without having to group first.
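A minimal sketch of agg called directly on a DataFrame, assuming pandas 0.20 or later as stated above; the names overall and table are illustrative only:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "col3": [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0],
})

# One function per column, no grouping: the result is a Series indexed by column name
overall = df.agg({"col2": "max", "col3": "min"})

# A list of functions applied to every column: one result row per function
table = df[["col2", "col3"]].agg(["min", "max"])

print(overall)
print(table)
```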
Answered by Puriney
Check the side-by-side comparison given by Pandas documentation here: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing
R's dplyr
gdf <- group_by(df, col1)
summarise(gdf, avg=mean(col1, na.rm=TRUE))
Pandas
gdf = df.groupby('col1')
df.groupby('col1').agg({'col1': 'mean'})