Original URL: http://stackoverflow.com/questions/22105452/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
What is the equivalent of SQL "GROUP BY HAVING" on Pandas?
Asked by Mannaggia
What would be the most efficient way to use groupby and apply a filter at the same time in pandas?
Basically, I am asking for the pandas equivalent of the following SQL:
select *
...
group by col_name
having condition
I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful.
I need very good performance, so ideally such a command would not be the result of several layered operations done in Python.
Accepted answer by Andy Hayden
As mentioned in unutbu's comment, groupby's filter is the equivalent of SQL's HAVING:
In [10]: import pandas as pd
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
   A  B
0  1  2
1  1  3
2  5  6
In [13]: g = df.groupby('A')  # GROUP BY A
In [14]: g.filter(lambda x: len(x) > 1)  # HAVING COUNT(*) > 1
Out[14]:
   A  B
0  1  2
1  1  3
You can write more complicated functions (these are applied to each group), provided they return a plain ol' bool:
In [15]: g.filter(lambda x: x['B'].sum() == 5)
Out[15]:
   A  B
0  1  2
1  1  3
Note: there is potentially a bug where you can't write your function to act on the columns you've used to group by. A workaround is to group by the column manually, i.e. g = df.groupby(df['A']).
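Since the question asks about performance, here is a minimal sketch of an alternative to the filter call with the sum condition above. It is not part of the original answer: it swaps in groupby's transform to build a boolean mask, which avoids calling a Python lambda once per group. Whether it is actually faster depends on the data, so it is worth timing on your own DataFrame. The names via_filter, mask and via_mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=["A", "B"])

# HAVING SUM(B) = 5, written with filter (the lambda runs once per group)
via_filter = df.groupby("A").filter(lambda x: x["B"].sum() == 5)

# Same condition as a boolean mask: transform broadcasts each group's
# sum back onto the rows of that group, so the comparison is row-wise
mask = df.groupby("A")["B"].transform("sum") == 5
via_mask = df[mask]

# Both approaches keep the rows with index 0 and 1
print(via_mask)
```

Both variants return rows of the original DataFrame in their original order, so the results can be compared directly with via_filter.equals(via_mask).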
Answer by Golden Lion
I group by state and county, test whether each group's maximum of field1 is greater than 20, then use loc to keep only the groups where the result is True:
counties = df.groupby(['state', 'county'])['field1'].max() > 20  # boolean Series, one entry per (state, county)
counties = counties.loc[counties]  # keep only the groups where the condition is True
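The snippet above ends with a boolean Series indexed by (state, county), not with rows of df. As a rough sketch of how the same HAVING MAX(field1) > 20 condition could return either the qualifying groups with their maxima or the matching original rows, assuming a small made-up DataFrame (the data values and the names group_max, qualifying and rows are hypothetical):

```python
import pandas as pd

# Hypothetical data; the column names follow the answer above
df = pd.DataFrame({
    "state": ["TX", "TX", "TX", "CA", "CA"],
    "county": ["Travis", "Travis", "Harris", "Kern", "Kern"],
    "field1": [10, 25, 5, 30, 8],
})

# Per-group maximum, then keep only the groups whose maximum exceeds 20
group_max = df.groupby(["state", "county"])["field1"].max()
qualifying = group_max[group_max > 20]

# Rows of df that belong to a qualifying group: transform broadcasts
# each group's maximum back onto the rows of that group
rows = df[df.groupby(["state", "county"])["field1"].transform("max") > 20]

print(qualifying)
print(rows)
```

Here qualifying plays the role of the aggregated SELECT ... GROUP BY ... HAVING result, while rows keeps every original record from the groups that pass the condition.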

