Python Pandas groupby: filter according to condition on values
Original question: http://stackoverflow.com/questions/42240476/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Adam
Consider a dataframe like the following.
import pandas as pd
# Initialize dataframe
df1 = pd.DataFrame(columns=['bar', 'foo'])
df1['bar'] = ['001', '001', '001', '001', '002', '002', '003', '003', '003']
df1['foo'] = [-1, 0, 2, 3, -8, 1, 0, 1, 2]
>>> print(df1)
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
4  002   -8
5  002    1
6  003    0
7  003    1
8  003    2
# Lower and upper bound for desired range
lower_bound = -5
upper_bound = 5
I would like to use groupby in Pandas to return a dataframe that filters out rows whose bar meets a condition. In particular, I would like to filter out every row with a given bar if any of the values of foo for that bar is not between lower_bound and upper_bound.
In the above example, rows with bar = 002 should be filtered out, since not all of the rows with bar = 002 contain a value of foo between -5 and 5 (namely, row index 4 contains foo = -8). The desired output for this example is the following.
# Desired output
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
6  003    0
7  003    1
8  003    2
I have tried the following approach.
# Attempted solution
grouped = df1.groupby('bar')['foo']
grouped.filter(lambda x: x < lower_bound or x > upper_bound)
However, this yields a TypeError: the filter must return a boolean result. Furthermore, this approach might return a groupby object, whereas I want the result to be a dataframe object.
Accepted answer by Psidom
With pandas you will most likely not use and and or, but the vectorized & and | operators, which work element by element. For your case, apply the all() function to that element-wise result inside the filter so that each group yields a single boolean. This keeps every bar for which all corresponding foo values are between lower_bound and upper_bound:
df1.groupby('bar').filter(lambda x: ((x.foo >= lower_bound) & (x.foo <= upper_bound)).all())
#    bar  foo
# 0  001   -1
# 1  001    0
# 2  001    2
# 3  001    3
# 6  003    0
# 7  003    1
# 8  003    2
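The same condition can also be written with Series.between, which is inclusive on both ends by default. The following is only a minimal sketch and not part of the original answer. It rebuilds the example data from the question, and the names g and kept are illustrative.

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-1, 0, 2, 3, -8, 1, 0, 1, 2],
})
lower_bound = -5
upper_bound = 5

# The lambda receives each group's sub-dataframe and must return one boolean.
# between() gives an element-wise boolean Series; all() reduces it to a single value.
kept = df1.groupby('bar').filter(
    lambda g: g['foo'].between(lower_bound, upper_bound).all()
)
print(kept)
```

Because filter is called on the grouped dataframe rather than on the grouped foo column, the result is a dataframe with both columns and the original row index.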
Answer by mrPuzzle
Psidom's answer works fine, but can be slow on large datasets. Mine is somewhat of a workaround, but it is fast.
df1['conditions_apply'] = (df1.foo >= lower_bound) & (df1.foo <= upper_bound)
selection = df1.groupby('bar')['conditions_apply'].min() # any False will return False
selection = selection[selection].index.tolist() # get all bars with Trues
df1 = df1[df1.bar.isin(selection)] # make selection
df1 = df1.drop(columns=['conditions_apply']) # drop the helper column; reassigning avoids an in-place edit on a filtered frame
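A related variant builds the row mask directly with transform, so no helper column has to be added and dropped again. This is only a sketch under the same example data, not something from the original answers, and it has not been benchmarked here. The names in_range, keep_row and result are illustrative.

```python
import pandas as pd

# Example data from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-1, 0, 2, 3, -8, 1, 0, 1, 2],
})
lower_bound = -5
upper_bound = 5

# Element-wise check: is each foo value inside the inclusive range?
in_range = df1['foo'].between(lower_bound, upper_bound)

# transform('all') computes all() per bar group and broadcasts it back to every row,
# so keep_row has the same index as df1.
keep_row = in_range.groupby(df1['bar']).transform('all')

result = df1[keep_row]
print(result)
```

Since keep_row is aligned with df1's index, it can be used as an ordinary boolean mask, and df1 itself is left unmodified.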