Python Pandas groupby: filter according to condition on values

Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverFlow question: http://stackoverflow.com/questions/42240476/

Time: 2020-09-14 02:59:18  Source: igfitidea

python, pandas

Asked by Adam

Consider a dataframe like the following.

import pandas as pd

# Initialize dataframe
df1 = pd.DataFrame(columns=['bar', 'foo'])
df1['bar'] = ['001', '001', '001', '001', '002', '002', '003', '003', '003']
df1['foo'] = [-1, 0, 2, 3, -8, 1, 0, 1, 2]
>>> print(df1)
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
4  002   -8
5  002    1
6  003    0
7  003    1
8  003    2

# Lower and upper bound for desired range
lower_bound = -5
upper_bound = 5

I would like to use groupby in Pandas to return a dataframe that filters out rows whose bar meets a condition. In particular, I would like to filter out all rows with a given bar if any of the values of foo for that bar is not between lower_bound and upper_bound.

In the above example, rows with bar = 002 should be filtered out since not all of the rows with bar = 002 contain a value of foo between -5 and 5 (namely, row index 4 contains foo = -8). The desired output for this example is the following.

# Desired output
   bar  foo
0  001   -1
1  001    0
2  001    2
3  001    3
6  003    0
7  003    1
8  003    2

I have tried the following approach.

# Attempted solution
grouped = df1.groupby('bar')['foo']
grouped.filter(lambda x: x < lower_bound or x > upper_bound)

However, this yields a TypeError: the filter must return a boolean result. Furthermore, this approach might return a groupby object, when I want the result to return a dataframe object.

Accepted answer by Psidom

With pandas you will most likely not use and and or, but the vectorized & and | instead. For your case, apply the all() function inside the filter to construct the boolean condition; this keeps each bar for which all corresponding foo values are between lower_bound and upper_bound:

df1.groupby('bar').filter(lambda x: ((x.foo >= lower_bound) & (x.foo <= upper_bound)).all())

#   bar foo
#0  001 -1
#1  001  0
#2  001  2
#3  001  3
#6  003  0
#7  003  1
#8  003  2
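
To see why the vectorized operators matter here, the following is a minimal sketch; the names s, mask, raised and in_range are illustrative and not part of the original answer. Python's or asks each operand for a single truth value, which a Series with several elements cannot supply, whereas & combines the two comparisons element by element, and all() then collapses the result to the single boolean that filter expects.

```python
import pandas as pd

s = pd.Series([-1, 0, 2, 3, -8])  # illustrative values, not from the answer
lower_bound, upper_bound = -5, 5

# `or` needs one True/False per operand, but a Series holds many values
try:
    s < lower_bound or s > upper_bound
    raised = False
except ValueError:
    raised = True  # pandas refuses to reduce the Series to a single truth value

# `&` works elementwise and yields a boolean Series
mask = (s >= lower_bound) & (s <= upper_bound)

# all() reduces the Series to one boolean, which is what filter() requires
in_range = mask.all()
```

Here in_range comes out False, because -8 lies outside the bounds.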

Answer by mrPuzzle

Psidom's answer works fine, but can be slow on large datasets. Mine is somewhat of a workaround, but it is fast.

df1['conditions_apply'] = (df1.foo >= lower_bound) & (df1.foo <= upper_bound)
selection = df1.groupby('bar')['conditions_apply'].min()  # any False will return False
selection = selection[selection].index.tolist()           # get all bars with Trues
df1 = df1[df1.bar.isin(selection)]                        # make selection
df1 = df1.drop(columns=['conditions_apply'])              # drop newly made column (no inplace on a filtered frame)
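
A related variant, which is not taken from either answer and is offered only as a sketch, avoids the temporary column by broadcasting a per-group "all in range" flag back onto the rows with transform. It assumes the bounds are inclusive, as in the answers above; the names in_range, keep and result are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-1, 0, 2, 3, -8, 1, 0, 1, 2],
})
lower_bound, upper_bound = -5, 5

# True for rows whose foo lies in [lower_bound, upper_bound] (both ends inclusive)
in_range = df1['foo'].between(lower_bound, upper_bound)

# For each bar, check whether every row is in range, then copy that
# per-group answer back onto each row of the group
keep = in_range.groupby(df1['bar']).transform('all')

# Boolean indexing returns a new dataframe; df1 itself is left unchanged
result = df1[keep]
```

Like the workaround above, this stays within vectorized pandas operations and does not call a Python lambda once per group, so it may be worth timing on large data.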