pandas: Dynamically filtering a pandas DataFrame

Question

Asked by ahoosh

I am trying to filter a pandas DataFrame using thresholds on three columns:

import pandas as pd
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
df = df.loc[(df.A > 0) & (df.B > 2) & (df.C > -1)].reset_index(drop = True)

df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

However, I want to do this inside a function where the column names and their thresholds are given to me in a dictionary. Here's my first try, which works OK. Essentially, I put the filter inside the cond variable and just run it:

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = "df = df.loc["
for key in limits_dic.keys():
    cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
cond = cond[:-2] + "].reset_index(drop = True)"
exec(cond)
df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

Now, finally, I put everything inside a function and it stops working (perhaps the exec function does not like to be used inside a function!):

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
def filtering(df, limits_dic):
    cond = "df = df.loc["
    for key in limits_dic.keys():
        cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
    cond = cond[:-2] + "].reset_index(drop = True)"
    exec(cond)
    return(df)

df = filtering(df, limits_dic)
df
    A  B  C
0   6  2 -5
1   2  5  2
2  10  3  1
3  -5  2  8
4   3  6  2

I know that the exec function acts differently when used inside a function, but I was not sure how to address the problem. Also, I suspect there must be a more elegant way to define a function that does the filtering given two inputs: 1) df and 2) limits_dic = {"A" : 0, "B" : 2, "C" : -1}. I would appreciate any thoughts on this.
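On the exec behavior mentioned above: in Python 3, exec cannot rebind a function's local variables, so the assignment to df inside the executed string never reaches the df that the function returns. A minimal sketch of one workaround, assuming you want to keep exec at all, is to run the statement in an explicit namespace dictionary and read the result back out. The names filtering_with_exec, namespace, and result are hypothetical, and the answers below avoid exec entirely.

```python
import pandas as pd

def filtering_with_exec(df, limits_dic):
    # Build the same boolean expression the question builds by hand.
    cond = " & ".join(f"(df.{key} > {val})" for key, val in limits_dic.items())
    # exec() cannot rebind this function's locals, so execute the statement
    # in an explicit namespace and fetch the result from that dictionary.
    namespace = {"df": df}
    exec(f"result = df.loc[{cond}].reset_index(drop=True)", namespace)
    return namespace["result"]

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}

filtered = filtering_with_exec(df, limits_dic)
print(filtered)
```

Building code from strings and executing it is risky if the dictionary keys or values come from untrusted input, which is one more reason to prefer the approaches in the answers.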

Answer 1

Answered by cs95

If you're trying to build a dynamic query, there are easier ways. Here's one using a list comprehension and str.join:

query = ' & '.join(['{}>{}'.format(k, v) for k, v in limits_dic.items()])

Or, using f-strings with Python 3.6+:

query = ' & '.join([f'{k}>{v}' for k, v in limits_dic.items()])

print(query)

A>0 & B>2 & C>-1

Pass the query string to df.query, which is meant for exactly this purpose:

out = df.query(query)
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2
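To connect this back to the question, here is a minimal sketch of how the query approach might be wrapped in a function. The name filter_by_limits is hypothetical. It assumes a non-empty dictionary whose keys are column names that are valid Python identifiers, and the reset_index(drop=True) call simply mirrors the question's original code.

```python
import pandas as pd

def filter_by_limits(df, limits_dic):
    # Build e.g. "A > 0 & B > 2 & C > -1" from the dictionary.
    query = " & ".join(f"{col} > {limit}" for col, limit in limits_dic.items())
    # df.query evaluates the string against the frame's columns.
    return df.query(query).reset_index(drop=True)

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}

filtered = filter_by_limits(df, limits_dic)
print(filtered)
```

Because df.query returns a new DataFrame, the caller's original frame is left untouched.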

You could also use df.eval if you want to obtain a boolean mask for your query; indexing becomes straightforward after that:

mask = df.eval(query)
print(mask)

0    False
1     True
2     True
3    False
4     True
dtype: bool

out = df[mask]
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2
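One reason to prefer the mask over calling df.query directly is that the mask can be reused. The sketch below, which rebuilds the same data and query so that it runs on its own, counts the matching rows and also selects the rows that failed the filter. The variable names kept and dropped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}
query = " & ".join(f"{k}>{v}" for k, v in limits_dic.items())

mask = df.eval(query)        # boolean Series, one entry per row
n_matches = int(mask.sum())  # True counts as 1
kept = df[mask]              # rows that pass every threshold
dropped = df[~mask]          # rows that fail at least one threshold

print(n_matches)
print(dropped)
```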

String Data


If you need to query columns that use string data, the code above will need a slight modification.

Consider (data from this answer):

df = pd.DataFrame({'gender':list('MMMFFF'),
                   'height':[4,5,4,5,5,4],
                   'age':[70,80,90,40,2,3]})

print (df)
  gender  height  age
0      M       4   70
1      M       5   80
2      M       4   90
3      F       5   40
4      F       5    2
5      F       4    3

And a list of columns, operators, and values:

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

The appropriate modification here is:

query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)

  gender  height  age
3      F       5   40
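To see why repr matters here, it can help to print the query string that actually gets built. The sketch below uses the same data and lists as above; the variable names with_repr and without_repr are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'gender': list('MMMFFF'),
                   'height': [4, 5, 4, 5, 5, 4],
                   'age': [70, 80, 90, 40, 2, 3]})

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

triples = list(zip(column, equal, condition))

# repr() leaves numbers as they are but wraps strings in quotes.
with_repr = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in triples)
# Without repr(), the bare F would be read as a name rather than a string.
without_repr = ' & '.join(f'{i} {j} {k}' for i, j, k in triples)

print(with_repr)     # height > 1.68 & age > 20 & gender == 'F'
print(without_repr)  # height > 1.68 & age > 20 & gender == F

result = df.query(with_repr)
```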

For information on the pd.eval() family of functions, their features, and their use cases, please see "Dynamic Expression Evaluation in pandas using pd.eval()".

Answer 2

Answered by Victor Yan

An alternative to @coldspeed's version:

conditions = None
for key, val in limits_dic.items():
    cond = df[key] > val
    if conditions is None:
        conditions = cond
    else:
        conditions = conditions & cond
print(df[conditions])
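The loop above can also be sketched as a function. This variant, with the hypothetical name filter_with_mask, starts from an all-True mask instead of None, so an empty dictionary simply returns every row rather than failing on df[None]. That starting value is a design choice of this sketch, not part of the answer above.

```python
import pandas as pd

def filter_with_mask(df, limits_dic):
    # Start with every row selected, then narrow down per column.
    mask = pd.Series(True, index=df.index)
    for key, val in limits_dic.items():
        mask &= df[key] > val
    return df[mask]

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}

filtered = filter_with_mask(df, limits_dic)
unfiltered = filter_with_mask(df, {})
print(filtered)
```

Since this indexes with df[key], it also works for column names that are not valid Python identifiers, which the string-based query approach does not handle without extra quoting.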

Answer 3

Answered by yvesva

An alternative to both posted answers, which may or may not be more Pythonic:

import pandas as pd
import operator
from functools import reduce

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})

limits_dic = {"A": 0, "B": 2, "C": -1}

# equivalent to [df['A'] > 0, df['B'] > 2, df['C'] > -1]
loc_elements = [df[key] > val for key, val in limits_dic.items()]

df = df.loc[reduce(operator.and_, loc_elements)]
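The same reduce pattern can be extended to mixed operators, as in the string-data example from the first answer, without building a query string. This is only a sketch: the OPS table, the filter_by_rules name, and the (column, operator, value) rule format are assumptions made for illustration, and it expects at least one rule because reduce raises on an empty list.

```python
import operator
from functools import reduce

import pandas as pd

# Map operator symbols to the corresponding functions in the operator module.
OPS = {">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le,
       "==": operator.eq, "!=": operator.ne}

def filter_by_rules(df, rules):
    # Each rule is a (column, operator symbol, value) triple.
    masks = [OPS[op](df[col], value) for col, op, value in rules]
    # Combine all boolean masks with elementwise AND.
    return df.loc[reduce(operator.and_, masks)]

people = pd.DataFrame({"gender": list("MMMFFF"),
                       "height": [4, 5, 4, 5, 5, 4],
                       "age": [70, 80, 90, 40, 2, 3]})
rules = [("height", ">", 1.68), ("age", ">", 20), ("gender", "==", "F")]

selected = filter_by_rules(people, rules)
print(selected)
```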
