pandas: best way to subset a pandas DataFrame

Question

Asked by Pierre-Eric Garcia

Hey I'm new to Pandas and I just came across df.query().

Why would people use df.query() when you can filter your DataFrames directly using bracket notation? The official pandas tutorial also seems to prefer the latter approach.

With bracket notation:

df[df['age'] <= 21]

With the pandas query method:

df.query('age <= 21')

Besides the stylistic and flexibility differences that have been mentioned, is one of them canonically preferred, particularly for performance on large DataFrames?

Answer 1

Accepted answer by MaxU

Consider the following sample DF:

In [307]: df
Out[307]:
  sex  age     name
0   M   40      Max
1   F   35     Anna
2   M   29      Joe
3   F   18    Maria
4   F   23  Natalie

There are quite a few good reasons to prefer the .query() method.

It can be much shorter and cleaner than boolean indexing:

In [308]: df.query("20 <= age <= 30 and sex=='F'")
Out[308]:
  sex  age     name
4   F   23  Natalie

In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
Out[309]:
  sex  age     name
4   F   23  Natalie

You can prepare conditions (queries) programmatically:

In [315]: conditions = {'name':'Joe', 'sex':'M'}

In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])

In [317]: q
Out[317]: 'name=="Joe" and sex=="M"'

In [318]: df.query(q)
Out[318]:
  sex  age name
2   M   29  Joe
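
The string-building idea above quotes every value, so it only works for string columns. A minimal sketch of one way around that, assuming a hypothetical helper named build_query (not part of pandas) and a reconstruction of the sample DF, is to format values with repr() so that strings are quoted and numbers are not:

```python
import pandas as pd

# Reconstruction of the sample DF shown earlier in this answer.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "F"],
    "age": [40, 35, 29, 18, 23],
    "name": ["Max", "Anna", "Joe", "Maria", "Natalie"],
})


def build_query(conditions):
    """Hypothetical helper: AND together equality tests.

    repr() wraps strings in quotes and leaves numbers bare,
    so mixed string/numeric conditions produce a valid expression.
    """
    return " and ".join("{} == {!r}".format(k, v) for k, v in conditions.items())


q = build_query({"sex": "F", "age": 23})
result = df.query(q)
print(q)
print(result)
```

This is only suitable for trusted keys and values, since the query string is evaluated as an expression.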

PS: there are also some disadvantages:

We can't use the .query() method directly for columns whose names contain spaces or consist only of digits (since pandas 0.25, names containing spaces can be quoted with backticks).
Not all functions can be applied, and in some cases we have to use engine='python' instead of the default engine='numexpr' (which is faster).
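
A minimal sketch of both limitations, using a made-up frame whose column names ("home town", "name", "age") are assumptions for illustration. It assumes pandas 0.25 or later for the backtick quoting:

```python
import pandas as pd

# Made-up data; the column "home town" deliberately contains a space.
df = pd.DataFrame({
    "name": ["Max", "Anna", "Joe", "Maria"],
    "age": [40, 35, 29, 18],
    "home town": ["Berlin", "Paris", "Paris", "Rome"],
})

# Backticks quote a column name that is not a valid Python identifier.
parisians = df.query("`home town` == 'Paris'")

# String accessor methods are an example of something that may need
# engine='python', which evaluates the expression without numexpr.
starts_with_m = df.query("name.str.startswith('M')", engine="python")

print(parisians)
print(starts_with_m)
```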

NOTE: Jeff (one of the main Pandas contributors and a member of Pandas core team) once said:

Note that in reality .query is just a nice-to-have interface, in fact it has very specific guarantees, meaning its meant to parse like a query language, and not a fully general interface.

Answer 2

Answer by Tai

Some other interesting usages in the documentation.

Reusable

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying -- (Source)

Example:

dfA = pd.DataFrame([[1,2,3], [4,5,6]], columns=["X", "Y", "Z"])
dfB = pd.DataFrame([[1,3,3], [4,1,6]], columns=["X", "Y", "Z"])
q = "(X > 3) & (Y < 10)"

print(dfA.query(q))
print(dfB.query(q))

   X  Y  Z
1  4  5  6
   X  Y  Z
1  4  1  6

More flexible syntax

df.query('a < b and b < c')  # reads a bit more like plain English
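
As a rough check of what this syntax buys, the sketch below compares the `and` form, the chained-comparison form, and the bracket-notation equivalent on a small made-up frame (the values are assumptions chosen for illustration):

```python
import pandas as pd

# Made-up data with three numeric columns.
df = pd.DataFrame({"a": [1, 5, 2], "b": [2, 3, 4], "c": [3, 9, 1]})

with_and = df.query("a < b and b < c")   # the form shown above
chained = df.query("a < b < c")          # chained comparison
bracket = df[(df["a"] < df["b"]) & (df["b"] < df["c"])]  # boolean indexing

print(with_and)
```

All three select the same rows; the query forms just avoid repeating the frame name and the extra parentheses.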

Supports the `in` and `not in` operators (an alternative to `isin`)

df.query('a in [3, 4, 5]') # select rows whose value of column a is in [3, 4, 5]

Special usage of == and != (similar to `in`/`not in`)

df.query('a == [1, 3, 5]') # select rows whose value of column a is in [1, 3, 5]
# equivalent to df.query('a in [1, 3, 5]')
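
To make the equivalence concrete, here is a small sketch on a made-up frame (column names and values are assumptions) comparing `in`, the list form of `==`, and `isin`:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": list("vwxyz")})

via_in = df.query("a in [1, 3, 5]")
via_eq = df.query("a == [1, 3, 5]")         # list on the right acts like `in`
via_isin = df[df["a"].isin([1, 3, 5])]      # bracket-notation equivalent
via_not_in = df.query("a not in [1, 3, 5]")

print(via_in)
print(via_not_in)
```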