Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/48370708/
Best way to subset a pandas dataframe
Asked by Pierre-Eric Garcia
Hey, I'm new to pandas and I just came across df.query().
Why would people use df.query() when you can filter your DataFrames directly using bracket notation? The official pandas tutorial also seems to prefer the latter approach.
With bracket notation:
df[df['age'] <= 21]
With the pandas query method:
df.query('age <= 21')
Besides some of the stylistic or flexibility differences that have been mentioned, is one canonically preferred - namely for performance of operations on large dataframes?
Accepted answer by MaxU
Consider the following sample DF:
In [307]: df
Out[307]:
  sex  age     name
0   M   40      Max
1   F   35     Anna
2   M   29      Joe
3   F   18    Maria
4   F   23  Natalie
There are quite a few good reasons to prefer the .query() method.
- it might be much shorter and cleaner compared to boolean indexing:

In [308]: df.query("20 <= age <= 30 and sex=='F'")
Out[308]:
  sex  age     name
4   F   23  Natalie

In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
Out[309]:
  sex  age     name
4   F   23  Natalie

- you can prepare conditions (queries) programmatically:

In [315]: conditions = {'name':'Joe', 'sex':'M'}

In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])

In [317]: q
Out[317]: 'name=="Joe" and sex=="M"'

In [318]: df.query(q)
Out[318]:
  sex  age name
2   M   29  Joe
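The join above wraps every value in double quotes, so it only suits string columns. Here is a minimal sketch of one way to handle mixed types, assuming the sample frame from this answer. `build_query` and `max_age` are hypothetical names introduced only for illustration. It relies on `repr()` to quote strings while leaving numbers bare, and on the `@` prefix that `.query()` uses to refer to Python variables in the calling scope.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "F"],
    "age": [40, 35, 29, 18, 23],
    "name": ["Max", "Anna", "Joe", "Maria", "Natalie"],
})

def build_query(conditions):
    """Join 'column == value' pairs with 'and' (hypothetical helper)."""
    # {val!r} applies repr(): strings get quotes, numbers stay bare.
    return " and ".join(f"{col} == {val!r}" for col, val in conditions.items())

q = build_query({"sex": "F", "age": 23})
by_dict = df.query(q)

# Alternative: keep the value out of the string and reference it with @.
max_age = 30
young = df.query("age <= @max_age")
```

Using `@` avoids formatting values into the query string at all, which is probably the simpler choice when the column names are fixed and only the values vary.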
P.S. There are also some disadvantages:
- we can't use the .query() method for columns containing spaces or for columns whose names consist only of digits
- not all functions can be applied, or in some cases we have to use engine='python' instead of the default engine='numexpr' (which is faster)
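A short sketch of how these two limitations look in practice. The frame and its column names are made up for illustration. As far as I know, newer pandas versions (0.25 and later) accept backtick-quoted column names inside the query string, which works around the space restriction; treat the version number as an assumption and check your own install. The second query passes engine='python' because string methods are one of the cases numexpr does not handle.

```python
import pandas as pd

df = pd.DataFrame({
    "first name": ["Max", "Anna", "Maria"],
    "city": ["Munich", "Athens", "Madrid"],
    "age": [40, 35, 18],
})

# Backticks quote a column name that contains a space
# (assumes a pandas version with backtick support).
anna = df.query("`first name` == 'Anna'")

# String methods are evaluated in Python, so request the Python engine.
m_cities = df.query("city.str.startswith('M')", engine="python")
```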
NOTE: Jeff (one of the main Pandas contributors and a member of Pandas core team) once said:
Note that in reality .query is just a nice-to-have interface, in fact it has very specific guarantees, meaning it's meant to parse like a query language, and not a fully general interface.
Answer by Tai
Some other interesting usages from the documentation:
Reusable
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying. (Source)
Example:
import pandas as pd

dfA = pd.DataFrame([[1,2,3], [4,5,6]], columns=["X", "Y", "Z"])
dfB = pd.DataFrame([[1,3,3], [4,1,6]], columns=["X", "Y", "Z"])
q = "(X > 3) & (Y < 10)"  # the same query string works on both frames
print(dfA.query(q))
print(dfB.query(q))
   X  Y  Z
1  4  5  6
   X  Y  Z
1  4  1  6
More flexible syntax
df.query('a < b and b < c') # understand a bit more English
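To make the comparison concrete, here is a small sketch with made-up data (the frame and its values are assumptions, not from the documentation). It runs the `and` form, the chained form, and the equivalent boolean indexing, which should all select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 2], "b": [2, 4, 3], "c": [3, 6, 1]})

# Keyword form, as in the line above.
with_and = df.query("a < b and b < c")

# Chained comparison, close to how you would write it on paper.
chained = df.query("a < b < c")

# Equivalent boolean indexing with explicit masks.
brackets = df[(df["a"] < df["b"]) & (df["b"] < df["c"])]
```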
Support for the in and not in operators (an alternative to isin)
df.query('a in [3, 4, 5]') # select rows whose value of column a is in [3, 4, 5]
Special usage of == and != (similar to in / not in)
df.query('a == [1, 3, 5]') # select rows whose value of column a is in [1, 3, 5]
# equivalent to df.query('a in [1, 3, 5]')
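A quick way to check these equivalences yourself is to compare each query form against isin. This is only a sketch on a made-up single-column frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
wanted = [1, 3, 5]

# Membership: 'in', '== list', and isin select the same rows.
in_list = df.query("a in [1, 3, 5]")
eq_list = df.query("a == [1, 3, 5]")
isin_rows = df[df["a"].isin(wanted)]

# Non-membership: 'not in', '!= list', and ~isin select the same rows.
not_in = df.query("a not in [1, 3, 5]")
ne_list = df.query("a != [1, 3, 5]")
not_isin_rows = df[~df["a"].isin(wanted)]
```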