Dynamic Expression Evaluation in pandas using pd.eval()
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53779986/
Asked by cs95
Given two DataFrames
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df1
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
df2
A B C D
0 5 9 8 9
1 4 3 0 3
2 5 0 2 3
3 8 1 3 3
4 3 7 0 1
I would like to perform arithmetic on one or more columns using pd.eval. Specifically, I would like to port the following code:
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
...to code using eval. The reason for using eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.
I am trying to better understand the engine and parser arguments to determine how best to solve my problem. I have gone through the documentation, but the difference was not made clear to me.
- What arguments should be used to ensure my code is working at max performance?
- Is there a way to assign the result of the expression back to df2?
- Also, to make things more complicated, how do I pass x as an argument inside the string expression?
Answered by cs95
This answer dives into the various features and functionality offered by pd.eval, df.query, and df.eval.
Setup
Examples will involve these DataFrames (unless otherwise specified).
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
pandas.eval - The "Missing Manual"
Note
Of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage is more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section will introduce functionality that is common across all three functions - this includes (but is not limited to) allowed syntax, precedence rules, and keyword arguments.
pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do
x = 5
pd.eval("df1.A + (df1.B * x)")
Some things to note here:
- The entire expression is a string.
- df1, df2, and x refer to variables in the global namespace; these are picked up by eval when parsing the expression.
- Specific columns are accessed using the attribute accessor. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.
I will be addressing the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are more simple examples of valid operations with pd.eval:
pd.eval("df1.A + df2.A") # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5") # Valid, returns a pd.DataFrame object
...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.
pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")
A list detailing all the supported features and syntax can be found in the documentation. In summary,
- Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
- Comparison operations, including chained comparisons, e.g., 2 < df < df2
- Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
- list and tuple literals, e.g., [1, 2] or (1, 2)
- Attribute access, e.g., df.a
- Subscript expressions, e.g., df[0]
- Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
- Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
This section of the documentation also specifies syntax rules that are not supported, including set/dict literals, if-else statements, loops and comprehensions, and generator expressions.
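For instance, a dict literal is rejected at parse time. The sketch below only checks that parsing fails, since the exact exception type may vary across pandas versions:

```python
import pandas as pd

# dict literals are among the unsupported constructs; pd.eval rejects them
# when parsing the expression.
try:
    pd.eval("{'a': 1}")
    raised = False
except Exception:
    raised = True
print(raised)
```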
From the list, it is obvious you can also pass expressions involving the index, such as
pd.eval('df1.A * (df1.index > 1)')
Parser Selection: The parser=... argument
pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.
Using the default parser pandas, the overloaded bitwise operators & and | (which implement vectorized AND and OR operations with pandas objects) will have the same operator precedence as and and or. So,
pd.eval("(df1 > df2) & (df3 < df4)")
Will be the same as
pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')
And also the same as
pd.eval("df1 > df2 and df3 < df4")
Here, the parentheses are necessary. To do this conventionally, parentheses would be required to override the higher precedence of the bitwise operators:
(df1 > df2) & (df3 < df4)
Without that, we end up with
df1 > df2 & df3 < df4
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.
pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
The other difference between the two types of parsers is the semantics of the == and != operators with list and tuple nodes, which have semantics similar to in and not in respectively when using the 'pandas' parser. For example,
pd.eval("df1 == [1, 2, 3]")
Is valid, and will run with the same semantics as
pd.eval("df1 in [1, 2, 3]")
OTOH, pd.eval("df1 == [1, 2, 3]", parser='python')
will throw a NotImplementedError
error.
OTOH,pd.eval("df1 == [1, 2, 3]", parser='python')
会抛出NotImplementedError
错误。
Backend Selection: The engine=... argument
There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend, which is optimized for performance.
With the 'python' backend, your expression is evaluated similarly to just passing the expression to python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.
df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')
0 True
1 False
2 True
Name: A, dtype: bool
Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to 'python' unless you know what you're doing.
local_dict and global_dict arguments
Sometimes, it is useful to supply values for variables used inside expressions but not currently defined in your namespace. You can pass a dictionary to local_dict.
For example,
pd.eval("df1 > thresh")
UndefinedVariableError: name 'thresh' is not defined
This fails because thresh is not defined. However, this works:
pd.eval("df1 > thresh", local_dict={'thresh': 10})
This is useful when you have variables to supply from a dictionary. Alternatively, with the 'python' engine, you could simply do this:
mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')
But this is going to possibly be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this should make a convincing argument for the use of these parameters.
The target (+ inplace) argument, and Assignment Expressions
This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that implements __getitem__, such as dicts, and (you guessed it) DataFrames.
Consider the example in the question
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
To assign a column "D" to df2, we do
pd.eval('D = df1.A + (df1.B * x)', target=df2)
A B C D
0 5 9 8 5
1 4 3 0 52
2 5 0 2 22
3 8 1 3 48
4 3 7 0 42
This is not an in-place modification of df2 (but it can be... read on). Consider another example:
pd.eval('df1.A + df2.A')
0 10
1 11
2 7
3 16
4 10
dtype: int32
If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:
df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
F B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If you wanted to perform an in-place mutation on df, set inplace=True.
pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If inplace is set without a target, a ValueError is raised.
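A minimal sketch of that failure mode, using the seeded DataFrames from the setup:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Asking for an assignment without telling pd.eval where to put the
# result is an error:
try:
    pd.eval('B = df1.A + df2.A', inplace=True)  # no target=... supplied
    raised = False
except ValueError:
    raised = True
print(raised)
```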
While the target argument is fun to play around with, you will seldom need to use it.
If you wanted to do this with df.eval, you would use an expression involving an assignment:
df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
Note
One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:
pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)
It can also parse nested lists with the 'python' engine:
pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]
And lists of strings:
pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]
The problem, however, is for lists with length larger than 100:
pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
More information on this error, its causes, fixes, and workarounds can be found here.
DataFrame.eval - A Juxtaposition with pandas.eval
As mentioned above, df.eval calls pd.eval under the hood. The v0.23 source code shows this:
def eval(self, expr, inplace=False, **kwargs):
    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)
eval creates arguments, does a little validation, and passes the arguments on to pd.eval.
For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()
Usage Differences
Expressions with DataFrames v/s Series Expressions
For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.
Specifying Column Names
The other major difference is how columns are accessed. For example, to add the two columns "A" and "B" in df1, you would call pd.eval with the following expression:
pd.eval("df1.A + df1.B")
With df.eval, you need only supply the column names:
df1.eval("A + B")
Since, within the context of df1, it is clear that "A" and "B" refer to column names.
You can also refer to the index and columns using index (unless the index is named, in which case you would use the name).
df1.eval("A + index")
Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k", which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").
These rules also apply to query.
Accessing Variables in Local/Global Namespace
Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.
A = 5
df1.eval("A > @A")
The same goes for query.
It goes without saying that your column names must follow the rules for valid identifier naming in python to be accessible inside eval. See here for a list of rules on naming identifiers.
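As an aside (not covered in the original answer): for column names that are not valid python identifiers, newer pandas versions (0.25+) let query escape them with backticks. A small sketch with a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({'my col': [1, 5, 10], 'B': [2, 4, 6]})

# "my col" is not a valid identifier, but backticks make it usable
# inside query (pandas >= 0.25):
out = df.query("`my col` > 2")
print(out)
```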
Multiline Queries and Assignment
A little known fact is that eval supports multiline expressions that deal with assignment. For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do
df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")
A B C D E F G
0 5 0 3 3 5 14 False
1 7 9 3 5 16 7 True
2 2 4 7 6 6 5 True
3 8 8 1 6 16 9 True
4 7 7 8 1 14 10 True
...Nifty! However, note that this is not supported by query.
eval v/s query - Final Word
It helps to think of df.query as a function that uses pd.eval as a subroutine.
Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,
The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.
In terms of similarity, query and df.eval are both alike in how they access column names and variables.
The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider
df1.A
0 5
1 7
2 2
3 8
4 7
Name: A, dtype: int32
df1.B
0 9
1 3
2 0
3 1
4 7
Name: B, dtype: int32
To get all rows where "A" >= "B" in df1, we would use eval like this:
m = df1.eval("A >= B")
m
0 True
1 False
2 False
3 True
4 True
dtype: bool
m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:
df1[m]
# df1.loc[m]
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do
df1.query("A >= B")
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
Performance wise, it is exactly the same.
df1_big = pd.concat([df1] * 100000, ignore_index=True)
%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")
14.7 ms ± 33.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But the latter is more concise, and expresses the same operation in a single step.
Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)
df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
But don't.
Bottom line: Please use query when querying or filtering rows based on a conditional expression.
Answered by astro123
Great tutorial already, but bear in mind that before jumping wildly into the usage of eval/query, attracted by its simpler syntax, it has severe performance issues if your dataset has fewer than 15,000 rows.
In that case, simply use df.loc[mask1, mask2].
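A sketch of the suggested alternative: on a small frame, plain boolean indexing returns exactly the same rows as query, without the eval machinery's fixed overhead (the frame and seed follow the question's setup):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Equivalent filters; boolean .loc avoids the expression-parsing overhead
# that dominates on small frames:
via_query = df1.query("A >= B")
via_loc = df1.loc[df1.A >= df1.B]
print(via_query.equals(via_loc))
```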
Refer: https://pandas.pydata.org/pandas-docs/version/0.22/enhancingperf.html#enhancingperf-eval