Logical operators for boolean indexing in Python Pandas

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/21415661/

Date: 2020-08-18 22:47:49  Source: igfitidea

Logical operators for boolean indexing in Pandas

Tags: python, pandas, dataframe, boolean, filtering

Asked by user2988577

I'm working with boolean indexing in Pandas. The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with an error?

Example:

a=pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Accepted answer by unutbu

When you say

(a['x']==1) and (a['y']==10)

you are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a boolean value -- in other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a boolean value. That's because it's unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might desire for it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by using the empty attribute or calling the all() or any() method to indicate which behavior you desire.
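
As a rough illustration of those explicit choices (a minimal sketch that reuses the question's DataFrame a; the variable names raised, has_any, has_all and is_empty are made up for this example), the ambiguity and its resolutions look like this:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 1], "y": [10, 20]})
mask = a["x"] == 1                      # boolean Series: [True, True]

# `and`, `or`, `not` and `if` all call bool() on their operands.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True                       # pandas refuses to guess

has_any = bool(mask.any())              # at least one element is True
has_all = bool((a["y"] == 10).all())    # [True, False], so not all are True
is_empty = mask.empty                   # attribute, not a method call
```

Each of the three explicit forms collapses the Series to a single Python bool, which is what a plain if statement needs.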

In this case, however, it looks like you do not want boolean evaluation; you want element-wise logical-and. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.

By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==. Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10, which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That's why the parentheses are mandatory.
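
To see the effect of the missing parentheses in practice, here is a small sketch using the question's DataFrame a (the names with_parens and error_message are only illustrative):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 1], "y": [10, 20]})

# Parenthesised: two boolean Series combined element-wise.
with_parens = (a["x"] == 1) & (a["y"] == 10)

# Unparenthesised: becomes a chained comparison, which applies `and`
# to two Series and therefore calls bool() on a Series.
try:
    a["x"] == 1 & a["y"] == 10
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The unparenthesised form fails with the same "truth value ... is ambiguous" error, even though no and appears in the source text.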

Answer by cs95

TL;DR: the logical operators in Pandas are &, | and ~, and parentheses (...) are important!

Python's and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.

So the following in Python (exp1 and exp2 are expressions which evaluate to a boolean result)...

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

...will translate to...

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If in the process of performing logical operation you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y) 

And so on.
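
A minimal sketch of that translation, assuming a small made-up DataFrame (the columns col1 and col2 and the values x and y are hypothetical, as in the snippet above):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["a", "b", "b"]})  # made-up data
x, y = 1, "a"

# Scalars: the plain Python keywords work.
scalar_result = (1 == x) and ("a" == y)

# Series: use the element-wise operators, with each comparison parenthesised.
both = (df["col1"] == x) & (df["col2"] == y)      # element-wise AND
either = (df["col1"] == x) | (df["col2"] == y)    # element-wise OR
neither = ~either                                 # element-wise NOT
```

Here both is [True, False, False], either is [True, False, True] and neither is [False, True, False].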

Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which states:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence than the comparison operators < and >. See the section on Operator Precedence in the Python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don't make that mistake! [1]
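
The parsing steps above can be checked without pandas at all. The following is a sketch using the standard library's ast module; the helper top_level_node is a name invented for this example:

```python
import ast


def top_level_node(expression: str) -> ast.expr:
    """Return the root AST node of a single Python expression."""
    return ast.parse(expression, mode="eval").body


unparenthesised = top_level_node("df['A'] < 5 & df['B'] > 5")
parenthesised = top_level_node("(df['A'] < 5) & (df['B'] > 5)")

# Without parentheses the root is ONE chained comparison with two
# operators (< and >); its middle operand is the bitwise expression.
middle = unparenthesised.comparators[0]
print(type(unparenthesised).__name__, len(unparenthesised.ops))  # Compare 2
print(type(middle).__name__)                                     # BinOp

# With parentheses the root is the & itself, joining two comparisons.
print(type(parenthesised).__name__)                              # BinOp
```

Since only the expression text is parsed, df never has to exist; the structure of the tree is decided by Python's grammar before pandas is involved.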

Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛
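
As a quick sanity check of the table (a sketch only; the Series s copies column A of the df above, and the threshold 4 is arbitrary), each method can be compared against its operator through the standard operator module:

```python
import operator

import pandas as pd

s = pd.Series([5, 3, 3, 4, 8], name="A")   # same values as column A above

pairs = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}

# For every row of the table, the method form and the operator form
# should produce identical boolean Series.
agreement = {
    name: getattr(s, name)(4).equals(op(s, 4))
    for name, op in pairs.items()
}
```

Every entry of agreement should be True, which is why the method spelling can replace a parenthesised comparison.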

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6
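
A short sketch of a few query spellings that should select the same rows (the DataFrame is rebuilt from the values printed above, and limit is a hypothetical local variable referenced with the @ prefix):

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 3, 4, 8],
                   "B": [0, 7, 5, 7, 8],
                   "C": [3, 9, 2, 6, 1]})
limit = 5   # hypothetical local variable, referenced inside the string with @

expected = df[(df["A"] < 5) & (df["B"] > 5)]

with_words = df.query("A < 5 and B > 5")
with_symbols = df.query("A < 5 & B > 5")    # inside query, & binds like `and`
with_variable = df.query("A < @limit and B > @limit")
```

Inside a query string the & operator is given the precedence of and, so the unparenthesised form that fails in plain Python is accepted there.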

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().

operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__, which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won't usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (universal function), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1, m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and ANDing all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6
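
If you would rather get a Series back than a bare NumPy array, one possible variation (not part of the original answer, just a sketch) is to fold the masks with functools.reduce and operator.and_. The conditions list and the extra C != 0 condition are made up for illustration:

```python
import operator
from functools import reduce

import pandas as pd

df = pd.DataFrame({"A": [5, 3, 3, 4, 8],
                   "B": [0, 7, 5, 7, 8],
                   "C": [3, 9, 2, 6, 1]})   # same values as the df above

# (column, comparison, value) triples; the third one is hypothetical.
conditions = [("A", operator.lt, 5), ("B", operator.gt, 5), ("C", operator.ne, 0)]

masks = [op(df[col], value) for col, op, value in conditions]
combined = reduce(operator.and_, masks)     # m1 & m2 & m3, for any number of masks
```

Because each step is an ordinary & between Series, combined keeps the DataFrame's index, and df[combined] selects rows 1 and 3 as before.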

1 - I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.



Logical OR

For the df above, say you'd like to return all rows where A == 3 or B == 7.

Overloaded Bitwise |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven't yet, please also read the section on Logical AND above; all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6
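
The OR spellings shown in this section can be checked against one another. This is only a sketch: the DataFrame is rebuilt from the values printed earlier, and the dictionary keys are labels chosen for this example:

```python
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 3, 4, 8],
                   "B": [0, 7, 5, 7, 8],
                   "C": [3, 9, 2, 6, 1]})
left, right = df["A"] == 3, df["B"] == 7

variants = {
    "|": left | right,
    "eq": df["A"].eq(3) | df["B"].eq(7),
    "operator.or_": operator.or_(left, right),
    "np.logical_or": np.logical_or(left, right),
    "np.logical_or.reduce": np.logical_or.reduce([left, right]),  # ndarray
}

# Index labels of the rows each mask selects.
selected = {name: df[mask].index.tolist() for name, mask in variants.items()}
```

All five masks should select rows 1, 2 and 3; the reduce form yields a NumPy array rather than a Series, but it indexes the DataFrame the same way.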


Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don't use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool
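
To tie the NOT variants together, here is a small sketch (the names inverted and not_keyword_failed are invented for this example) that also shows what happens with the plain not keyword:

```python
import operator

import numpy as np
import pandas as pd

mask = pd.Series([True, True, False])

# Three element-wise spellings of NOT.
inverted = [~mask, operator.inv(mask), np.logical_not(mask)]

# The `not` keyword needs a single truth value, so it fails on a Series.
try:
    not mask
    not_keyword_failed = False
except ValueError:
    not_keyword_failed = True
```

All three element-wise forms give [False, False, True], while not mask raises the familiar ambiguity error.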


Note that, for boolean masks, np.logical_and can be substituted with np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.

Answer by MSeifert

Logical operators for boolean indexing in Pandas

It's important to realize that you cannot use any of the Python logical operators (and, or or not) on pandas.Series or pandas.DataFrames (similarly you cannot use them on numpy.arrays with more than one element). The reason why you cannot use those is that they implicitly call bool on their operands, which throws an exception because these data structures decided that the boolean of an array is ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" Q+A.

NumPy's logical functions

However, NumPy provides element-wise equivalents to these operators as functions (np.logical_and, np.logical_or, np.logical_not and np.logical_xor) that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass.

So, essentially, one should use (assuming df1 and df2 are pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
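
For a concrete picture, here is a sketch with two small boolean DataFrames (df1, df2 and their columns p and q are made-up data for this example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"p": [True, True, False], "q": [False, True, False]})
df2 = pd.DataFrame({"p": [True, False, False], "q": [True, True, False]})

both = np.logical_and(df1, df2)          # True where both frames are True
either = np.logical_or(df1, df2)         # True where at least one is True
exactly_one = np.logical_xor(df1, df2)   # True where the frames differ
negated = np.logical_not(df1)            # element-wise inversion of df1
```

Each call works cell by cell on frames with matching labels, and for boolean input the results agree with df1 & df2, df1 | df2, df1 ^ df2 and ~df1.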

Bitwise functions and bitwise operators for booleans

However, in case you have boolean NumPy arrays, pandas Series, or pandas DataFrames, you could also use the element-wise bitwise functions np.bitwise_and, np.bitwise_or, np.invert and np.bitwise_xor (for booleans they are, or at least should be, indistinguishable from the logical functions).

Typically the operators are used. However, when combined with comparison operators, one has to remember to wrap each comparison in parentheses, because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating, because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are, for example, simple integers) and don't need the parentheses.

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bit and logical operations are only equivalent for boolean NumPy arrays (and boolean Series & DataFrames). If these don't contain booleans then the operations will give different results. I'll include examples using NumPy arrays but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly pandas) does different things for boolean (Boolean or “mask” index arrays) and integer (Index arrays) indices, the results of indexing will also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])
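
The difference becomes more visible with integers other than 0 and 1. The arrays a and b below are made-up values for this sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 2, 4])

# Logical AND looks only at truthiness: every element here is non-zero.
logical = np.logical_and(a, b)

# Bitwise AND compares bit patterns: 1 & 2 = 0, 2 & 2 = 2, 3 & 4 = 0.
bitwise = np.bitwise_and(a, b)
```

Here logical is all True, while bitwise is [0, 2, 0], so the two would disagree in the first and last positions if bitwise were later treated as a mask.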

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

The logical operators do not work on NumPy arrays, pandas Series, and pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However, be careful with the bitwise invert on plain Python bools, because the bool will be interpreted as an integer in this context (for example, ~False returns -1 and ~True returns -2).
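
A short sketch of that last pitfall (variable names are illustrative; recent Python versions may emit a DeprecationWarning for ~ on a plain bool, which this example silences):

```python
import warnings

import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # newer Pythons may warn about ~ on bool
    # Plain Python bools are treated as the integers 0 and 1.
    python_inverted = {value: ~value for value in (False, True)}

# Both results are non-zero integers, so both are truthy.
both_truthy = all(bool(result) for result in python_inverted.values())

# NumPy boolean arrays are inverted element-wise, as expected.
numpy_inverted = np.invert(np.array([False, True]))

# The `not` keyword is the right tool for a single Python bool.
keyword_inverted = (not False, not True)
```

So ~ is safe on boolean arrays, Series and DataFrames, but for a single Python bool the not keyword is the form that gives a boolean answer.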