Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/22546425/

Date: 2020-08-19 01:08:27  Source: igfitidea

How to implement a Boolean search with multiple columns in pandas

Tags: python, pandas

Asked by Tyler Wood

I have a pandas df and would like to accomplish something along these lines (in SQL terms):

SELECT * FROM df WHERE column1 = 'a' OR column2 = 'b' OR column3 = 'c' etc.

Now this works, for one column/value pair:

foo = df.loc[df['column']==value]

However, I'm not sure how to expand that to multiple column/value pairs.

  • To be clear, each column matches a different value.

Accepted answer by EdChum

Because of operator precedence, you need to wrap each condition in parentheses and combine the conditions with the bitwise and (&) and or (|) operators:

foo = df[(df['column1'] == 'a') | (df['column2'] == 'b') | (df['column3'] == 'c')]

If you use the Python keywords "and" or "or" instead, pandas is likely to complain that the truth value of a Series is ambiguous. It is unclear whether every value in the Series should be compared, and what the result should mean if only one value, or all but one, matches the condition. That is why you should use the bitwise operators, or numpy's np.all or np.any, to specify the matching criteria.
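
As a minimal sketch of the point above, the following builds a small made-up DataFrame (the data and the column contents are assumptions for illustration; only the column names come from the question), combines three parenthesized conditions with |, and then shows that the keyword "or" raises a ValueError.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "column1": ["a", "x", "x", "x"],
    "column2": ["x", "b", "x", "x"],
    "column3": ["x", "x", "c", "x"],
})

# Each comparison is parenthesized because | binds more tightly than ==.
mask = (df["column1"] == "a") | (df["column2"] == "b") | (df["column3"] == "c")
foo = df.loc[mask]
print(foo)

# The keyword `or` asks the first Series for a single truth value,
# which pandas refuses to provide.
try:
    (df["column1"] == "a") or (df["column2"] == "b")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The same pattern applies to & for AND conditions and ~ for negation.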

There is also the query method: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.query.html

but it has some limitations, mainly in cases where column names and index values could be confused with each other.
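
A short sketch of how the query method could express the same filter, assuming made-up sample data and a hypothetical local variable named value. Inside the query string, "or" is allowed, and a local Python variable is referenced with the @ prefix.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "column1": ["a", "x", "x", "x"],
    "column2": ["x", "b", "x", "x"],
    "column3": ["x", "x", "c", "x"],
})

value = "a"  # hypothetical local variable, referenced as @value below

# The whole condition is passed as a string and evaluated by pandas.
foo = df.query("column1 == @value or column2 == 'b' or column3 == 'c'")
print(foo)
```

This reads close to the SQL in the question, at the cost of putting the condition in a string.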

Answer by Phillip Cloud

A more concise, but not necessarily faster, method is to use DataFrame.isin() and DataFrame.any():

In [26]: from numpy.random import randint; from pandas import DataFrame

In [27]: n = 10

In [28]: df = DataFrame(randint(4, size=(n, 2)), columns=list('ab'))

In [29]: df
Out[29]:
   a  b
0  0  0
1  1  1
2  1  1
3  2  3
4  2  3
5  0  2
6  1  2
7  3  0
8  1  1
9  2  2

[10 rows x 2 columns]

In [30]: df.isin([1, 2])
Out[30]:
       a      b
0  False  False
1   True   True
2   True   True
3   True  False
4   True  False
5  False   True
6   True   True
7  False  False
8   True   True
9   True   True

[10 rows x 2 columns]

In [31]: df.isin([1, 2]).any(axis=1)
Out[31]:
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8     True
9     True
dtype: bool

In [32]: df.loc[df.isin([1, 2]).any(axis=1)]
Out[32]:
   a  b
1  1  1
2  1  1
3  2  3
4  2  3
5  0  2
6  1  2
8  1  1
9  2  2

[8 rows x 2 columns]
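
Note that isin([1, 2]) matches either value in any column, while the question asks for a different value per column. As a hedged sketch, DataFrame.isin() also accepts a dict that maps column names to allowed values. The data below is copied from the session above; the per-column values (1 for a, 2 for b) are assumptions chosen for illustration.

```python
import pandas as pd

# Values copied from the session above so the result is reproducible.
df = pd.DataFrame({
    "a": [0, 1, 1, 2, 2, 0, 1, 3, 1, 2],
    "b": [0, 1, 1, 3, 3, 2, 2, 0, 1, 2],
})

# Same filter as the session: 1 or 2 in any column.
any_col = df.loc[df.isin([1, 2]).any(axis=1)]

# Per-column values (hypothetical): a == 1 OR b == 2.
per_col = df.loc[df.isin({"a": [1], "b": [2]}).any(axis=1)]

print(any_col)
print(per_col)
```

Columns that are missing from the dict are treated as non-matching, so only the listed columns contribute to the result.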

Answer by rra

Easiest way to do this.

If this is helpful, hit the up arrow! Thanks!!

import numpy as np
import pandas as pd

students = [('Hyman1', 'Apples1', 341),
            ('Riti1', 'Mangos1', 311),
            ('Aadi1', 'Grapes1', 301),
            ('Sonia1', 'Apples1', 321),
            ('Lucy1', 'Mangos1', 331),
            ('Mike1', 'Apples1', 351),
            ('Mik', 'Apples1', np.nan)]
# Create a DataFrame object (the NaN makes the Sale1 column float)
df = pd.DataFrame(students, columns=['Name1', 'Product1', 'Sale1'])
print(df)

    Name1 Product1  Sale1
0  Hyman1  Apples1  341.0
1   Riti1  Mangos1  311.0
2   Aadi1  Grapes1  301.0
3  Sonia1  Apples1  321.0
4   Lucy1  Mangos1  331.0
5   Mike1  Apples1  351.0
6     Mik  Apples1    NaN

# Select rows in the above DataFrame where the 'Product1' column equals 'Apples1'
subset = df[df['Product1'] == 'Apples1']
print(subset)

    Name1 Product1  Sale1
0  Hyman1  Apples1  341.0
3  Sonia1  Apples1  321.0
5   Mike1  Apples1  351.0
6     Mik  Apples1    NaN

# Select rows where the 'Product1' column equals 'Apples1' AND 'Sale1' is not null

subsetx = df[(df['Product1'] == 'Apples1') & (df['Sale1'].notnull())]
print(subsetx)

    Name1 Product1  Sale1
0  Hyman1  Apples1  341.0
3  Sonia1  Apples1  321.0
5   Mike1  Apples1  351.0

# Select rows where the 'Product1' column equals 'Apples1' AND 'Sale1' equals 351

subsetx = df[(df['Product1'] == 'Apples1') & (df['Sale1'] == 351)]
print(subsetx)

   Name1 Product1  Sale1
5  Mike1  Apples1  351.0

# Another example: rows whose 'Product1' value is one of several values
subsetData = df[df['Product1'].isin(['Mangos1', 'Grapes1'])]
print(subsetData)

   Name1 Product1  Sale1
1  Riti1  Mangos1  311.0
2  Aadi1  Grapes1  301.0
4  Lucy1  Mangos1  331.0

Here is the original link where I found this; I edited it a little: https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/

Answer by Massifox

All the considerations made by @EdChum in 2014 are still valid, but the pandas.DataFrame.ix method has been deprecated since pandas version 0.20.0. Directly from the docs:

Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

In subsequent versions of pandas, this method has been replaced by the new indexing methods pandas.DataFrame.loc and pandas.DataFrame.iloc.

If you want to learn more, in this post you can find comparisons between the methods mentioned above.

Ultimately, to date (and there does not seem to be any change in the upcoming versions of pandas from this point of view), the answer to this question is as follows:

foo = df.loc[(df['column1'] == 'a') | (df['column2'] == 'b') | (df['column3'] == 'c')]
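
When the number of column/value pairs grows, writing each condition by hand gets tedious. One possible sketch, assuming the pairs are held in a dict (the name criteria and the sample data are hypothetical), is to build one boolean mask per pair and OR them together with functools.reduce.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "column1": ["a", "x", "x", "x"],
    "column2": ["x", "b", "x", "x"],
    "column3": ["x", "x", "c", "x"],
})

# Hypothetical mapping of column name -> value to match in that column.
criteria = {"column1": "a", "column2": "b", "column3": "c"}

# One boolean Series per pair, combined element-wise with |.
masks = [df[col] == val for col, val in criteria.items()]
mask = reduce(operator.or_, masks)

foo = df.loc[mask]
print(foo)
```

Swapping operator.or_ for operator.and_ would require every pair to match instead of any one of them.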