Python 如何使用'in'和'not in'像SQL一样过滤Pandas数据框

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/19960077/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 15:04:26  来源:igfitidea点击:

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

pythonpandasdataframesql-function

提问by LondonRob

How can I achieve the equivalents of SQL's INand NOT IN?

如何实现 SQLIN和的等价物NOT IN

I have a list with the required values. Here's the scenario:

我有一个包含所需值的列表。这是场景:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

# pseudo-code:
df[df['countries'] not in countries]

My current way of doing this is as follows:

我目前的做法如下:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})

# IN
df.merge(countries,how='inner',on='countries')

# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?

但这似乎是一个可怕的混杂。任何人都可以改进它吗?

采纳答案by DSM

You can use pd.Series.isin.

您可以使用pd.Series.isin.

For "IN" use: something.isin(somewhere)

对于“IN”使用: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

或者对于“不在”: ~something.isin(somewhere)

As a worked example:

作为一个工作示例:

>>> df
  countries
0        US
1        UK
2   Germany
3     China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0    False
1     True
2    False
3     True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
  countries
1        UK
3     China
>>> df[~df.countries.isin(countries)]
  countries
0        US
2   Germany

回答by Kos

I've been usually doing generic filtering over rows like this:

我通常对这样的行进行通用过滤:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

回答by Sam Henderson

I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds

我想过滤掉具有 BUSINESS_ID 的 dfbc 行,该行也在 dfProfilesBusIds 的 BUSINESS_ID 中

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

回答by MaxU

Alternative solution that uses .query()method:

使用.query()方法的替代解决方案:

In [5]: df.query("countries in @countries")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("countries not in @countries")
Out[6]:
  countries
0        US
2   Germany

回答by Ioannis Nasios

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

implement in:

实施

df[df.countries.isin(countries)]

implement not inas in of rest countries:

实施与其他国家不同:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

回答by cs95

How to implement 'in' and 'not in' for a pandas DataFrame?

如何为 Pandas DataFrame 实现“in”和“not in”?

Pandas offers two methods: Series.isinand DataFrame.isinfor Series and DataFrames, respectively.

熊猫提供了两种方法:Series.isinDataFrame.isin分别为系列和DataFrames。



Filter DataFrame Based on ONE Column (also applies to Series)

基于一列过滤数据帧(也适用于系列)

The most common scenario is applying an isincondition on a specific column to filter rows in a DataFrame.

最常见的场景是isin在特定列上应用条件来过滤 DataFrame 中的行。

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isinaccepts various types as inputs. The following are all valid ways of getting what you want:

Series.isin接受各种类型作为输入。以下是获得所需内容的所有有效方法:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK


Filter on MANY Columns

过滤许多列

Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,

有时,您需要在多个列上使用一些搜索词应用“in”成员资格检查,

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isincondition to both columns "A" and "B", use DataFrame.isin:

要将isin条件应用于“A”和“B”列,请使用DataFrame.isin

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use anyalong the first axis:

由此,要保留至少一列是 的行True,我们可以any沿第一个轴使用:

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Note that if you want to search every column, you'd just omit the column selection step and do

请注意,如果您想搜索每一列,您只需省略列选择步骤并执行

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use allin the same manner as before.

同样,要保留所有列所在的行,请Trueall以与以前相同的方式使用。

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0


Notable Mentions: numpy.isin, query, list comprehensions (string data)

值得注意的提及:numpy.isin, query, 列表推导式(字符串数据)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

除了上述方法外,您还可以使用 numpy 等效项:numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

为什么值得考虑?由于开销较低,NumPy 函数通常比它们的 Pandas 函数快一点。由于这是一个不依赖于索引对齐的逐元素操作,因此在极少数情况下此方法不是 pandas' 的合适替代品isin

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here.. We resort to an incheck now.

Pandas 例程在处理字符串时通常是迭代的,因为字符串操作很难向量化。有很多证据表明列表理解在这里会更快。. 我们in现在求助于支票。

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.

但是,指定要麻烦得多,因此除非您知道自己在做什么,否则不要使用它。

Lastly, there's also DataFrame.querywhich has been covered in this answer. numexpr FTW!

最后,本答案中DataFrame.query涵盖了哪些内容。numexpr FTW!

回答by Abhishek Gaur

Collating possible solutions from the answers:

从答案中整理可能的解决方案:

For IN: df[df['A'].isin([3, 6])]

对于 IN: df[df['A'].isin([3, 6])]

For NOT IN:

对于不在:

  1. df[-df["A"].isin([3, 6])]

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]

  1. df[-df["A"].isin([3, 6])]

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]