Python 如何使用'in'和'not in'像SQL一样过滤Pandas数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19960077/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
提问by LondonRob
How can I achieve the equivalents of SQL's IN
and NOT IN
?
如何实现 SQLIN
和的等价物NOT IN
?
I have a list with the required values. Here's the scenario:
我有一个包含所需值的列表。这是场景:
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']
# pseudo-code:
df[df['countries'] not in countries]
My current way of doing this is as follows:
我目前的做法如下:
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})
# IN
df.merge(countries,how='inner',on='countries')
# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]
But this seems like a horrible kludge. Can anyone improve on it?
但这似乎是一个可怕的混杂。任何人都可以改进它吗?
采纳答案by DSM
You can use pd.Series.isin
.
您可以使用pd.Series.isin
.
For "IN" use: something.isin(somewhere)
对于“IN”使用: something.isin(somewhere)
Or for "NOT IN": ~something.isin(somewhere)
或者对于“不在”: ~something.isin(somewhere)
As a worked example:
作为一个工作示例:
>>> df
countries
0 US
1 UK
2 Germany
3 China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0 False
1 True
2 False
3 True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
countries
1 UK
3 China
>>> df[~df.countries.isin(countries)]
countries
0 US
2 Germany
回答by Kos
I've been usually doing generic filtering over rows like this:
我通常对这样的行进行通用过滤:
criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]
回答by Sam Henderson
I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds
我想过滤掉具有 BUSINESS_ID 的 dfbc 行,该行也在 dfProfilesBusIds 的 BUSINESS_ID 中
dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
回答by MaxU
回答by Ioannis Nasios
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']
implement in:
实施:
df[df.countries.isin(countries)]
implement not inas in of rest countries:
实施与其他国家不同:
df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]
回答by cs95
How to implement 'in' and 'not in' for a pandas DataFrame?
如何为 Pandas DataFrame 实现“in”和“not in”?
Pandas offers two methods: Series.isin
and DataFrame.isin
for Series and DataFrames, respectively.
熊猫提供了两种方法:Series.isin
和DataFrame.isin
分别为系列和DataFrames。
Filter DataFrame Based on ONE Column (also applies to Series)
基于一列过滤数据帧(也适用于系列)
The most common scenario is applying an isin
condition on a specific column to filter rows in a DataFrame.
最常见的场景是isin
在特定列上应用条件来过滤 DataFrame 中的行。
df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
countries
0 US
1 UK
2 Germany
3 China
c1 = ['UK', 'China'] # list
c2 = {'Germany'} # set
c3 = pd.Series(['China', 'US']) # Series
c4 = np.array(['US', 'UK']) # array
Series.isin
accepts various types as inputs. The following are all valid ways of getting what you want:
Series.isin
接受各种类型作为输入。以下是获得所需内容的所有有效方法:
df['countries'].isin(c1)
0 False
1 True
2 False
3 False
4 True
Name: countries, dtype: bool
# `in` operation
df[df['countries'].isin(c1)]
countries
1 UK
4 China
# `not in` operation
df[~df['countries'].isin(c1)]
countries
0 US
2 Germany
3 NaN
# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]
countries
2 Germany
# Filter with another Series
df[df['countries'].isin(c3)]
countries
0 US
4 China
# Filter with array
df[df['countries'].isin(c4)]
countries
0 US
1 UK
Filter on MANY Columns
过滤许多列
Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,
有时,您需要在多个列上使用一些搜索词应用“in”成员资格检查,
df2 = pd.DataFrame({
'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2
A B C
0 x w 0
1 y a 1
2 z NaN 2
3 q x 3
c1 = ['x', 'w', 'p']
To apply the isin
condition to both columns "A" and "B", use DataFrame.isin
:
要将isin
条件应用于“A”和“B”列,请使用DataFrame.isin
:
df2[['A', 'B']].isin(c1)
A B
0 True True
1 False False
2 False False
3 False True
From this, to retain rows where at least one column is True
, we can use any
along the first axis:
由此,要保留至少一列是 的行True
,我们可以any
沿第一个轴使用:
df2[['A', 'B']].isin(c1).any(axis=1)
0 True
1 False
2 False
3 True
dtype: bool
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
A B C
0 x w 0
3 q x 3
Note that if you want to search every column, you'd just omit the column selection step and do
请注意,如果您想搜索每一列,您只需省略列选择步骤并执行
df2.isin(c1).any(axis=1)
Similarly, to retain rows where ALL columns are True
, use all
in the same manner as before.
同样,要保留所有列所在的行,请True
all
以与以前相同的方式使用。
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
A B C
0 x w 0
Notable Mentions: numpy.isin
, query
, list comprehensions (string data)
值得注意的提及:numpy.isin
, query
, 列表推导式(字符串数据)
In addition to the methods described above, you can also use the numpy equivalent: numpy.isin
.
除了上述方法外,您还可以使用 numpy 等效项:numpy.isin
.
# `in` operation
df[np.isin(df['countries'], c1)]
countries
1 UK
4 China
# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]
countries
0 US
2 Germany
3 NaN
Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin
.
为什么值得考虑?由于开销较低,NumPy 函数通常比它们的 Pandas 函数快一点。由于这是一个不依赖于索引对齐的逐元素操作,因此在极少数情况下此方法不是 pandas' 的合适替代品isin
。
Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here..
We resort to an in
check now.
Pandas 例程在处理字符串时通常是迭代的,因为字符串操作很难向量化。有很多证据表明列表理解在这里会更快。. 我们in
现在求助于支票。
c1_set = set(c1) # Using `in` with `sets` is a constant time operation...
# This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]
countries
1 UK
4 China
# `not in` operation
df[[x not in c1_set for x in df['countries']]]
countries
0 US
2 Germany
3 NaN
It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.
但是,指定要麻烦得多,因此除非您知道自己在做什么,否则不要使用它。
Lastly, there's also DataFrame.query
which has been covered in this answer. numexpr FTW!
最后,本答案中还DataFrame.query
涵盖了哪些内容。numexpr FTW!
回答by Abhishek Gaur
Collating possible solutions from the answers:
从答案中整理可能的解决方案:
For IN: df[df['A'].isin([3, 6])]
对于 IN: df[df['A'].isin([3, 6])]
For NOT IN:
对于不在:
df[-df["A"].isin([3, 6])]
df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]
df[-df["A"].isin([3, 6])]
df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]