python pandas: how to find rows in one dataframe but not in another?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/32651860/

Date: 2020-09-13 23:54:15  Source: igfitidea

Tags: python, pandas, dataframe

Asked by Pythonista anonymous

Let's say that I have two tables, people_all and people_usa, both with the same structure and therefore the same primary key.

How can I get a table of the people not in the USA? In SQL I'd do something like:

select a.*
from people_all a
left outer join people_usa u
on a.id = u.id
where u.id is null

What would be the Python equivalent? I cannot think of a way to translate this where statement into pandas syntax.

The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy']=1), do a left join, then take only the records where 'dummy' is nan, then delete the dummy field - which seems a bit convoluted.
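
For reference, the left-join idea described above can be sketched without a hand-made dummy field, because merge accepts indicator=True and adds a helper column (named _merge by default) that plays the same role. This is only a minimal sketch: the sample rows and the name column are made up for illustration, and id is assumed to be the shared key as in the SQL above.

```python
import pandas as pd

# Hypothetical sample data; only the 'id' key column matters here.
people_all = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cho", "Dee"]})
people_usa = pd.DataFrame({"id": [2, 4], "name": ["Bob", "Dee"]})

# Left join on the key only; indicator=True adds a '_merge' column whose value
# is 'left_only' for rows that found no match in people_usa.
merged = people_all.merge(people_usa[["id"]], on="id", how="left", indicator=True)

# Keep the unmatched rows (the SQL "where u.id is null") and drop the helper column.
not_usa = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(not_usa)
```

Only the key column of people_usa is merged in, so the result keeps exactly the columns of people_all.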

Thanks!

Answered by EdChum

Use isin and negate the boolean mask:

people_usa[~people_usa['ID'].isin(people_all['ID'])]

Example:

In [364]:
people_all = pd.DataFrame({ 'ID' : np.arange(5)})
people_usa = pd.DataFrame({ 'ID' : [3,4,6,7,100]})
people_usa[~people_usa['ID'].isin(people_all['ID'])]

Out[364]:
    ID
2    6
3    7
4  100

So 3 and 4 are removed from the result. The boolean mask looks like this:

In [366]:
people_usa['ID'].isin(people_all['ID'])

Out[366]:
0     True
1     True
2    False
3    False
4    False
Name: ID, dtype: bool

Using ~ inverts the mask.
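
Note that the example above filters people_usa against people_all. For the direction asked in the question (rows of people_all that are not in people_usa), the same technique would presumably be applied with the two frames swapped. A minimal sketch, reusing the sample data from this answer:

```python
import numpy as np
import pandas as pd

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4, 6, 7, 100]})

# True where an ID from people_all also appears in people_usa.
in_usa = people_all["ID"].isin(people_usa["ID"])

# Negate the mask to keep only the rows of people_all absent from people_usa.
not_usa = people_all[~in_usa]
print(not_usa["ID"].tolist())  # [0, 1, 2]
```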

Answered by MaxU

Here is another pandas method that reads much like SQL, .query():

people_all.query('ID not in @people_usa.ID')

or using NumPy's in1d() function:

people_all[~np.in1d(people_all['ID'], people_usa['ID'])]
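
Both variants can be tried side by side on small data. The sketch below is illustrative only: the sample frames are borrowed from the earlier answer, usa_ids is a helper name introduced here, and np.isin is swapped in for in1d because newer NumPy releases recommend it.

```python
import numpy as np
import pandas as pd

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4, 6, 7, 100]})

# .query(): the '@' prefix refers to a Python variable in the calling scope.
usa_ids = people_usa["ID"].tolist()  # helper name chosen for this sketch
via_query = people_all.query("ID not in @usa_ids")

# NumPy: np.isin returns a boolean array; invert=True gives "not in" directly.
not_in_usa = np.isin(people_all["ID"], people_usa["ID"], invert=True)
via_numpy = people_all[not_in_usa]

print(via_query["ID"].tolist())  # [0, 1, 2]
print(via_numpy["ID"].tolist())  # [0, 1, 2]
```

Comparing the key columns explicitly, rather than passing whole DataFrames, keeps the check meaningful when the frames have more than one column.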

NOTE: for those who have experience with SQL, it may be worth reading the pandas documentation page "Comparison with SQL".

Answered by Graham Streich

I would combine (by stacking) the data frames and then call the .drop_duplicates method. The documentation is here:

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html
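
The answer gives no code, so the following is only one possible reading of it, not necessarily what the author had in mind. With the default arguments, drop_duplicates keeps the first copy of each row, so this sketch passes keep=False to drop every copy of a duplicated row. It also stacks people_usa twice so that rows found only in people_usa are removed as well. The sample data is hypothetical. The approach compares whole rows and assumes that both frames have the same columns and that people_all itself contains no duplicated rows.

```python
import pandas as pd

people_all = pd.DataFrame({"ID": [0, 1, 2, 3, 4]})
people_usa = pd.DataFrame({"ID": [3, 4, 6]})

# Stack people_all once and people_usa twice: every people_usa row then occurs
# at least twice, so keep=False (drop all copies of any duplicated row) removes it.
stacked = pd.concat([people_all, people_usa, people_usa], ignore_index=True)
not_usa = stacked.drop_duplicates(keep=False)

print(not_usa["ID"].tolist())  # [0, 1, 2]
```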