python pandas: how to find rows in one dataframe but not in another?
Original question: http://stackoverflow.com/questions/32651860/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, link to the original question, and attribute it to the original authors on Stack Overflow.
Asked by Pythonista anonymous
Let's say that I have two tables, people_all and people_usa, both with the same structure and therefore the same primary key.
How can I get a table of the people not in the USA? In SQL I'd do something like:
select a.*
from people_all a
left outer join people_usa u
on a.id = u.id
where u.id is null
What would be the Python equivalent? I cannot think of a way to translate this where clause into pandas syntax.
The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy']=1), do a left join, keep only the records where 'dummy' is NaN, and then delete the dummy field, which seems a bit convoluted.
Thanks!
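The left-join idea described in the question can be sketched without a hand-made dummy column, because merge accepts indicator=True and then reports where each row came from. This is only a sketch under assumptions: the sample rows and the "name" column are made up for illustration, and the key is called "id" to match the SQL.

```python
import pandas as pd

# Hypothetical sample data; the key column "id" follows the SQL in the question.
people_all = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cho", "Dee"]})
people_usa = pd.DataFrame({"id": [2, 4], "name": ["Bob", "Dee"]})

# Left join on the key. indicator=True adds a "_merge" column recording whether
# each row was found in the left frame only or in both frames.
merged = people_all.merge(people_usa[["id"]], on="id", how="left", indicator=True)

# Keep the rows with no match on the right, then drop the helper column.
not_usa = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(not_usa)
```

Only the key column of people_usa is joined here, so the result keeps the columns of people_all without suffixed duplicates.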
Answered by EdChum
Use isin and negate the boolean mask:
people_usa[~people_usa['ID'].isin(people_all['ID'])]
Example:
In [364]:
import numpy as np
import pandas as pd
people_all = pd.DataFrame({ 'ID' : np.arange(5)})
people_usa = pd.DataFrame({ 'ID' : [3,4,6,7,100]})
people_usa[~people_usa['ID'].isin(people_all['ID'])]
Out[364]:
    ID
2    6
3    7
4  100
So 3 and 4 are removed from the result. The boolean mask looks like this:
In [366]:
people_usa['ID'].isin(people_all['ID'])
Out[366]:
0     True
1     True
2     False
3     False
4     False
Name: ID, dtype: bool
Using ~ inverts the mask.
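The example above filters people_usa. For the question as asked (rows of people_all whose ID is absent from people_usa), the same pattern should apply with the two frames swapped. A minimal sketch, reusing the sample data from this answer; the helper name rows_not_in is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def rows_not_in(left, right, key):
    """Return the rows of `left` whose `key` value does not appear in `right`."""
    mask = left[key].isin(right[key])  # True where the key also exists in `right`
    return left[~mask]                 # ~ flips the mask, keeping unmatched rows

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4, 6, 7, 100]})

not_usa = rows_not_in(people_all, people_usa, "ID")
print(not_usa)
```

With this data, the rows with ID 3 and 4 are dropped from people_all because they also appear in people_usa.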
Answered by MaxU
Here is another pandas method that reads much like SQL, .query():
people_all.query('ID not in @people_usa.ID')
Or use NumPy's in1d() function:
people_all[~np.in1d(people_all['ID'], people_usa['ID'])]
NOTE: for those who have experience with SQL, it might be worth reading the pandas "Comparison with SQL" documentation.
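Both variants from this answer can be tried side by side. The sketch below makes a few assumptions: the function names are hypothetical, the sample data is borrowed from the earlier answer, the IDs are copied into a local variable so that the query string can refer to it with @, and np.isin is used in place of in1d because recent NumPy releases recommend it.

```python
import numpy as np
import pandas as pd

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4, 6, 7, 100]})

def anti_join_query(left, right):
    # Local variable, referenced inside the query string as @usa_ids.
    usa_ids = right["ID"]
    return left.query("ID not in @usa_ids")

def anti_join_isin(left, right):
    # np.isin returns a boolean array: True where the left ID occurs on the right.
    mask = np.isin(left["ID"], right["ID"])
    return left[~mask]

print(anti_join_query(people_all, people_usa))
print(anti_join_isin(people_all, people_usa))
```

Both calls should select the same rows; which one reads better is largely a matter of taste.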
Answered by Graham Streich
I would combine (by stacking) the data frames and then call the .drop_duplicates method. Documentation can be found here:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html
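One way the stacking idea might look in practice is sketched below. It assumes that people_all contains no duplicate rows of its own and that rows are compared across all columns. Stacking people_usa twice is an extra trick, not part of the answer above: it makes sure every people_usa row counts as a duplicate, so rows that exist only in people_usa do not leak into the result.

```python
import pandas as pd

people_all = pd.DataFrame({"ID": [0, 1, 2, 3, 4]})
people_usa = pd.DataFrame({"ID": [3, 4, 6]})  # 6 is deliberately missing from people_all

# Stack people_all once and people_usa twice, then drop every row that occurs
# more than once. keep=False removes all copies instead of keeping the first.
stacked = pd.concat([people_all, people_usa, people_usa], ignore_index=True)
not_usa = stacked.drop_duplicates(keep=False)
print(not_usa)
```

If people_usa is guaranteed to be a subset of people_all, stacking it once is enough.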

