Python Pandas:通过多列查找另一个 DataFrame 中不存在的行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/32652718/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 11:58:46  来源:igfitidea点击:

Pandas: Find rows which don't exist in another DataFrame by multiple columns

pythonjoinpandas

提问by Pekka

same as this python pandas: how to find rows in one dataframe but not in another?but with multiple columns

与此python pandas相同:如何在一个数据帧中查找行而不在另一个数据帧中查找行?但有多个列

This is the setup:

这是设置:

import pandas as pd

df = pd.DataFrame(dict(
    col1=[0,1,1,2],
    col2=['a','b','c','b'],
    extra_col=['this','is','just','something']
))

other = pd.DataFrame(dict(
    col1=[1,2],
    col2=['b','c']
))

Now, I want to select the rows from dfwhich don't exist in other. I want to do the selection by col1and col2

现在,我想选择df其他中不存在的行。我想通过col1和进行选择col2

In SQL I would do:

在 SQL 中,我会这样做:

select * from df 
where not exists (
    select * from other o 
    where df.col1 = o.col1 and 
    df.col2 = o.col2
)

And in Pandas I can do something like this but it feels very ugly. Part of the ugliness could be avoided if df had id-column but it's not always available.

在 Pandas 中,我可以做这样的事情,但感觉很丑陋。如果 df 有 id-column 但它并不总是可用,则可以避免部分丑陋。

key_col = ['col1','col2']
df_with_idx = df.reset_index()
common = pd.merge(df_with_idx,other,on=key_col)['index']
mask = df_with_idx['index'].isin(common)

desired_result =  df_with_idx[~mask].drop('index',axis=1)

So maybe there is some more elegant way?

那么也许有一些更优雅的方式?

采纳答案by EdChum

Since 0.17.0there is a new indicatorparam you can pass to mergewhich will tell you whether the rows are only present in left, right or both:

由于0.17.0有一个新indicator参数,您可以传递给merge它,它会告诉您行是否仅出现在左侧、右侧或两者中:

In [5]:
merged = df.merge(other, how='left', indicator=True)
merged

Out[5]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
1     1    b         is       both
2     1    c       just  left_only
3     2    b  something  left_only

In [6]:    
merged[merged['_merge']=='left_only']

Out[6]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
2     1    c       just  left_only
3     2    b  something  left_only

So you can now filter the merged df by selecting only 'left_only'rows

因此,您现在可以通过仅选择'left_only'行来过滤合并的 df

回答by greg_data

Interesting

有趣的

cols = ['col1','col2']
#get copies where the indeces are the columns of interest
df2 = df.set_index(cols)
other2 = other.set_index(cols)
#Look for index overlap, ~
df[~df2.index.isin(other2.index)]

Returns:

返回:

    col1 col2  extra_col
0     0    a       this
2     1    c       just
3     2    b  something

Seems a little bit more elegant...

看起来更优雅一点...