比较多列以获取两个 Pandas 数据帧中不同的行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/33524000/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 00:10:13  来源:igfitidea点击:

Compare Multiple Columns to Get Rows that are Different in Two Pandas Dataframes

pythonpandas

提问by slaw

I have two dataframes:

我有两个数据框:

df1=
    A    B   C
0   A0   B0  C0
1   A1   B1  C1
2   A2   B2  C2

df2=
    A    B   C
0   A2   B2  C10
1   A1   B3  C11
2   A9   B4  C12

and I want to find rows in df1 that are not found in df2 based on one or two columns (or more columns). So, if I only compare column 'A' then the following rows from df1 are not found in df2 (note that column 'B' and column 'C' are not used for comparison between df1 and df2)

我想根据一列或两列(或更多列)在 df1 中找到在 df2 中找不到的行。因此,如果我只比较“A”列,那么在 df2 中找不到来自 df1 的以下行(请注意,“B”列和“C”列不用于 df1 和 df2 之间的比较)

    A    B   C
0   A0   B0  C0

And I would like to return a series with

我想返回一个系列

0   False
1   True
2   True

Or, if I only compare column 'A' and column 'B' then the following rows from df1 are not found in df2 (note that column 'C' is not used for comparison between df1 and df2)

或者,如果我只比较“A”列和“B”列,那么在 df2 中找不到来自 df1 的以下行(请注意,“C”列不用于 df1 和 df2 之间的比较)

    A    B   C
0   A0   B0  C0
1   A1   B1  C1

And I would want to return a series with

我想返回一个系列

0   False
1   False
2   True

I know how to accomplish this using sets but I am looking for a straightforward Pandas way of accomplishing this.

我知道如何使用集合来实现这一点,但我正在寻找一种简单的 Pandas 方式来实现这一点。

回答by guibor

Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)

理想情况下,人们希望能够仅使用 ~df1[COLS].isin(df2[COLS]) 作为掩码,但这需要索引标签匹配(https://pandas.pydata.org/pandas-docs/稳定/生成/pandas.DataFrame.isin.html)

Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match:

这是一个使用 .isin 的简洁形式,但将第二个 DataFrame 转换为 dict 以便索引标签不需要匹配:

COLS = ['A', 'B'] # or whichever columns to use for comparison

df1[~df1[COLS].isin(df2[COLS].to_dict(
    orient='list')).all(axis=1)]

回答by vmg

 ~df1['A'].isin(df2['A'])

Should get you the series you want

应该给你你想要的系列

df1[ ~df1['A'].isin(df2['A'])]

The dataframe:

数据框:

    A   B   C
0   A0  B0  C0

回答by EdChum

If your version is 0.17.0then you can use pd.mergeand pass the cols of interest, how='left' and set indicator=Trueto whether the values are only present in left or both. You can then test whether the appended _mergecol is equal to 'both':

如果您的版本是,0.17.0那么您可以使用pd.merge并传递感兴趣的列, how='left' 并设置indicator=True为值是否仅出现在左侧或两者中。然后,您可以测试附加的_mergecol 是否等于 'both':

In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'

Out[102]:
0    False
1     True
2     True
Name: _merge, dtype: bool

In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'

Out[103]:
0    False
1    False
2     True
Name: _merge, dtype: bool

output from the merge:

合并的输出:

In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)

Out[104]:
    A B_x C_x  B_y  C_y     _merge
0  A0  B0  C0  NaN  NaN  left_only
1  A1  B1  C1   B3  C11       both
2  A2  B2  C2   B2  C10       both

In [105]:    
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)

Out[105]:
    A   B C_x  C_y     _merge
0  A0  B0  C0  NaN  left_only
1  A1  B1  C1  NaN  left_only
2  A2  B2  C2  C10       both

回答by Nader Hisham

Method ( 1 )

方法(一)



In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:

0   False
1   False
2   True

Method ( 2 )

方法(二)



you can use the left merge to obtain values that exist in both frames +values that exist in the first data frame only

您可以使用左合并获取+仅存在于第一个数据帧中的两个帧中存在的值

In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
    A   B   C_x C_y
0   A0  B0  C0  NaN
1   A1  B1  C1  NaN
2   A2  B2  C2  C10

then of course values that exist only in the first frame will have NANvalues in columns of the other data frame , then you can filter by this NANvalues by doing the following

那么当然仅存在于第一帧中的NAN值将在另一个数据帧的列中具有值,然后您可以NAN通过执行以下操作按此值进行过滤

In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
    A   B   C_x
0   A0  B0  C0
1   A1  B1  C1

In [17]:

if you want to get whether the values in Aexists in Byou can do the following

如果您想获取中的值是否A存在,B您可以执行以下操作

In [20]:
pd.notnull(left['C_y'])
Out[20]:
0    False
1    False
2     True
Name: C_y, dtype: bool