比较多列以获取两个 Pandas 数据帧中不同的行
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/33524000/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Compare Multiple Columns to Get Rows that are Different in Two Pandas Dataframes
提问by slaw
I have two dataframes:
我有两个数据框:
df1=
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
df2=
A B C
0 A2 B2 C10
1 A1 B3 C11
2 A9 B4 C12
and I want to find rows in df1 that are not found in df2 based on one or two columns (or more columns). So, if I only compare column 'A' then the following rows from df1 are not found in df2 (note that column 'B' and column 'C' are not used for comparison between df1 and df2)
我想根据一列或两列(或更多列)在 df1 中找到在 df2 中找不到的行。因此,如果我只比较“A”列,那么在 df2 中找不到来自 df1 的以下行(请注意,“B”列和“C”列不用于 df1 和 df2 之间的比较)
A B C
0 A0 B0 C0
And I would like to return a series with
我想返回一个系列
0 False
1 True
2 True
Or, if I only compare column 'A' and column 'B' then the following rows from df1 are not found in df2 (note that column 'C' is not used for comparison between df1 and df2)
或者,如果我只比较“A”列和“B”列,那么在 df2 中找不到来自 df1 的以下行(请注意,“C”列不用于 df1 和 df2 之间的比较)
A B C
0 A0 B0 C0
1 A1 B1 C1
And I would want to return a series with
我想返回一个系列
0 False
1 False
2 True
I know how to accomplish this using sets but I am looking for a straightforward Pandas way of accomplishing this.
我知道如何使用集合来实现这一点,但我正在寻找一种简单的 Pandas 方式来实现这一点。
回答by guibor
Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)
理想情况下,人们希望能够仅使用 ~df1[COLS].isin(df2[COLS]) 作为掩码,但这需要索引标签匹配(https://pandas.pydata.org/pandas-docs/稳定/生成/pandas.DataFrame.isin.html)
Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match:
这是一个使用 .isin 的简洁形式,但将第二个 DataFrame 转换为 dict 以便索引标签不需要匹配:
COLS = ['A', 'B'] # or whichever columns to use for comparison
df1[~df1[COLS].isin(df2[COLS].to_dict(
orient='list')).all(axis=1)]
回答by vmg
~df1['A'].isin(df2['A'])
Should get you the series you want
应该给你你想要的系列
df1[ ~df1['A'].isin(df2['A'])]
The dataframe:
数据框:
A B C
0 A0 B0 C0
回答by EdChum
If your version is 0.17.0
then you can use pd.merge
and pass the cols of interest, how='left' and set indicator=True
to whether the values are only present in left or both. You can then test whether the appended _merge
col is equal to 'both':
如果您的版本是,0.17.0
那么您可以使用pd.merge
并传递感兴趣的列, how='left' 并设置indicator=True
为值是否仅出现在左侧或两者中。然后,您可以测试附加的_merge
col 是否等于 'both':
In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'
Out[102]:
0 False
1 True
2 True
Name: _merge, dtype: bool
In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'
Out[103]:
0 False
1 False
2 True
Name: _merge, dtype: bool
output from the merge:
合并的输出:
In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)
Out[104]:
A B_x C_x B_y C_y _merge
0 A0 B0 C0 NaN NaN left_only
1 A1 B1 C1 B3 C11 both
2 A2 B2 C2 B2 C10 both
In [105]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)
Out[105]:
A B C_x C_y _merge
0 A0 B0 C0 NaN left_only
1 A1 B1 C1 NaN left_only
2 A2 B2 C2 C10 both
回答by Nader Hisham
Method ( 1 )
方法(一)
In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:
0 False
1 False
2 True
Method ( 2 )
方法(二)
you can use the left merge to obtain values that exist in both frames +
values that exist in the first data frame only
您可以使用左合并获取+
仅存在于第一个数据帧中的两个帧中存在的值
In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
A B C_x C_y
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 C10
then of course values that exist only in the first frame will have NAN
values in columns of the other data frame , then you can filter by this NAN
values by doing the following
那么当然仅存在于第一帧中的NAN
值将在另一个数据帧的列中具有值,然后您可以NAN
通过执行以下操作按此值进行过滤
In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
A B C_x
0 A0 B0 C0
1 A1 B1 C1
In [17]:
if you want to get whether the values in A
exists in B
you can do the following
如果您想获取中的值是否A
存在,B
您可以执行以下操作
In [20]:
pd.notnull(left['C_y'])
Out[20]:
0 False
1 False
2 True
Name: C_y, dtype: bool