Python Pandas - 找出两个数据框之间的差异

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/48647534/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 18:47:54  来源:igfitidea点击:

Python Pandas - Find difference between two data frames

pythonpandasdataframe

提问by userPyGeo

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

我有两个数据帧 df1 和 df2,其中 df2 是 df1 的子集。我如何获得一个新的数据帧 (df3),这是两个数据帧之间的差异?

In other word, a data frame that has all the rows/columns in df1 that are not in df2?

换句话说,一个包含 df1 中所有不在 df2 中的行/列的数据框?

enter image description here

在此处输入图片说明

回答by YOBEN_S

By using drop_duplicates

通过使用 drop_duplicates

pd.concat([df1,df2]).drop_duplicates(keep=False)


Update :

Update :

Above method only working for those dataframes they do not have duplicate itself, For example

Above method only working for those dataframes they do not have duplicate itself, For example

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It will output like below , which is wrong

它会输出如下,这是错误的

Wrong Output :

错误的输出:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

正确的输出

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4


How to achieve that?

如何做到这一点?

Method 1: Using isinwith tuple

方法一:使用isinwithtuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: mergewith indicator

方法二:mergeindicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only

回答by jpp

For rows, try this, where Nameis the joint index column (can be a list for multiple common columns, or specify left_onand right_on):

对于行,试试这个,Name联合索引列在哪里(可以是多个公共列的列表,或者指定left_onand right_on):

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)

The indicator=Truesetting is useful as it adds a column called _merge, with all changes between df1and df2, categorized into 3 possible kinds: "left_only", "right_only" or "both".

indicator=True设置很有用,因为它添加了一个名为 的列_merge,其中包含df1和之间的所有更改df2,分为 3 种可能的类型:“left_only”、“right_only”或“both”。

For columns, try this:

对于列,试试这个:

set(df1.columns).symmetric_difference(df2.columns)

回答by toecsnar42

Accepted answerMethod 1 will not work for data frames with NaNs inside, as pd.np.nan != pd.np.nan. I am not sure if this is the best way, but it can be avoided by

接受的答案方法 1 不适用于内部包含 NaN 的数据帧,因为pd.np.nan != pd.np.nan. 我不确定这是否是最好的方法,但可以通过以下方式避免

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

回答by liangli

edit2, I figured out a new solution without the need of setting index

edit2,我想出了一个不需要设置索引的新解决方案

newdf=pd.concat[df1,df2].drop_duplicates(keep=False)

okay i found the answer of hightest vote already contain what i have figured out .Yes, we can only use this code on condition that there are no duplicates in each two dfs.

好吧,我发现最高投票的答案已经包含我想出的内容。是的,我们只能在每两个 dfs 中没有重复项的情况下使用此代码。



I have a tricky method.First we set 'Name' as the index of two dataframe given by the question.Since we have same 'Name' in two dfs,we can just drop the 'smaller' df's index from the ‘bigger' df. Here is the code.

我有一个棘手的方法。首先,我们将“名称”设置为问题给出的两个数据帧的索引。由于我们在两个 dfs 中有相同的“名称”,我们可以从“较大”的 df 中删除“较小”的 df 索引. 这是代码。

df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)

回答by SpeedCoder5

import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
    'Age':[23,12,34,44,28,40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#     Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#     Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#     Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt

回答by Cherif Diallo

Perhaps a simpler one-liner, with identical or different column names. Worked even when df2['Name2'] contained duplicate values.

也许是更简单的单行,具有相同或不同的列名。即使 df2['Name2'] 包含重复值也能工作。

newDf = df1.set_index('Name1')
           .drop(df2['Name2'], errors='ignore')
           .reset_index(drop=False)

回答by DOS

Finding difference by index. Assuming df1 is a subset of df2 and the indexes are carried forward when subsetting

按索引查找差异。假设 df1 是 df2 的子集,并且在子集化时将索引结转

df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

# Example

df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])

df2 =  df1.loc[[1,3,5]]

df1

 gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

df3

  gender subject
2      m    chem
4      m     bio

回答by Luchko

In addition to accepted answer, I would like to propose one more wider solution that can find a 2D set differenceof two dataframes with any index/columns(they might not coincide for both datarames). Also method allows to setup tolerance for floatelements for dataframe comparison (it uses np.isclose)

除了公认的答案之外,我还想提出一个更广泛的解决方案,它可以找到两个数据帧的二维集差异与任何index/ columns(它们可能不符合两个数据帧)。还允许float为数据帧比较的元素设置容差(它使用np.isclose


import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                            df_old: pd.DataFrame, 
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""

    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff

Example:

例子:

In [1]

df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
Out [1]

df1:
   A  C
0  2  2
1  1  1
2  2  2 

df2:
   A  B
0  1  1
1  1  1 

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN 

回答by arun pal

As mentioned herethat

正如这里提到的 那样

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

is correct solution but it will produce wrong output if

是正确的解决方案,但如果它会产生错误的输出

df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})

In that case above solution will give Empty DataFrame, instead you should use concatmethod after removing duplicates from each datframe.

在这种情况下,上述解决方案将提供 Empty DataFrame,而您应该concat在从每个数据帧中删除重复项后使用方法。

Use concate with drop_duplicates

concate with drop_duplicates

df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)

回答by Serge Ballesta

A slight variation of the nice @liangli's solution that does not require to change the index of existing dataframes:

不错的@lianli 解决方案的一个细微变化,不需要更改现有数据帧的索引:

newdf = df1.drop(df1.join(df2.set_index('Name').index))