Python Pandas - 找出两个数据框之间的差异
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/48647534/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python Pandas - Find difference between two data frames
提问by userPyGeo
I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
我有两个数据帧 df1 和 df2,其中 df2 是 df1 的子集。我如何获得一个新的数据帧 (df3),这是两个数据帧之间的差异?
In other word, a data frame that has all the rows/columns in df1 that are not in df2?
换句话说,一个包含 df1 中所有不在 df2 中的行/列的数据框?
回答by YOBEN_S
By using drop_duplicates
通过使用 drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update :
Update :
Above method only working for those dataframes they do not have duplicate itself, For example
Above method only working for those dataframes they do not have duplicate itself, For example
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will output like below , which is wrong
它会输出如下,这是错误的
Wrong Output :
错误的输出:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
正确的输出
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
如何做到这一点?
Method 1: Using isin
with tuple
方法一:使用isin
withtuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge
with indicator
方法二:merge
用indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
回答by jpp
For rows, try this, where Name
is the joint index column (can be a list for multiple common columns, or specify left_on
and right_on
):
对于行,试试这个,Name
联合索引列在哪里(可以是多个公共列的列表,或者指定left_on
and right_on
):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True
setting is useful as it adds a column called _merge
, with all changes between df1
and df2
, categorized into 3 possible kinds: "left_only", "right_only" or "both".
该indicator=True
设置很有用,因为它添加了一个名为 的列_merge
,其中包含df1
和之间的所有更改df2
,分为 3 种可能的类型:“left_only”、“right_only”或“both”。
For columns, try this:
对于列,试试这个:
set(df1.columns).symmetric_difference(df2.columns)
回答by toecsnar42
Accepted answerMethod 1 will not work for data frames with NaNs inside, as pd.np.nan != pd.np.nan
. I am not sure if this is the best way, but it can be avoided by
接受的答案方法 1 不适用于内部包含 NaN 的数据帧,因为pd.np.nan != pd.np.nan
. 我不确定这是否是最好的方法,但可以通过以下方式避免
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
回答by liangli
edit2, I figured out a new solution without the need of setting index
edit2,我想出了一个不需要设置索引的新解决方案
newdf=pd.concat[df1,df2].drop_duplicates(keep=False)
okay i found the answer of hightest vote already contain what i have figured out .Yes, we can only use this code on condition that there are no duplicates in each two dfs.
好吧,我发现最高投票的答案已经包含我想出的内容。是的,我们只能在每两个 dfs 中没有重复项的情况下使用此代码。
I have a tricky method.First we set 'Name' as the index of two dataframe given by the question.Since we have same 'Name' in two dfs,we can just drop the 'smaller' df's index from the ‘bigger' df. Here is the code.
我有一个棘手的方法。首先,我们将“名称”设置为问题给出的两个数据帧的索引。由于我们在两个 dfs 中有相同的“名称”,我们可以从“较大”的 df 中删除“较小”的 df 索引. 这是代码。
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
回答by SpeedCoder5
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
'Age':[23,12,34,44,28,40]})
# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt
回答by Cherif Diallo
Perhaps a simpler one-liner, with identical or different column names. Worked even when df2['Name2'] contained duplicate values.
也许是更简单的单行,具有相同或不同的列名。即使 df2['Name2'] 包含重复值也能工作。
newDf = df1.set_index('Name1')
.drop(df2['Name2'], errors='ignore')
.reset_index(drop=False)
回答by DOS
Finding difference by index. Assuming df1 is a subset of df2 and the indexes are carried forward when subsetting
按索引查找差异。假设 df1 是 df2 的子集,并且在子集化时将索引结转
df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
# Example
df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])
df2 = df1.loc[[1,3,5]]
df1
gender subject
1 f bio
2 m chem
3 f phy
4 m bio
5 f bio
df2
gender subject
1 f bio
3 f phy
5 f bio
df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
df3
gender subject
2 m chem
4 m bio
回答by Luchko
In addition to accepted answer, I would like to propose one more wider solution that can find a 2D set differenceof two dataframes with any index
/columns
(they might not coincide for both datarames). Also method allows to setup tolerance for float
elements for dataframe comparison (it uses np.isclose
)
除了公认的答案之外,我还想提出一个更广泛的解决方案,它可以找到两个数据帧的二维集差异与任何index
/ columns
(它们可能不符合两个数据帧)。还允许float
为数据帧比较的元素设置容差(它使用np.isclose
)
import numpy as np
import pandas as pd
def get_dataframe_setdiff2d(df_new: pd.DataFrame,
df_old: pd.DataFrame,
rtol=1e-03, atol=1e-05) -> pd.DataFrame:
"""Returns set difference of two pandas DataFrames"""
union_index = np.union1d(df_new.index, df_old.index)
union_columns = np.union1d(df_new.columns, df_old.columns)
new = df_new.reindex(index=union_index, columns=union_columns)
old = df_old.reindex(index=union_index, columns=union_columns)
mask_diff = ~np.isclose(new, old, rtol, atol)
df_bool = pd.DataFrame(mask_diff, union_index, union_columns)
df_diff = pd.concat([new[df_bool].stack(),
old[df_bool].stack()], axis=1)
df_diff.columns = ["New", "Old"]
return df_diff
Example:
例子:
In [1]
df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]
df1:
A C
0 2 2
1 1 1
2 2 2
df2:
A B
0 1 1
1 1 1
diff:
New Old
0 A 2.0 1.0
B NaN 1.0
C 2.0 NaN
1 B NaN 1.0
C 1.0 NaN
2 A 2.0 NaN
C 2.0 NaN
回答by arun pal
As mentioned herethat
正如这里提到的 那样
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is correct solution but it will produce wrong output if
是正确的解决方案,但如果它会产生错误的输出
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case above solution will give
Empty DataFrame, instead you should use concat
method after removing duplicates from each datframe.
在这种情况下,上述解决方案将提供
Empty DataFrame,而您应该concat
在从每个数据帧中删除重复项后使用方法。
Use concate with drop_duplicates
用 concate with drop_duplicates
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)
回答by Serge Ballesta
A slight variation of the nice @liangli's solution that does not require to change the index of existing dataframes:
不错的@lianli 解决方案的一个细微变化,不需要更改现有数据帧的索引:
newdf = df1.drop(df1.join(df2.set_index('Name').index))