Python: comparing two dataframes and getting the differences
Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/20225110/
Comparing two dataframes and getting the differences
Asked by Eric D. Brown
I have two dataframes. Examples:
df1:
Date        Fruit   Num   Color
2013-11-24  Banana  22.1  Yellow
2013-11-24  Orange  8.6   Orange
2013-11-24  Apple   7.6   Green
2013-11-24  Celery  10.2  Green
df2:
Date        Fruit   Num   Color
2013-11-24  Banana  22.1  Yellow
2013-11-24  Orange  8.6   Orange
2013-11-24  Apple   7.6   Green
2013-11-24  Celery  10.2  Green
2013-11-25  Apple   22.1  Red
2013-11-25  Orange  8.6   Orange
Each dataframe has the Date as an index. Both dataframes have the same structure.
What I want to do is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, Apple, etc.) to see if they exist in df2 vs df1.
I have tried the following:
- Outputting difference in two Pandas dataframes side by side - highlighting the difference
- Comparing two pandas dataframes for differences
For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.
On the third approach, I get the assert to return False but cannot figure out how to actually see the different rows.
Any pointers would be welcome.
Accepted answer by alko
This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any differences are found, even in column/index order.
If I understand you correctly, you want to find not the changes but the symmetric difference. For that, one approach might be to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
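The steps above can be put together into one self-contained script. This is only a minimal sketch: it assumes Date is kept as an ordinary column (rather than the index) so that it takes part in the row comparison, and the name sym_diff is illustrative, not from the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4 + ['2013-11-25'] * 2,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

# Stack both frames and give every row a unique label (0..9).
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; each group holds the labels of its rows.
df_gpby = df.groupby(list(df.columns))

# A group of size 1 is a row that occurs in only one of the two frames.
idx = sorted(x[0] for x in df_gpby.groups.values() if len(x) == 1)

sym_diff = df.loc[idx]
print(sym_diff)
```

Because the labels are unique after reset_index(drop=True), selecting by label with .loc is safe here.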
Answer by eyquem
I came up with this solution. Does it help you?
text = """df1:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
argetz45
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 118.6 Orange
2013-11-24 Apple 74.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Nuts 45.8 Brown
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
2013-11-26 Pear 102.54 Pale"""
from collections import OrderedDict
import re

# One match per named block: group(1) is the block name, group(2) its dated rows.
r = re.compile(r'([a-zA-Z\d]+).*\n'
               r'(20\d\d-[01]\d-[0123]\d.+\n?'
               r'(.+\n?)*)'
               r'(?=[ \n]*\Z'
               r'|'
               r'\n+[a-zA-Z\d]+.*\n'
               r'20\d\d-[01]\d-[0123]\d)')

# One match per row: (whole row, date, fruit name).
r2 = re.compile(r'((20\d\d-[01]\d-[0123]\d) +([^\d.]+)(?<! )[^\n]+)')

d = OrderedDict()
bef = []  # (date, fruit) pairs already seen in earlier blocks

for m in r.finditer(text):
    li = []
    for x in r2.findall(m.group(2)):
        if not any(x[1:3] == elbef for elbef in bef):
            bef.append(x[1:3])
            li.append(x[0])
    d[m.group(1)] = li

for name, lu in d.items():
    print('%s\n%s\n' % (name, '\n'.join(lu)))
result
df1
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
argetz45
2013-11-25 Nuts 45.8 Brown
2013-11-26 Pear 102.54 Pale
Answer by fnl
Building on alko's answer, which almost worked for me except for the filtering step (where I get ValueError: cannot reindex from a duplicate axis), here is the final solution I used:
# join the dataframes
united_data = pd.concat([data1, data2, data3, ...])
# group the data by the whole row to find duplicates
united_data_grouped = united_data.groupby(list(united_data.columns))
# detect the row indices of unique rows
uniq_data_idx = [x[0] for x in united_data_grouped.indices.values() if len(x) == 1]
# extract those unique values
uniq_data = united_data.iloc[uniq_data_idx]
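To see why .iloc is the matching accessor here, the following is a minimal sketch with two small made-up frames (data1 and data2 are cut-down stand-ins, and uniq_pos is an illustrative name). It relies on groupby(...).indices returning integer positions rather than index labels, which is what makes it usable even when the concatenated index contains duplicate labels.

```python
import pandas as pd

data1 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24'],
                      'Fruit': ['Banana', 'Orange'],
                      'Num': [22.1, 8.6]})
data2 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24', '2013-11-25'],
                      'Fruit': ['Banana', 'Orange', 'Apple'],
                      'Num': [22.1, 8.6, 22.1]})

# No reset_index here, so the labels 0 and 1 each appear twice.
united_data = pd.concat([data1, data2])
print(united_data.index.is_unique)

# .indices maps each group key to integer positions (not labels).
grouped = united_data.groupby(list(united_data.columns))
uniq_pos = sorted(pos[0] for pos in grouped.indices.values() if len(pos) == 1)

# Positional selection is unaffected by the duplicate labels.
uniq_data = united_data.iloc[uniq_pos]
print(uniq_data)
```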
Answer by jur
Passing the dataframes to concat in a dictionary results in a multi-index dataframe from which you can easily delete the duplicates. What remains is a multi-index dataframe containing the differences between the dataframes:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange""")

# split columns on runs of whitespace
df1 = pd.read_table(DF1, sep=r'\s+')
df2 = pd.read_table(DF2, sep=r'\s+')

# the dictionary keys become the outer level of the resulting MultiIndex
dfs_dictionary = {'DF1': df1, 'DF2': df2}
df = pd.concat(dfs_dictionary)
# keep=False drops every row that has a duplicate, leaving only the differences
df.drop_duplicates(keep=False)
Result:
             Date   Fruit   Num   Color
DF2 4  2013-11-25   Apple  22.1     Red
    5  2013-11-25  Orange   8.6  Orange
Answer by ntg
There is a simpler and faster solution which, if the numbers differ, can even give you the differences in quantities:
df1_i = df1.set_index(['Date','Fruit','Color'])
df2_i = df2.set_index(['Date','Fruit','Color'])
df_diff = df1_i.join(df2_i,how='outer',rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])
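Here df_diff is a synopsis of the differences. You can even use it to find the differences in quantities. In your example, the four rows common to both frames come out as 0.0, while the two rows present only in df2 come out as -22.1 and -8.6 (0 minus the df2 quantity, because of fillna(0)).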
Explanation: as with comparing two lists, to do it efficiently we should first sort them and then compare them (converting the lists to sets, i.e. hashing, would also be fast). Both are a huge improvement over the simple O(N^2) double comparison loop.
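To make the hashing idea concrete, here is a minimal sketch that compares only the (Date, Fruit) keys through a MultiIndex, so membership is tested with isin instead of an explicit nested loop. The frames are cut-down stand-ins for the question's data, and the names keys1, keys2 and only_in_df2 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24'],
                    'Fruit': ['Banana', 'Orange'],
                    'Num': [22.1, 8.6]})
df2 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24', '2013-11-25'],
                    'Fruit': ['Banana', 'Orange', 'Apple'],
                    'Num': [22.1, 8.6, 22.1]})

# Build one composite key per row from the columns being compared.
keys1 = pd.MultiIndex.from_frame(df1[['Date', 'Fruit']])
keys2 = pd.MultiIndex.from_frame(df2[['Date', 'Fruit']])

# Boolean array: True where a df2 key also occurs among the df1 keys.
in_df1 = keys2.isin(keys1)

# Rows of df2 whose (Date, Fruit) key never appears in df1.
only_in_df2 = df2[~in_df1]
print(only_in_df2)
```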
Note: the following code produces the tables:
df1 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})
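Running the join on the two example frames gives a feel for what df_diff holds. The sketch below repeats the frames so it runs on its own; the names keys, joined and changed are illustrative additions, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4 + ['2013-11-25'] * 2,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

keys = ['Date', 'Fruit', 'Color']

# Outer join on the key columns; the right-hand 'Num' becomes 'Num_'.
# Caveat: fillna(0) makes a missing row look the same as a quantity of 0.
joined = df1.set_index(keys).join(df2.set_index(keys), how='outer', rsuffix='_').fillna(0)
df_diff = joined['Num'] - joined['Num_']

# Keep only the keys whose quantities differ between the two frames.
changed = df_diff[df_diff != 0]
print(changed)
```

If a genuine quantity of 0 can occur in the data, filling with 0 is ambiguous; checking the joined frame for NaN before calling fillna would be one way to tell the two cases apart.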
Answer by gandreoti
One important detail to notice is that your data has duplicate index values, so to perform any straightforward comparison we first need to make every row label unique with df.reset_index(); after that we can select rows based on conditions. Since the index is defined in your case, I assume you would like to keep it, so there is a one-line solution:
df2.reset_index()[~df2.reset_index().isin(df1.reset_index())].dropna().set_index('Date')
Since the goal, from a Pythonic perspective, is to improve readability, we can break it up a little:
# keep the index name, if it does not have a name it uses the default name
index_name = df2.index.name if df2.index.name else 'index'
# setting the index to become unique
df1 = df1.reset_index()
df2 = df2.reset_index()
# getting the differences to a Dataframe
df_diff = df2[~df2.isin(df1)].dropna().set_index(index_name)
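One thing worth checking before relying on this: as the pandas documentation describes it, DataFrame.isin called with another DataFrame matches on both index and column labels, so the comparison is made label by label rather than asking whether a row appears anywhere in the other frame. A minimal sketch with made-up, reordered data (not the question's frames) shows the effect:

```python
import pandas as pd

df1 = pd.DataFrame({'Fruit': ['Banana', 'Orange'], 'Num': [22.1, 8.6]})
df2 = pd.DataFrame({'Fruit': ['Orange', 'Banana', 'Apple'], 'Num': [8.6, 22.1, 7.6]})

# Rows 0 and 1 of df2 exist in df1, but under different labels,
# so the label-aligned isin reports no match for them.
mask = df2.isin(df1)
print(mask)

# Every row of df2 therefore survives as a "difference".
df_diff = df2[~mask].dropna()
print(df_diff)
```

So the approach fits the question's data, where the shared rows sit at the same positions in both frames, but it would need a sort or a key-based comparison if the row order could differ.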
Answer by SpeedCoder5
# given
df1 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
                    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
                    'Num': [22.1, 8.6, 7.6, 10.2],
                    'Color': ['Yellow', 'Orange', 'Green', 'Green']})
df2 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
                    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
                    'Num': [22.1, 8.6, 7.6, 1000, 22.1, 8.6],
                    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange']})
# find which rows are in df2 that aren't in df1 by Date and Fruit
df_2notin1 = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']) )].dropna().reset_index(drop=True)
# output
print('df_2notin1\n', df_2notin1)
# Color Date Fruit Num
# 0 Red 2013-11-25 Apple 22.1
# 1 Orange 2013-11-25 Orange 8.6
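Note that the expression above tests Date and Fruit independently, so a (Date, Fruit) pair can go unflagged when each value occurs somewhere in df1 but never together. The sketch below uses small made-up frames to show that case, followed by one possible pair-wise alternative built on merge with indicator=True; the names columnwise, merged and pairwise are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-25'],
                    'Fruit': ['Banana', 'Apple']})
df2 = pd.DataFrame({'Date': ['2013-11-24', '2013-11-25', '2013-11-24'],
                    'Fruit': ['Banana', 'Apple', 'Apple']})

# Column-wise isin: '2013-11-24' and 'Apple' each occur somewhere in df1,
# so the new pair ('2013-11-24', 'Apple') is not flagged.
columnwise = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']))]
print(columnwise)

# Pair-wise anti-join: keep rows of df2 whose (Date, Fruit) pair has no match in df1.
merged = df2.merge(df1[['Date', 'Fruit']].drop_duplicates(),
                   on=['Date', 'Fruit'], how='left', indicator=True)
pairwise = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(pairwise)
```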
Answer by Tom2shoes
Found a simple solution here:
https://stackoverflow.com/a/47132808/9656339
pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
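A minimal sketch of how this one-liner behaves when Date is the index, as in the question. The frames are cut-down, made-up stand-ins and the names new_dates and diff are illustrative. Only index labels are compared, so a row whose values changed on a date present in both frames would not show up.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Fruit': ['Banana', 'Orange'], 'Num': [22.1, 8.6]},
    index=pd.Index(['2013-11-24', '2013-11-24'], name='Date'))
df2 = pd.DataFrame(
    {'Fruit': ['Banana', 'Orange', 'Apple', 'Orange'], 'Num': [22.1, 8.6, 22.1, 8.6]},
    index=pd.Index(['2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'], name='Date'))

# Index labels that occur in exactly one of the two frames.
new_dates = df1.index.symmetric_difference(df2.index)

# Select every row carrying one of those labels.
diff = pd.concat([df1, df2]).loc[new_dates]
print(diff)
```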
Answer by Carson
Hope this would be useful to you. ^o^
import pandas as pd

df1 = pd.DataFrame({'date': ['0207', '0207'], 'col1': [1, 2]})
df2 = pd.DataFrame({'date': ['0207', '0207', '0208', '0208'], 'col1': [1, 2, 3, 4]})
print(f"df1(Before):\n{df1}\ndf2:\n{df2}")
"""
df1(Before):
   date  col1
0  0207     1
1  0207     2
df2:
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""
old_set = set(df1.index.values)
new_set = set(df2.index.values)
# index labels that exist only in df2, in a stable order
new_data_index = sorted(new_set - old_set)
if new_data_index:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df1 = pd.concat([df1, df2.loc[new_data_index]])
print(f"df1(After):\n{df1}")
"""
df1(After):
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""
Answer by debugging XD
I tried this method, and it worked. I hope it can help too:
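"""Identify differences between two pandas DataFrames"""
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
# place the two frames side by side under the top-level column keys 'First' and 'Second'
df_all = pd.concat([df1, df2], axis='columns', keys=['First', 'Second'])
# put the original column names on top and skip the first column
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]
col = 'change this to one of the columns'
# keep the rows where that column differs between the two frames
df_final[df_final[col]['First'] != df_final[col]['Second']]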
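For a concrete run of this side-by-side idea, here is a minimal sketch with two small made-up frames that share the same unique index, with 'Num' standing in for the column to check. It assumes the intent is to compare the 'First' and 'Second' copies of that column; the name changed is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Fruit': ['Banana', 'Orange', 'Apple'], 'Num': [22.1, 8.6, 7.6]})
df2 = pd.DataFrame({'Fruit': ['Banana', 'Orange', 'Apple'], 'Num': [22.1, 118.6, 7.6]})

# Side by side: columns become (First, Fruit), (First, Num), (Second, Fruit), (Second, Num).
df_all = pd.concat([df1, df2], axis='columns', keys=['First', 'Second'])

# Swap the levels so the column name is on top, then keep every column but the first.
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]

# Rows where 'Num' differs between the first and the second frame.
changed = df_final[df_final['Num']['First'] != df_final['Num']['Second']]
print(changed)
```

This layout relies on both frames having matching, unique index labels, since concat along the columns aligns rows by index.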


