Python: compare two dataframes and get the differences

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/20225110/

Time: 2020-08-18 19:55:43  Source: igfitidea

Comparing two dataframes and getting the differences

Tags: python, pandas, dataframe

Asked by Eric D. Brown

I have two dataframes. Examples:

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange

Each dataframe has the Date as an index. Both dataframes have the same structure.

What I want to do is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, Apple, etc.) to see whether they exist in df2 vs. df1.

I have tried the following:

For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.

With the third approach, the assert returns False, but I cannot figure out how to actually see the rows that differ.

Any pointers would be welcome.

Accepted answer by alko

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if differences are found, even in the order of the columns or indices.
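
A minimal sketch of that failure, assuming a recent pandas release in which the exception is a ValueError. The two small frames (`left`, `right`) are made up for illustration and are not the question's data:

```python
import pandas as pd

# Two hypothetical frames with the same columns but different row labels.
left = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]})
right = pd.DataFrame({"Fruit": ["Banana", "Orange", "Apple"],
                      "Num": [22.1, 8.6, 7.6]})

try:
    left != right
    raised = False
except ValueError as exc:
    # pandas refuses an element-wise comparison when the axes differ
    raised = True
    print(exc)

# Frames with identical labels compare element-wise without error.
same_labels = left != left.copy()
print(same_labels)
```

The second comparison succeeds because both operands share the same index and columns, which is the only case `!=` handles.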

If I understood you correctly, you want to find not the changes but the symmetric difference. For that, one approach might be to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

Group by all columns:

>>> df_gpby = df.groupby(list(df.columns))

Get the index of the unique records:

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

Filter:

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
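
The steps above can be put together as one script. This is only a sketch: it assumes Date is an ordinary column (as in the output shown) and rebuilds the question's tables by hand; the names `groups`, `idx` and `unique_rows` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

# Stack both frames and give every row a unique label (0..9).
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group on the whole row: identical rows fall into the same group.
groups = df.groupby(list(df.columns)).groups

# Keep the label of every row that appears exactly once.
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)

unique_rows = df.loc[idx]
print(unique_rows)
```

Note that groupby drops rows whose key contains NaN by default, so this sketch assumes the frames have no missing values.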

Answer by eyquem

I got this solution. Does this help you?

text = """df1:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange



argetz45
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 118.6 Orange
2013-11-24 Apple 74.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25     Nuts    45.8 Brown
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
2013-11-26   Pear 102.54    Pale"""

from collections import OrderedDict
import re

r = re.compile(r'([a-zA-Z\d]+).*\n'
               r'(20\d\d-[01]\d-[0123]\d.+\n?'
               r'(.+\n?)*)'
               r'(?=[ \n]*\Z'
                  r'|'
                  r'\n+[a-zA-Z\d]+.*\n'
                  r'20\d\d-[01]\d-[0123]\d)')

r2 = re.compile(r'((20\d\d-[01]\d-[0123]\d) +([^\d.]+)(?<! )[^\n]+)')

d = OrderedDict()
bef = []

for m in r.finditer(text):
    li = []
    for x in r2.findall(m.group(2)):
        if not any(x[1:3]==elbef for elbef in bef):
            bef.append(x[1:3])
            li.append(x[0])
    d[m.group(1)] = li


for name, lu in d.items():
    print('%s\n%s\n' % (name, '\n'.join(lu)))

Result:

df1
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange

argetz45
2013-11-25     Nuts    45.8 Brown
2013-11-26   Pear 102.54    Pale

Answer by fnl

Building on alko's answer, which almost worked for me except for the filtering step (where I get ValueError: cannot reindex from a duplicate axis), here is the final solution I used:

# join the dataframes
united_data = pd.concat([data1, data2, data3, ...])
# group the data by the whole row to find duplicates
united_data_grouped = united_data.groupby(list(united_data.columns))
# detect the row indices of unique rows
uniq_data_idx = [x[0] for x in united_data_grouped.indices.values() if len(x) == 1]
# extract those unique values
uniq_data = united_data.iloc[uniq_data_idx]
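
As a sketch of why this variant avoids the duplicate-axis error: `groupby(...).indices` yields integer positions rather than labels, so `iloc` can select rows even when labels repeat after concat. The two frames below stand in for `data1` and `data2` and reuse the question's tables; `positions` and `uniq` are names picked here for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
data2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

# No reset_index here: labels run 0..3 and then 0..5, so they repeat.
united = pd.concat([data1, data2])
grouped = united.groupby(list(united.columns))

# .indices maps each group key to an array of integer row positions.
positions = sorted(pos[0] for pos in grouped.indices.values() if len(pos) == 1)

# Positional selection is unaffected by the repeated labels.
uniq = united.iloc[positions]
print(uniq)
```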

Answer by jur

Passing the dataframes to concat in a dictionary results in a multi-index dataframe from which you can easily delete the duplicates, leaving a multi-index dataframe with the differences between the dataframes:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange""")


df1 = pd.read_table(DF1, sep=r'\s+')
df2 = pd.read_table(DF2, sep=r'\s+')
#%%
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)

Result:

             Date   Fruit   Num   Color
DF2 4  2013-11-25   Apple  22.1     Red
    5  2013-11-25  Orange   8.6  Orange

Answer by ntg

There is a simpler and faster solution which, if the numbers differ, can even give you the differences in quantity:

df1_i = df1.set_index(['Date','Fruit','Color'])
df2_i = df2.set_index(['Date','Fruit','Color'])
df_diff = df1_i.join(df2_i,how='outer',rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])

Here df_diff is a summary of the differences. You can even use it to find the differences in quantities. In your example:

[Image in the original answer: df_diff for the question's example]
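
A runnable sketch of the join above on the question's tables, assuming Date is an ordinary column. The names `keys`, `joined` and `changed` are introduced here for illustration; a row that exists on only one side gets 0 from fillna, so it shows up as a nonzero difference.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

keys = ["Date", "Fruit", "Color"]
df1_i = df1.set_index(keys)
df2_i = df2.set_index(keys)

# Outer join keeps keys from both sides; df2's Num column becomes Num_.
joined = df1_i.join(df2_i, how="outer", rsuffix="_").fillna(0)
df_diff = joined["Num"] - joined["Num_"]

# Nonzero entries are rows that are missing on one side or differ in Num.
changed = df_diff[df_diff != 0]
print(changed)
```

One caveat of this design: a row whose Num is genuinely 0 on one side cannot be told apart from a missing row after fillna(0).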

Explanation: as with comparing two lists, to do it efficiently we should first order them and then compare them (converting the lists to sets/hashing would also be fast; both are a huge improvement over the simple O(N^2) double comparison loop).

Note: the following code produces the tables:

df1=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
})
df2=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})

Answer by gandreoti

One important detail to notice is that your data has duplicate index values, so to perform any straightforward comparison we need to make every row label unique with df.reset_index(); then we can perform selections based on conditions. Since in your case the index is defined, I assume that you would like to keep the index, so there is a one-line solution:

df2.reset_index()[~df2.reset_index().isin(df1.reset_index())].dropna().set_index('Date')

Since the objective from a Pythonic perspective is to improve readability, we can break it up a little:

# keep the index name; if the index has no name, reset_index() uses the default name 'index'
index_name = df2.index.name if df2.index.name else 'index'

# setting the index to become unique
df1 = df1.reset_index()
df2 = df2.reset_index()

# getting the differences to a Dataframe
df_diff = df2[~df2.isin(df1)].dropna().set_index(index_name)

Answer by SpeedCoder5

# given
df1=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green']})
df2=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,1000,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange']})

# find which rows are in df2 that aren't in df1 by Date and Fruit
df_2notin1 = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']) )].dropna().reset_index(drop=True)

# output
print('df_2notin1\n', df_2notin1)
#      Color        Date   Fruit   Num
# 0     Red  2013-11-25   Apple  22.1
# 1  Orange  2013-11-25  Orange   8.6
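
One thing to keep in mind with the snippet above is that the two isin calls test Date and Fruit independently, so a row could be treated as present when its date and its fruit each occur somewhere in df1 but never together. A possible alternative, shown here only as a sketch and not as part of the original answer, is a left merge with `indicator=True`, which compares the (Date, Fruit) pairs as a unit:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 1000, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

keys = ["Date", "Fruit"]

# Left-merge df2 against the distinct key pairs of df1; the _merge column
# records whether each df2 row found a matching pair in df1.
merged = df2.merge(df1[keys].drop_duplicates(), on=keys, how="left",
                   indicator=True)

# "left_only" marks rows of df2 whose (Date, Fruit) pair is absent from df1.
df_2notin1 = merged.loc[merged["_merge"] == "left_only", df2.columns]
df_2notin1 = df_2notin1.reset_index(drop=True)
print(df_2notin1)
```

The Celery row with Num 1000 is not reported, because only Date and Fruit are compared, as the question asks.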

Answer by Tom2shoes

Found a simple solution here:

https://stackoverflow.com/a/47132808/9656339

pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]

Answer by Carson

Hope this would be useful to you. ^o^

df1 = pd.DataFrame({'date': ['0207', '0207'], 'col1': [1, 2]})
df2 = pd.DataFrame({'date': ['0207', '0207', '0208', '0208'], 'col1': [1, 2, 3, 4]})
print(f"df1(Before):\n{df1}\ndf2:\n{df2}")
"""
df1(Before):
   date  col1
0  0207     1
1  0207     2

df2:
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

old_set = set(df1.index.values)
new_set = set(df2.index.values)
new_data_index = new_set - old_set
# select the rows of df2 whose index labels are missing from df1
new_rows = df2.loc[sorted(new_data_index)]

if len(new_rows) > 0:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df1 = pd.concat([df1, new_rows])
print(f"df1(After):\n{df1}")
"""
df1(After):
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

Answer by debugging XD

I tried this method, and it worked. I hope it can help too:

"""Identify differences between two pandas DataFrames"""
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
df_all = pd.concat([df1, df2], axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]
df_final[df_final['change this to one of the columns'] != df_final['change this to one of the columns']]