Python 并排输出两个 Pandas 数据帧中的差异 - 突出显示差异

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/17095101/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 00:25:46  来源:igfitidea点击:

Outputting difference in two Pandas dataframes side by side - highlighting the difference

pythonhtmlpandasdataframepanel

提问by sky

I am trying to highlight exactly what changed between two dataframes.

我试图突出显示两个数据帧之间发生的变化。

Suppose I have two Python Pandas dataframes:

假设我有两个 Python Pandas 数据框:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

My goal is to output an HTML table that:

我的目标是输出一个 HTML 表:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:

    "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"
    
  1. 标识已更改的行(可以是 int、float、boolean、string)
  2. 输出具有相同、旧和新值的行(理想情况下是一个 HTML 表),以便消费者可以清楚地看到两个数据帧之间发生了什么变化:

    "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"
    

I suppose I could do a row by row and column by column comparison, but is there an easier way?

我想我可以逐行和逐列进行比较,但有没有更简单的方法?

采纳答案by Andy Hayden

The first part is similar to Constantine, you can get the boolean of which rows are empty*:

第一部分类似于Constantine,你可以得到哪些行是空的布尔值*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

然后我们可以看到哪些条目发生了变化:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first entry is the index and the second the columns which has been changed.

这里的第一个条目是索引,第二个条目是已更改的列。

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it's important that df1and df2share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.

* 注意:重要的是df1df2在这里共享相同的索引。为了克服这种歧义,您可以确保只使用 来查看共享标签df1.index & df2.index,但我想我会将其留作练习。

回答by cge

If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2will give you a boolean DataFrame where each Trueis data that has changed. From that, you could easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].

如果您的两个数据帧中具有相同的 ID,那么找出更改的内容实际上非常容易。只是这样做frame1 != frame2会给你一个布尔True数据帧,其中每个都是已更改的数据。由此,您可以通过执行 轻松获取每个更改行的索引changedids = frame1.index[np.any(frame1 != frame2,axis=1)]

回答by unutbu

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Hyman   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Hyman   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.BytesIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.BytesIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Hyman   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Hyman   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Hyman   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Hyman   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

prints

印刷

                score    isEnrolled               Comment
id  Name                                                 
111 Hyman         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

回答by journois

I have faced this issue, but found an answer before finding this post :

我遇到过这个问题,但在找到这篇文章之前找到了答案:

Based on unutbu's answer, load your data...

根据 unutbu 的回答,加载您的数据...

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Hyman                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Hyman   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


df1 = pd.read_fwf(io.BytesIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.BytesIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

...define your difffunction...

...定义您的差异函数...

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to conclude :

然后你可以简单地使用一个面板来总结:

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)

#          id  Name        score    isEnrolled                       Date
#0        111  Hyman   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

By the way, if you're in IPython Notebook, you may like to use a colored difffunction to give colors depending whether cells are different, equal or left/right null :

顺便说一句,如果您在 IPython Notebook 中,您可能喜欢使用彩色diff函数根据单元格是否不同、相等或左/右 null 来给出颜色:

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    if x[0]==x[1]:
        return unicode(x[0].__str__())
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

回答by Hubbitus

Extending answer of @cge, which is pretty cool for more readability of result:

扩展@cge 的答案,这对于提高结果的可读性非常酷:

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

Full demonstration example:

完整演示示例:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

回答by James Owers

This answer simply extends @Andy Hayden's, making it resilient to when numeric fields are nan, and wrapping it up into a function.

这个答案只是扩展了@Andy Hayden 的答案,使其在数字字段为 时具有弹性nan,并将其包装成一个函数。

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        "Data Types are different, trying to convert"
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

因此,对于您的数据(稍微编辑以在分数列中包含 NaN):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
diff_pd(df1, df2)

Output:

输出:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

回答by jur

A different approach using concat and drop_duplicates:

使用 concat 和 drop_duplicates 的不同方法:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Hyman   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

输出:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

回答by Aziz Alto

Here is another way using select and merge:

这是使用选择和合并的另一种方法:

In [6]: # first lets create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make condition over the columns you want to comapre
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104

Here is the same thing from a Jupyter screenshot:

以下是 Jupyter 屏幕截图中的相同内容:

enter image description here

在此处输入图片说明

回答by Ted Petrou

Highlighting the difference between two DataFrames

突出显示两个 DataFrame 之间的差异

It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.

可以使用 DataFrame 样式属性突出显示存在差异的单元格的背景颜色。

Using the example data from the original question

使用来自原始问题的示例数据

The first step is to concatenate the DataFrames horizontally with the concatfunction and distinguish each frame with the keysparameter:

第一步是将DataFrames与concat函数水平连接起来,并用keys参数区分每一帧:

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

enter image description here

在此处输入图片说明

It's probably easier to swap the column levels and put the same column names next to each other:

交换列级别并将相同的列名放在一起可能更容易:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

enter image description here

在此处输入图片说明

Now, its much easier to spot the differences in the frames. But, we can go further and use the styleproperty to highlight the cells that are different. We define a custom function to do this which you can see in this part of the documentation.

现在,更容易发现帧中的差异。但是,我们可以进一步使用该style属性来突出显示不同的单元格。我们定义了一个自定义函数来执行此操作,您可以在文档的这一部分中看到。

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

enter image description here

在此处输入图片说明

This will highlight cells that both have missing values. You can either fill them or provide extra logic so that they don't get highlighted.

这将突出显示都有缺失值的单元格。您可以填充它们或提供额外的逻辑,以便它们不会被突出显示。

回答by Aaron N. Brock

After fiddling around with @journois's answer, I was able to get it to work using MultiIndex instead of Panel due to Panel's deprication.

在摆弄@journois 的答案之后,由于Panel 的 deprication,我能够使用 MultiIndex 而不是 Panel 让它工作。

First, create some dummy data:

首先,创建一些虚拟数据:

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

Then, define your difffunction, in this case I'll use the one from his answer report_diffstays the same:

然后,定义您的diff函数,在这种情况下,我将使用他的答案中的report_diff那个保持不变:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then, I'm going to concatenate the data into a MultiIndex dataframe:

然后,我要将数据连接到 MultiIndex 数据帧中:

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]

And finally I'm going to apply the report_diffdown each column group:

最后,我将应用report_diff向下的每个列组:

df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

This outputs:

这输出:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

And that is all!

这就是全部!