Diff between two dataframes in pandas

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/47131361/

Time: 2020-09-14 04:44:14  Source: igfitidea

Tags: python, pandas, merge, compare, diff

Asked by mhy

I have two dataframes, both of which have the same basic schema (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.

What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.

I tried using pandas.merge(how='outer'), but I was not sure what column to pass in as the 'key', as there really isn't one, and the various combinations I tried did not work. It is possible that df1 or df2 has two (or more) identical rows.

What is a good way to do this in pandas/Python?

Answered by niceGuy

Try this:

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

diff_df = diff_df.loc[diff_df['Exist'] != 'both']

You will get a dataframe of all rows that do not exist in both df1 and df2, that is, rows found in only one of them.
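
As a minimal sketch of what the indicator column holds, the example below runs the same two steps on two small made-up frames (the Fruits/Quantity data is an illustrative assumption, not from the question). With no on= argument, pd.merge joins on every column the frames share, so a row is marked 'both' only when all of its columns match.

```python
import pandas as pd

# Illustrative data (assumption): two frames with the same columns.
df1 = pd.DataFrame({"Fruits": ["apple", "banana", "coconut"], "Quantity": [1, 2, 3]})
df2 = pd.DataFrame({"Fruits": ["apple", "banana", "durian"], "Quantity": [1, 3, 4]})

# indicator='Exist' adds a column whose values are 'left_only', 'right_only' or 'both'.
merged = pd.merge(df1, df2, how="outer", indicator="Exist")

# Keep only the rows that are missing from one of the two frames.
diff_df = merged.loc[merged["Exist"] != "both"]
print(diff_df)
```

The 'Exist' column also tells you which side each remaining row came from, which is handy when the two frames have different lengths.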

Answered by piRSquared

IIUC:
You can use pd.Index.symmetric_difference, which returns the index labels that appear in only one of the two indexes.

pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
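
Note that this compares index labels, not row values, so it fits when the index identifies each row. A minimal sketch, assuming made-up frames whose index is a hypothetical 'id' key:

```python
import pandas as pd

# Illustrative data (assumption): the index acts as the row key.
df1 = pd.DataFrame({"Quantity": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="id"))
df2 = pd.DataFrame({"Quantity": [2, 3, 4]}, index=pd.Index([20, 30, 40], name="id"))

# Labels present in exactly one of the two indexes.
only_one_side = df1.index.symmetric_difference(df2.index)

# Select those labels from the stacked frames.
diff_df = pd.concat([df1, df2]).loc[only_one_side]
print(diff_df)
```

Rows that share a label but differ in their values are not reported by this approach.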

Answered by Ji Wei

You can use this function. Its output is an ordered dict of 6 dataframes, which you can write to Excel for further analysis.

  • 'df1' and 'df2' refer to your input dataframes.
  • 'uid' refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
  • 'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the code comments).
  • 'labels' (default=('df1','df2')) lets you name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see these rows: the function puts them one on top of the other and labels each row with the name of the dataframe it belongs to.
  • 'drop' takes a list of columns to exclude when computing the difference.

Here goes:

df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')

In [10]: dict1['df1_only']
Out[10]:
    Fruits Quantity
1  coconut        3

In [11]: dict1['df2_only']
Out[11]:
   Fruits Quantity
3  durian        4

In [12]: dict1['Diff']
Out[12]:
   Fruits Quantity df1 or df2
0  banana        2        df1
1  banana        3        df2

In [13]: dict1['Merge']
Out[13]:
  Fruits Quantity
0  apple        1

Here is the code:

import pandas as pd
from collections import OrderedDict as od

def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print ('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]


    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0]==len_lr[1], \
    'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same
    #           (order doesn't matter)
    assert set(col1)==set(col2), \
    'Left column headers are different from right column headers.' \
       +'\n   Left orphans: {}'.format(list(set(col1)-set(col2))) \
       +'\n   Right orphans: {}'.format(list(set(col2)-set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print ('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatype are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print ('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]:df1.dtypes,labels[1]:df2.dtypes,'Diff':(df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff']==False][[labels[0],labels[1],'Diff']]
        print (df_dtypes)
    else:
        print ('DataType check: Passed')

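    # Step 4 - Check for duplicate rows
    if dedupe:
        if df1.duplicated().any():
            print(labels[0] + ': Duplicates exist, they will be dropped.')
            df1 = df1.drop_duplicates()
        if df2.duplicated().any():
            print(labels[1] + ': Duplicates exist, they will be dropped.')
            df2 = df2.drop_duplicates()
    # Keep the lookup used by Step 5 in sync with the frames that get merged.
    dict_df = {labels[0]: df1, labels[1]: df2}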

    # Step 5 - Check for duplicate uids.
    if type(uid)==str or type(uid)==list:
        print ('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            # Show a whole number when 100% of uids are unique, otherwise one decimal place.
            ndigits = 0 if count_uid_unique == count_uid else 1
            pct = round(100*count_uid_unique/df.shape[0], ndigits)
            print ('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. '''Remember to dedupe, provide labels for common_no_match'''
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print ('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid)==str:
            uid = [uid]

        if type(uid)==list:
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated']=df1_only.duplicated(subset=uid, keep=False)  #keep=False, marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated']==False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated']=df2_only.duplicated(subset=uid, keep=False)
            df2_only = df2_only[df2_only['Duplicated']==False]

            label = labels[0]+' or '+labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated']==True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated']==False]

            df_uc_left = df_uc[df_uc[label]==labels[0]]
            df_uc_right = df_uc[df_uc[label]==labels[1]]

            dict_result[labels[0]+'_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1]+'_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)

    return dict_result

Answered by cs95

  1. Set df2.columns = df1.columns

  2. Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.

  3. You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are the rows found only in df1 and only in df2, respectively.
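
The three steps can be sketched as follows. The frames and their column names are made-up assumptions for illustration, and to_frame(index=False) is added here only to turn each resulting index back into an ordinary dataframe.

```python
import pandas as pd

# Illustrative data (assumption): same schema, but different column names.
df1 = pd.DataFrame({"Fruits": ["apple", "banana", "coconut"], "Quantity": [1, 2, 3]})
df2 = pd.DataFrame({"Fruit": ["apple", "banana", "durian"], "Qty": [1, 3, 4]})

# Step 1: give both frames the same column names.
df2.columns = df1.columns

# Step 2: move every column into the index, so each row becomes one index entry.
idx1 = df1.set_index(df1.columns.tolist()).index
idx2 = df2.set_index(df2.columns.tolist()).index

# Step 3: index entries (rows) present on one side only.
only_in_df1 = idx1.difference(idx2).to_frame(index=False)
only_in_df2 = idx2.difference(idx1).to_frame(index=False)
print(only_in_df1)
print(only_in_df2)
```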

Answered by Shihe Zhang

With

left_df.merge(df,left_on=left_df.columns.tolist(),right_on=df.columns.tolist(),how='outer')

you can get the outer join result.
Similarly, you can get the inner join result. The diff between the two is what you want.
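
One possible way to take that diff is sketched below. The frames are made-up, and the concat plus drop_duplicates(keep=False) step is an assumption of this sketch, not part of the answer. It also assumes neither frame contains repeated rows, since keep=False would drop those as well.

```python
import pandas as pd

# Illustrative data (assumption): no repeated rows inside either frame.
left_df = pd.DataFrame({"Fruits": ["apple", "banana", "coconut"], "Quantity": [1, 2, 3]})
df = pd.DataFrame({"Fruits": ["apple", "banana", "durian"], "Quantity": [1, 3, 4]})

left_cols = left_df.columns.tolist()
right_cols = df.columns.tolist()

# Every row from either frame.
outer = left_df.merge(df, left_on=left_cols, right_on=right_cols, how="outer")
# Only the rows present in both frames.
inner = left_df.merge(df, left_on=left_cols, right_on=right_cols, how="inner")

# Rows shared by both joins appear twice after stacking; keep=False drops every copy,
# leaving the rows that exist in only one of the original frames.
diff = pd.concat([outer, inner]).drop_duplicates(keep=False)
print(diff)
```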
