Python: comparing two pandas DataFrames for differences

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/19917545/

Date: 2020-08-19 14:58:10  Source: igfitidea

Comparing two pandas dataframes for differences

Tags: python, python-2.7, pandas

Asked by Ryflex

I've got a script updating 5-10 columns' worth of data, but sometimes the start CSV will be identical to the end CSV, so instead of writing an identical CSV file I want it to do nothing...

How can I compare two dataframes to check if they're the same or not?

csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata

# ... do stuff with csvdata dataframe

if csvdata_old != csvdata:
    csvdata.to_csv('csvfile.csv', index=False)

Any ideas?

Accepted answer by Andy Hayden

You also need to be careful to create a copy of the DataFrame, otherwise csvdata_old will be updated along with csvdata (since it points to the same object):

csvdata_old = csvdata.copy()
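
A minimal sketch of the aliasing problem, using a small made-up frame in place of the CSV data (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for the data read from csvfile.csv
csvdata = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

alias = csvdata            # a second name for the same object
snapshot = csvdata.copy()  # an independent copy of the data

csvdata.loc[0, "a"] = 99   # modify the frame in place

print(alias is csvdata)      # True: the alias is the very same object
print(alias.loc[0, "a"])     # 99: the alias sees the change
print(snapshot.loc[0, "a"])  # 1: the copy keeps the old value
```

Comparing csvdata against the alias would therefore always report "no change", which is why the copy is needed.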

To check whether they are equal, you can use assert_frame_equal:

from pandas.testing import assert_frame_equal  # pandas.util.testing in old pandas versions
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

def frames_equal(df1, df2):  # the function name is only illustrative
    try:
        assert_frame_equal(df1, df2)
        return True
    except Exception:  # apparently AssertionError doesn't catch all
        return False

There was discussion of a better way...

Answer by Tristan Forward

This compares the values of two dataframes. Note that the number of rows and columns needs to be the same in both tables.

import numpy as np

comparison_array = table.values == expected_table.values
print(comparison_array)

>>>[[True, True, True]
    [True, False, True]]

if False in comparison_array:
    print("Not the same")

# Return the position of the False values
np.where(comparison_array == False)

>>>(array([1]), array([1]))

You could then use this index information to return the value that does not match between the tables. Since it's zero-indexed, the result refers to the 2nd row and the 2nd column, which is correct.
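
As a rough sketch of that last step, the positions returned by np.where can be used to look up the mismatching values. The two tables below are made up for illustration; only the technique comes from the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical tables with the same shape and one differing cell
table = pd.DataFrame({"x": [1, 4], "y": [2, 5], "z": [3, 6]})
expected_table = pd.DataFrame({"x": [1, 4], "y": [2, 0], "z": [3, 6]})

comparison_array = table.values == expected_table.values
rows, cols = np.where(~comparison_array)  # positions of the mismatches

for r, c in zip(rows, cols):
    # Report the column label and both values at each mismatching position
    print(f"row {r}, column {table.columns[c]!r}: "
          f"{table.iat[r, c]} != {expected_table.iat[r, c]}")
```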

Answer by sobes

Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
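
Applied to the original question, this could look roughly like the sketch below. The CSV content is a made-up stand-in held in memory so the example runs without touching a file; the real script would read and write csvfile.csv instead:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for csvfile.csv
csv_text = "a,b\n1,3\n2,4\n"

csvdata = pd.read_csv(io.StringIO(csv_text))
csvdata_old = csvdata.copy()  # keep an independent snapshot

# ... do stuff with csvdata; here one value is changed as an example
csvdata.loc[0, "a"] = 10

changed = not csvdata.equals(csvdata_old)
if changed:
    # In the real script: csvdata.to_csv('csvfile.csv', index=False)
    new_csv_text = csvdata.to_csv(index=False)
    print(new_csv_text)
```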

Answer by Surya

Check using df_1.equals(df_2), which returns True or False. Details below:

In [45]: import numpy as np

In [46]: import pandas as pd

In [47]: np.random.seed(5)

In [48]: df_1= pd.DataFrame(np.random.randn(3,3))

In [49]: df_1
Out[49]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [50]: np.random.seed(5)

In [51]: df_2= pd.DataFrame(np.random.randn(3,3))

In [52]: df_2
Out[52]: 
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603

In [53]: df_1.equals(df_2)
Out[53]: True


In [54]: df_3= pd.DataFrame(np.random.randn(3,3))

In [55]: df_3
Out[55]: 
          0         1         2
0 -0.329870 -1.192765 -0.204877
1 -0.358829  0.603472 -1.664789
2 -0.700179  1.151391  1.857331

In [56]: df_1.equals(df_3)
Out[56]: False

Answer by Dennis Golomazov

A more accurate comparison should check the index names separately, because DataFrame.equals does not test for them. All the other properties (index values (single/multi-index), values, columns, dtypes) are checked by it correctly.

df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')

df1.equals(df2)
True

df1.index.names == df2.index.names
False

Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.
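
The two checks can be combined in one small helper. This is only a sketch; the name frames_identical is made up for illustration and is not part of pandas:

```python
import pandas as pd

def frames_identical(left, right):
    """Hypothetical helper: DataFrame.equals plus a check of the index names."""
    same_names = list(left.index.names) == list(right.index.names)
    return left.equals(right) and same_names

df1 = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')

print(frames_identical(df1, df2))         # False: the index names differ
print(frames_identical(df1, df1.copy()))  # True
```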

Answer by Tom Chapin

Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.

def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)
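
A short usage sketch with made-up data (the two frames below are assumptions for illustration; the function is repeated so the example runs on its own):

```python
import pandas as pd

def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)

# Hypothetical frames: only the second row differs
source_df = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'c']})
new_df = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'x', 'c']})

diff = get_different_rows(source_df, new_df)
print(diff)  # the single changed row: num 2, name 'x'
```

Because the merge joins on all shared columns, a row counts as "different" as soon as any one of its values changes.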

Answer by alpha_989

In my case, I had a weird error whereby the DataFrames didn't match even though the indices, column names and values were the same. I tracked it down to the data types: it seems pandas can sometimes use different dtypes, resulting in such problems.

For example:

param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

If you check param1.dtypes and param2.dtypes, you will find that 'a' is of type object for param1 and of type int64 for param2. (Note: with this dict-of-lists constructor, recent pandas versions infer int64 for 'a' in both frames; object dtype typically appears when a single column holds mixed types, so check the dtypes on your own data.) Now, if you do some manipulation using a combination of param1 and param2, other parameters of the dataframe will deviate from the default ones.

So after the final dataframe is generated, even though the actual values that are printed out are the same, final_df1.equals(final_df2) may turn out to be False, because internal parameters such as Axis 1, ObjectBlock and IntBlock may not be the same.

An easy way to get around this and compare the values is to use

final_df1 == final_df2

However, this does an element-by-element comparison and returns a DataFrame of booleans rather than a single True or False, so it won't work if you use it to assert a statement, for example in pytest.

TL;DR

What works well is

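(final_df1 == final_df2).all().all()

This does an element-by-element comparison and reduces it to a single boolean, while ignoring the parameters that are not important for the comparison. (The built-in all(final_df1 == final_df2) iterates over the column labels rather than the values, so it does not actually test the data.)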
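
A small sketch of the dtype issue described in this answer, using two made-up frames whose values match but whose dtypes differ:

```python
import pandas as pd

# Hypothetical frames with equal values but different dtypes
final_df1 = pd.DataFrame({'a': [1, 2]})      # integer column
final_df2 = pd.DataFrame({'a': [1.0, 2.0]})  # float column

print(final_df1.equals(final_df2))           # False: the dtypes differ
print((final_df1 == final_df2).all().all())  # True: the values match
```

Note that the element-wise comparison treats NaN as unequal to NaN and requires both frames to carry identical row and column labels.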

TL;DR2

If your values and indices are the same but final_df1.equals(final_df2) is showing False, you can use final_df1._data and final_df2._data (a private attribute) to check the rest of the elements of the dataframes.

Answer by leerssej

To pull out the symmetric differences:

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

For example:

df1 = pd.DataFrame({
    'num': [1, 4, 3],
    'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
    'num': [1, 2, 3],
    'name': ['a', 'b', 'd'],
})

Will yield:

   num name
1    4    b
2    3    c
1    2    b
2    3    d

Note: on pandas versions that warn about how the sort argument will be set in the future, just add the sort=False argument to avoid the warning, as below:

df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)