Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/19917545/
Comparing two pandas dataframes for differences
Asked by Ryflex
I've got a script updating 5-10 columns' worth of data, but sometimes the start csv will be identical to the end csv, so instead of writing an identical csv file I want it to do nothing...
How can I compare two dataframes to check if they're the same or not?
csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata
# ... do stuff with csvdata dataframe
if csvdata_old != csvdata:
    csvdata.to_csv('csvfile.csv', index=False)
Any ideas?
Accepted answer by Andy Hayden
You also need to be careful to create a copy of the DataFrame; otherwise csvdata_old will be updated along with csvdata (since both names point to the same object):
csvdata_old = csvdata.copy()
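To see why the copy matters, here is a minimal sketch with a small made-up frame (the column name `price` and its values are illustrative assumptions, not taken from the question). Plain assignment only creates a second name for the same object, while `.copy()` takes an independent snapshot.

```python
import pandas as pd

# Hypothetical data standing in for the CSV contents
csvdata = pd.DataFrame({"price": [1.0, 2.0, 3.0]})

alias = csvdata            # second name for the same object
snapshot = csvdata.copy()  # independent copy of the data

# Modify the original in place
csvdata.loc[0, "price"] = 99.0

print(alias is csvdata)          # the alias is the very same object
print(alias.equals(csvdata))     # so it always matches the modified frame
print(snapshot.equals(csvdata))  # the snapshot still holds the old values
```

With the alias, any "old vs. new" comparison can never detect a change, which is the problem in the question's code.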
To check whether they are equal, you can use assert_frame_equal as in this answer:
from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated and no longer available in recent pandas
assert_frame_equal(csvdata, csvdata_old)
You can wrap this in a function with something like:
def frames_are_equal(df1, df2):  # illustrative function name
    try:
        assert_frame_equal(df1, df2)
        return True
    except Exception:  # apparently AssertionError doesn't catch all
        return False
There was discussion of a better way...
Answer by Tristan Forward
This compares the values of two dataframes. Note that the number of rows and columns needs to be the same in both tables.
comparison_array = table.values == expected_table.values
print(comparison_array)
>>>[[True, True, True]
    [True, False, True]]
if False in comparison_array:
    print("Not the same")
# Return the position of the False values
np.where(comparison_array == False)
>>>(array([1]), array([1]))
You could then use this index information to return the value that does not match between tables. Since the result is zero-indexed, it refers to the second element of the second row, which is correct.
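As a rough sketch of that last step, the positions returned by `np.where` can be used to look up the mismatching values in both tables. The two frames below are made-up stand-ins for `table` and `expected_table`, and the `mismatches` list is only an illustrative way to collect the results.

```python
import numpy as np
import pandas as pd

# Hypothetical tables with the same shape; one cell differs
table = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})
expected_table = pd.DataFrame({"a": [1, 4], "b": [2, 50], "c": [3, 6]})

comparison_array = table.values == expected_table.values
rows, cols = np.where(~comparison_array)  # positions where the values differ

# Pair each mismatching position with the value found in each table
mismatches = [
    (r, table.columns[c], table.iat[r, c], expected_table.iat[r, c])
    for r, c in zip(rows.tolist(), cols.tolist())
]
print(mismatches)
```

Each tuple holds the row position, the column label, the value in `table`, and the value in `expected_table`.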
Answer by sobes
Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
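Applied to the original question, `DataFrame.equals` could be used roughly as follows. This is only a sketch: `write_if_changed` is a hypothetical helper name, the data is made up, and a temporary directory stands in for the real file location. Keep in mind that round-tripping through CSV can change dtypes for some data, which would make `equals` report a difference.

```python
import os
import tempfile

import pandas as pd


def write_if_changed(new_df, path):
    """Write new_df to path only if it differs from the CSV already there.

    Hypothetical helper; returns True when the file was (re)written.
    """
    if os.path.exists(path):
        old_df = pd.read_csv(path)
        if old_df.equals(new_df):
            return False
    new_df.to_csv(path, index=False)
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "csvfile.csv")
    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    first = write_if_changed(df, path)    # no file yet, so it is written
    second = write_if_changed(df, path)   # identical content, so it is skipped
    df.loc[0, "a"] = 10
    third = write_if_changed(df, path)    # content changed, so it is rewritten
    print(first, second, third)
```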
Answer by Surya
Check using: df_1.equals(df_2)  # Returns True or False; details below
In [45]: import numpy as np
In [46]: import pandas as pd
In [47]: np.random.seed(5)
In [48]: df_1 = pd.DataFrame(np.random.randn(3,3))
In [49]: df_1
Out[49]:
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603
In [50]: np.random.seed(5)
In [51]: df_2 = pd.DataFrame(np.random.randn(3,3))
In [52]: df_2
Out[52]:
          0         1         2
0  0.441227 -0.330870  2.430771
1 -0.252092  0.109610  1.582481
2 -0.909232 -0.591637  0.187603
In [53]: df_1.equals(df_2)
Out[53]: True
In [54]: df_3 = pd.DataFrame(np.random.randn(3,3))
In [55]: df_3
Out[55]:
          0         1         2
0 -0.329870 -1.192765 -0.204877
1 -0.358829  0.603472 -1.664789
2 -0.700179  1.151391  1.857331
In [56]: df_1.equals(df_3)
Out[56]: False
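One detail that may matter when choosing between `equals` and `==`: `equals` treats NaN values in the same location as equal, whereas an elementwise `==` does not. A minimal sketch with made-up frames (`df_a` and `df_b` are illustrative names):

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"x": [1.0, np.nan], "y": ["u", "v"]})
df_b = df_a.copy()

equals_result = df_a.equals(df_b)                     # NaNs in the same place match
elementwise_result = bool((df_a == df_b).all().all())  # NaN == NaN is False elementwise

print(equals_result, elementwise_result)
```

So for data with missing values, `equals` is usually the more convenient check.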
Answer by Dennis Golomazov
A more accurate comparison should check index names separately, because DataFrame.equals does not test for them. It correctly checks all the other properties: index values (single or multi-index), values, columns, and dtypes.
df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')
df1.equals(df2)
True
df1.index.names == df2.index.names
False
Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.
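The two checks can be combined into one small function. This is a sketch only; `frames_identical` is a hypothetical name, and the frames are the same ones used in the example above.

```python
import pandas as pd


def frames_identical(left, right):
    """Hypothetical helper: DataFrame.equals plus an index-name check."""
    return left.equals(right) and list(left.index.names) == list(right.index.names)


df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name']).set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name']).set_index('another_name')

print(df1.equals(df2))                    # index names are not compared
print(frames_identical(df1, df2))         # the extra check catches the difference
print(frames_identical(df1, df1.copy()))  # a true copy still passes
```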
Answer by Tom Chapin
Not sure if this is helpful or not, but I whipped together this quick Python function for returning just the differences between two dataframes that both have the same columns and shape.
def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)
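A short usage sketch may help show what the function returns. The function is repeated so the block runs on its own, and the two input frames are made-up examples. Note that only rows present in the new frame are reported; rows that exist only in the source frame are not.

```python
import pandas as pd


def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)


# Hypothetical data: the last row differs in the 'name' column
source_df = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'c']})
new_df = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})

changed = get_different_rows(source_df, new_df)
print(changed)
```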
Answer by alpha_989
In my case, I had a weird error whereby, even though the indices, column names and values were the same, the DataFrames didn't match. I tracked it down to the data types: it seems pandas can sometimes use different datatypes, resulting in such problems.
For example:
param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
If you check param1.dtypes and param2.dtypes, you may find that 'a' has a different type in each (in my case, object for param1 and int64 for param2). Now, if you do some manipulation using a combination of param1 and param2, other internal parameters of the resulting dataframe will deviate from the default ones.
So after the final dataframe is generated, even though the printed values are the same, final_df1.equals(final_df2) may turn out to be False, because small internal details such as Axis 1, ObjectBlock, and IntBlock may not be the same.
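A minimal sketch of this kind of mismatch, using made-up frames whose values match but whose dtypes differ (int64 versus float64, chosen here purely for illustration):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

final_df1 = pd.DataFrame({"a": [1, 2]})      # int64 column
final_df2 = pd.DataFrame({"a": [1.0, 2.0]})  # float64 column

same_dtypes = final_df1.dtypes.equals(final_df2.dtypes)    # dtypes differ
strict_equal = final_df1.equals(final_df2)                  # so equals() is False
values_equal = bool((final_df1 == final_df2).all().all())   # values still match

print(same_dtypes, strict_equal, values_equal)

# assert_frame_equal can be told to ignore dtype differences
assert_frame_equal(final_df1, final_df2, check_dtype=False)
```

Comparing `dtypes` first is often the quickest way to find out why two frames that print identically are reported as different.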
An easy way to get around this and compare the values is to use final_df1 == final_df2.
However, this does an element-by-element comparison and returns a DataFrame of booleans rather than a single value, so it won't work if you are using it in an assert statement, for example in pytest.
TL;DR
What works well is (final_df1 == final_df2).all().all(). Note that the built-in all(final_df1 == final_df2) iterates over the column labels rather than the values, so it does not actually compare the data.
This does an element-by-element comparison while ignoring the internal details that are not important for the comparison.
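One caveat worth checking here: Python's built-in `all()` applied to a DataFrame iterates over its column labels, not its values, so it can report True even when the data differs. Reducing with `.all().all()` looks at the actual values. A small sketch with made-up frames:

```python
import pandas as pd

final_df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
final_df2 = pd.DataFrame({"a": [1, 2], "b": [3, 999]})  # last value differs

elementwise = final_df1 == final_df2

builtin_result = all(elementwise)              # iterates over the labels "a", "b"
pandas_result = bool(elementwise.all().all())  # reduces over the actual values

print(builtin_result, pandas_result)
```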
TL;DR2
If your values and indices are the same but final_df1.equals(final_df2) is showing False, you can use final_df1._data and final_df2._data (a private attribute, so it may change between pandas versions) to check the rest of the elements of the dataframes.
Answer by leerssej
To pull out the symmetric differences:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
For example:
df1 = pd.DataFrame({
    'num': [1, 4, 3],
    'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
    'num': [1, 2, 3],
    'name': ['a', 'b', 'd'],
})
Will yield:
   num name
1    4    b
2    3    c
1    2    b
2    3    d
Note: in pandas versions that warn about how the sort argument will be set in the future, just add the sort=False argument to avoid the warning, as below:
df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
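If it helps to know which frame each differing row came from, one possible variation is to pass `keys=` to `pd.concat`, which adds an outer index level. The labels `'df1'` and `'df2'` are arbitrary choices for this sketch. Also note that `drop_duplicates(keep=False)` removes rows that are duplicated within a single frame as well, so this approach assumes each frame has no repeated rows of its own.

```python
import pandas as pd

df1 = pd.DataFrame({'num': [1, 4, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})

# keys= adds an outer index level recording which frame each row came from;
# drop_duplicates only looks at the column values, not at the index
df_diff = pd.concat([df1, df2], keys=['df1', 'df2']).drop_duplicates(keep=False)
print(df_diff)
```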