Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/14224172/

Time: 2020-08-18 10:48:30  Source: igfitidea

Equality in Pandas DataFrames - Column Order Matters?

Tags: python, pandas

Asked by jcrudy

As part of a unit test, I need to test two DataFrames for equality. The order of the columns in the DataFrames is not important to me. However, it seems to matter to Pandas:

import pandas
df1 = pandas.DataFrame(index = [1,2,3,4])
df2 = pandas.DataFrame(index = [1,2,3,4])
df1['A'] = [1,2,3,4]
df1['B'] = [2,3,4,5]
df2['B'] = [2,3,4,5]
df2['A'] = [1,2,3,4]
df1 == df2

Results in:

Exception: Can only compare identically-labeled DataFrame objects

I believe the expression df1 == df2 should evaluate to a DataFrame containing all True values. Obviously it's debatable what the correct behavior of == should be in this context. My question is: is there a Pandas method that does what I want? That is, is there a way to do an equality comparison that ignores column order?

Accepted answer by Andy Hayden

You could sort the columns using sort_index:

df1.sort_index(axis=1) == df2.sort_index(axis=1)

This will evaluate to a DataFrame of all True values.
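
As a minimal sketch of the sort_index approach, the snippet below rebuilds the question's frames, compares them after sorting the columns, and collapses the result to a single bool with .all().all(). The variable names (elementwise, same, df1_nan, and so on) are illustrative only. It also shows the NaN limitation: NaN never compares equal to NaN, so frames with a missing value in the same cell still compare unequal.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({"B": [2, 3, 4, 5], "A": [1, 2, 3, 4]}, index=[1, 2, 3, 4])

# Element-wise comparison once both frames have the same column order.
elementwise = df1.sort_index(axis=1) == df2.sort_index(axis=1)

# Collapse the boolean frame: first per column, then across columns.
same = bool(elementwise.all().all())
print(same)  # True

# Put a NaN in the same cell of both frames.
df1_nan = df1.astype(float)
df2_nan = df2.astype(float)
df1_nan.loc[1, "A"] = np.nan
df2_nan.loc[1, "A"] = np.nan

# NaN != NaN, so the element-wise result contains a False.
same_with_nan = bool(
    (df1_nan.sort_index(axis=1) == df2_nan.sort_index(axis=1)).all().all()
)
print(same_with_nan)  # False
```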

As @osa comments, this fails for NaNs and isn't particularly robust either. In practice, something similar to @quant's answer is probably recommended (note: we want a bool rather than an exception if there's an issue):

def my_equal(df1, df2):
    # pandas.testing replaces the older pandas.util.testing module
    from pandas.testing import assert_frame_equal
    try:
        assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True)
        return True
    except (AssertionError, ValueError, TypeError):  # perhaps something else?
        return False
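
A short usage sketch of the same idea, assuming a pandas version that provides pandas.testing. The helper is restated under the hypothetical name frames_match so the snippet is self-contained. Unlike the == comparison, assert_frame_equal treats NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(df1, df2):
    """Return True if the frames are equal, ignoring column order."""
    try:
        assert_frame_equal(
            df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True
        )
        return True
    except (AssertionError, ValueError, TypeError):
        return False


left = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})
right = pd.DataFrame({"B": [3.0, 4.0], "A": [1.0, np.nan]})  # same data, columns swapped
different = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})  # NaN replaced by 2.0

print(frames_match(left, right))      # True
print(frames_match(left, different))  # False
```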

Answer by Quant

def equal( df1, df2 ):
    """ Check if two DataFrames are equal, ignoring nans """
    return df1.fillna(1).sort_index(axis=1).eq(df2.fillna(1).sort_index(axis=1)).all().all()
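
One caveat worth illustrating: because the function fills NaN with 1 before comparing, a missing value and a literal 1 are treated as the same cell. The sketch below shows this with made-up frames (with_nan, with_one) and, as one possible alternative rather than the answer author's method, uses DataFrame.equals, which considers NaNs in the same location equal without substituting a value.

```python
import numpy as np
import pandas as pd


def equal(df1, df2):
    """Check if two DataFrames are equal, ignoring nans (as in the answer)."""
    return df1.fillna(1).sort_index(axis=1).eq(df2.fillna(1).sort_index(axis=1)).all().all()


with_nan = pd.DataFrame({"A": [np.nan, 2.0], "B": [3.0, 4.0]})
with_one = pd.DataFrame({"B": [3.0, 4.0], "A": [1.0, 2.0]})

# NaN was replaced by 1, so it matches the literal 1.0 in the other frame.
fill_result = bool(equal(with_nan, with_one))
print(fill_result)  # True

# DataFrame.equals does not substitute a value for NaN.
equals_result = with_nan.sort_index(axis=1).equals(with_one.sort_index(axis=1))
print(equals_result)  # False

# NaNs in the same position still count as equal.
reordered = with_nan[["B", "A"]]
equals_same = with_nan.sort_index(axis=1).equals(reordered.sort_index(axis=1))
print(equals_same)  # True
```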

Answer by Quant

The most common intent is handled like this:

def assertFrameEqual(df1, df2, **kwds):
    """ Assert that two dataframes are equal, ignoring ordering of columns"""
    # pandas.testing replaces the older pandas.util.testing module
    from pandas.testing import assert_frame_equal
    return assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True, **kwds)

Of course, see assert_frame_equal (pandas.testing.assert_frame_equal in current pandas, formerly pandas.util.testing.assert_frame_equal) for other parameters you can pass.
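
As an illustration of passing extra parameters through, the sketch below forwards check_dtype=False, one of the documented assert_frame_equal options. The wrapper name assert_frame_equal_any_column_order and the sample frames are hypothetical.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_frame_equal_any_column_order(df1, df2, **kwds):
    """Assert equality after sorting columns; extra options go to assert_frame_equal."""
    assert_frame_equal(
        df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True, **kwds
    )


ints = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
floats = pd.DataFrame({"B": [3.0, 4.0], "A": [1.0, 2.0]})

# By default the dtype difference (int64 vs float64) fails the assertion.
try:
    assert_frame_equal_any_column_order(ints, floats)
    strict_passed = True
except AssertionError:
    strict_passed = False

# With check_dtype=False only the values are compared, so this does not raise.
assert_frame_equal_any_column_order(ints, floats, check_dtype=False)
relaxed_passed = True
```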

Answer by Srijith Sreedharan

Sorting the columns only works if the row and column labels match across the frames. If you have two DataFrames with identical cell values but different labels, the sort solution will not work. I ran into this scenario when implementing k-modes clustering using pandas.

I got around it with a simple equals function that checks cell equality (code below):

import pandas as pd

def frames_equal(df1, df2):
    if not isinstance(df1, pd.DataFrame) or not isinstance(df2, pd.DataFrame):
        raise TypeError("dataframes should be an instance of pandas.DataFrame")

    if df1.shape != df2.shape:
        return False

    num_rows, num_cols = df1.shape
    for i in range(num_rows):
        # Compare raw values by position; comparing the row Series directly
        # would raise when the column labels differ.
        match = (df1.iloc[i].to_numpy() == df2.iloc[i].to_numpy()).sum()
        if match != num_cols:
            return False
    return True
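
A vectorized variant of the same positional check can be sketched with NumPy. This is an alternative under the assumption that only cell values matter and labels should be ignored. The helper name values_equal and the sample frames are made up for illustration. Note that NaN cells compare unequal here.

```python
import numpy as np
import pandas as pd


def values_equal(df1, df2):
    """Compare cell values by position only, ignoring row and column labels."""
    # np.array_equal returns False for different shapes or any differing cell.
    return np.array_equal(df1.to_numpy(), df2.to_numpy())


a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
b = pd.DataFrame({"p": [1, 2], "q": [3, 4]}, index=[10, 20])  # same values, other labels
c = pd.DataFrame({"p": [1, 2], "q": [3, 5]}, index=[10, 20])  # one cell differs

print(values_equal(a, b))  # True
print(values_equal(a, c))  # False
```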

Answer by ccook5760

Have you tried using df1.equals(df2)? I think it's more reliable than df1 == df2, though I'm not sure whether it will resolve your issue with column order.

你试过使用 df1.equals(df2) 吗?我认为 df1==df2 更可靠,但我不确定它是否会解决您的列顺序问题。

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html
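
A quick check with the question's frames suggests that equals on its own is sensitive to column order, since it compares the column labels as well as the values. Reordering one frame's columns to match the other first is one way around that. This is a sketch, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({"B": [2, 3, 4, 5], "A": [1, 2, 3, 4]}, index=[1, 2, 3, 4])

# Column labels are compared in order, so differently ordered frames are not equal.
direct = df1.equals(df2)
print(direct)  # False

# Select df2's columns in df1's order before comparing.
aligned = df1.equals(df2[df1.columns])
print(aligned)  # True
```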

Answer by Murray Lynch

Usually you're going to want speedy tests and the sorting method can be brutally inefficient for larger indices (like if you were using rows instead of columns for this problem). The sort method is also susceptible to false negatives on non-unique indices.

Fortunately, pandas.util.testing.assert_frame_equal (pandas.testing.assert_frame_equal in current pandas) has since been updated with a check_like option. Set this to True and the ordering will not be considered in the test.

With non-unique indices, you'll get the cryptic ValueError: cannot reindex from a duplicate axis. This is raised by the under-the-hood reindex_like operation that rearranges one of the DataFrames to match the other's order. Reindexing is much faster than sorting, as evidenced below.

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal  # pandas.util.testing in older pandas

# One column of a million values, with the rows shuffled two different ways
df  = pd.DataFrame(np.arange(1e6))
df1 = df.sample(frac=1, random_state=42)
df2 = df.sample(frac=1, random_state=43)

# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit -n 1 -r 5 assert_frame_equal(df1.sort_index(), df2.sort_index())
## 5.73 s ± 329 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

%timeit -n 1 -r 5 assert_frame_equal(df1, df2, check_like=True)
## 1.04 s ± 237 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

For those who enjoy a good performance comparison plot:

[Figure: reindexing vs. sorting on int and str indices (the difference is even more drastic for str)]
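
For a small, self-contained illustration of the check_like option described in this answer, the sketch below compares a frame with a copy whose rows and columns are both reordered. The frame contents and the strict_equal and like_equal flags are illustrative only.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])
# Same data with rows and columns in a different order.
df2 = df1.loc[["z", "x", "y"], ["B", "A"]]

# The default comparison is order-sensitive and raises AssertionError.
try:
    assert_frame_equal(df1, df2)
    strict_equal = True
except AssertionError:
    strict_equal = False

# check_like=True ignores the order of the index and the columns.
assert_frame_equal(df1, df2, check_like=True)
like_equal = True

print(strict_equal, like_equal)  # False True
```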

Answer by Vito

When working with DataFrames containing Python objects such as tuples and lists, df.eq(df2) and df == df2 will not suffice. Even if the same cells in each DataFrame contain the same object, such as (0, 0), the equality comparison can evaluate to False. To get around this, convert all columns to strings before comparing:

df.apply(lambda x: x.astype(str)).eq(df2.apply(lambda x: x.astype(str)))
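
A small sketch of the string-conversion approach with made-up data (the pos and tags columns are hypothetical). Keep in mind that comparing string representations also makes values like 1 and "1" look identical, so this is best treated as a workaround for object columns rather than a general equality test.

```python
import pandas as pd

df = pd.DataFrame({"pos": [(0, 0), (1, 2)], "tags": [["a"], ["b", "c"]]})
df2 = pd.DataFrame({"pos": [(0, 0), (1, 2)], "tags": [["a"], ["b", "c"]]})

# Convert every column to its string representation, e.g. "(0, 0)" or "['a']".
as_str = df.apply(lambda col: col.astype(str))
as_str2 = df2.apply(lambda col: col.astype(str))

# Element-wise comparison of the strings, collapsed to a single bool.
same = bool(as_str.eq(as_str2).all().all())
print(same)  # True
```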

Answer by Hyyudu

You may need a function that compares DataFrames while ignoring both row and column order. The only requirement is a column with unique values that can be used as the index.

import pandas as pd

f1 = pd.DataFrame([
    {"id": 1, "foo": "1", "bar": None},
    {"id": 2, "foo": "2", "bar": 2},
    {"id": 3, "foo": "3", "bar": 3},
    {"id": 4, "foo": "4", "bar": 4}
])
f2 = pd.DataFrame([
    {"id": 3, "foo": "3", "bar": 3},
    {"id": 1, "bar": None, "foo": "1",},
    {"id": 2, "foo": "2", "bar": 2},
    {"id": 4, "foo": "4", "bar": 4}
])

def comparable(df, index_col='id'):
    return df.fillna(value=0).set_index(index_col).to_dict('index')

comparable(f1) == comparable(f2)  # returns True
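
Since fillna(value=0) makes a missing value indistinguishable from a real 0, an alternative sketch is to index both frames by the unique column and let assert_frame_equal with check_like=True ignore row and column order. The helper name same_records and the frames below are hypothetical, and this assumes a pandas version that provides pandas.testing.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

g1 = pd.DataFrame([
    {"id": 1, "foo": "1", "bar": None},
    {"id": 2, "foo": "2", "bar": 2},
])
# Same records, with rows and keys in a different order.
g2 = pd.DataFrame([
    {"id": 2, "bar": 2, "foo": "2"},
    {"id": 1, "foo": "1", "bar": None},
])
# Here the missing value is replaced by a real 0.
g3 = pd.DataFrame([
    {"id": 1, "foo": "1", "bar": 0},
    {"id": 2, "foo": "2", "bar": 2},
])


def same_records(left, right, index_col="id"):
    """Return True if both frames hold the same records, in any row/column order."""
    try:
        assert_frame_equal(
            left.set_index(index_col), right.set_index(index_col), check_like=True
        )
        return True
    except AssertionError:
        return False


print(same_records(g1, g2))  # True
print(same_records(g1, g3))  # False
```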