Pandas DataFrames with NaNs equality comparison
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/19322506/
Asked by Steve Pike
In the context of unit testing some functions, I'm trying to establish the equality of two DataFrames using Python pandas:
ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN
ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test the whole of expect against the whole of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
Accepted answer by Andy Hayden
You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise an AssertionError if they are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
def frames_equal(df, expected):  # helper name is illustrative
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False
In more recent versions of pandas this functionality has been added as .equals:
df.equals(expected)
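To make the difference between == and .equals concrete, here is a minimal sketch. The frame contents are made up for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Illustrative frames: identical values, NaNs in the same positions
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = df.copy()

# Elementwise ==: every NaN position compares as False
all_equal_naive = bool((df == expected).values.all())

# .equals treats NaNs in the same location as equal
all_equal_nan_aware = df.equals(expected)

print(all_equal_naive)      # False
print(all_equal_nan_aware)  # True
```

In a unit test, assert_frame_equal is usually the more helpful choice because its error message reports where the frames differ, while .equals only returns a bool.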
Answer by Phillip Cloud
One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
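The expression above can be wrapped in a small helper. This is only a sketch: the name nan_equal and the sample frames are assumptions for illustration, and it relies on both frames having identical labels.

```python
import numpy as np
import pandas as pd

def nan_equal(a, b):
    """Elementwise equality that treats NaN in the same position as equal.

    Hypothetical helper; a != a is True only where a is NaN.
    """
    return (a == b) | ((a != a) & (b != b))

left = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
right = pd.DataFrame([[np.nan, 1.0], [np.nan, np.nan]])

mask = nan_equal(left, right)
print(mask.values.tolist())  # [[True, True], [False, True]]
```

Only the cell where left holds 2.0 and right holds NaN comes out False; the positions where both are NaN count as equal.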
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. It was taking so long that I had to interrupt it.
In [1]: df = DataFrame(rand(int(1e7), 15))
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
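If a single bool is what you want, the timed expression can be packaged roughly as follows. This is a sketch under assumptions: the function name is made up, and the label check is an addition (not part of the answer) because == raises a ValueError when the two frames are not identically labeled.

```python
import numpy as np
import pandas as pd

def frames_equal_with_nans(a, b):
    """Return True if a and b have the same labels and the same values,
    counting NaNs in matching positions as equal. Hypothetical helper."""
    # Guard first: comparing differently-labeled frames with == raises
    if not (a.index.equals(b.index) and a.columns.equals(b.columns)):
        return False
    same = (a == b) | ((a != a) & (b != b))
    return bool(same.values.all())

df = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])

changed = df.copy()
changed.iloc[1, 0] = 5.0  # alter one non-NaN value

print(frames_equal_with_nans(df, df.copy()))      # True
print(frames_equal_with_nans(df, changed))        # False
print(frames_equal_with_nans(df, df.iloc[::-1]))  # False (row order differs)
```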
Answer by Jeff
Like @PhillipCloud's answer, but written out more fully:
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [30]: result
Out[30]:
      0     1
0  True  True
1  True  True
A NaN in df2 that doesn't exist in df1:
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [34]: result
Out[34]:
       0     1
0   True  True
1  False  True
You can also fill with a value you know not to be in the frame:
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
      0     1
0  True  True
1  True  True
Answer by Lydia
Any equality comparison using == with np.nan is False; even np.nan == np.nan is False.
Simply use df1.fillna('NULL') == df2.fillna('NULL'), provided 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: this is wrong, because the mask is also True wherever neither value is NaN: bools[pd.isnull(df1) == pd.isnull(df2)] = False)
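A small sketch of why the == mask in the note goes wrong, using made-up one-row frames: the mask is True both where the two cells are NaN and where neither is, so it also wipes out genuine differences.

```python
import numpy as np
import pandas as pd

# Illustrative data: first column really differs, second is NaN in both
df1 = pd.DataFrame([[1.0, np.nan]])
df2 = pd.DataFrame([[2.0, np.nan]])

# Wrong: the == mask covers every cell whose null-state agrees
wrong = (df1 != df2)
wrong[pd.isnull(df1) == pd.isnull(df2)] = False

# Right: the & mask covers only cells that are NaN in both frames
right = (df1 != df2)
right[pd.isnull(df1) & pd.isnull(df2)] = False

print(wrong.values.tolist())  # [[False, False]] -> the 1.0 vs 2.0 mismatch is hidden
print(right.values.tolist())  # [[True, False]]
```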
Answer by stephentgrammer
df.fillna(0) == df2.fillna(0)
You can use fillna(). Documentation here.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails on the NaN cell!
print(df == df2)
# all is well
print(df.fillna(0) == df2.fillna(0))
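One caveat worth illustrating: filling with 0 only works if 0 cannot occur as real data. The sketch below uses invented frames to show a false positive when one frame holds a genuine 0 where the other holds NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data: df has NaN where df2 has a real 0.0
df = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, 0.0]})

# After filling, the NaN and the real 0.0 become indistinguishable
filled_equal = bool((df.fillna(0) == df2.fillna(0)).values.all())

# .equals only treats NaN as equal to NaN in the same position
nan_aware_equal = df.equals(df2)

print(filled_equal)     # True  (false positive)
print(nan_aware_equal)  # False
```

If you go the fillna route, pick a fill value that you have checked is absent from both frames.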