pandas: Python comparison ignoring nan

Disclaimer: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/48452933/

Time: 2020-09-14 05:05:40  Source: igfitidea

Python comparison ignoring nan

Tags: python, python-2.7, pandas, nan, equality

Asked by sds

While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.
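
A quick check of both claims, as a minimal sketch. The two frames and their column names are made up for illustration and are not from the original question:

```python
import numpy as np
import pandas as pd

nan = float("nan")
print(nan == nan)  # False

# Two illustrative frames with NaNs in the same locations
df1 = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})
df2 = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# Element-wise == follows the usual float rules, so NaN positions compare False
elementwise_all_equal = bool((df1 == df2).all().all())
print(elementwise_all_equal)  # False

# DataFrame.equals treats NaNs in the same location as equal
frames_equal = df1.equals(df2)
print(frames_equal)  # True
```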

Of course, I can write

import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this will fail on containers like [float("nan")], and isnan barfs on non-numbers (so the complexity increases).
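
To make the two failure modes concrete, here is a small sketch. The wrapper try_equalp is a hypothetical helper added only to capture the exception; it is not part of the original question:

```python
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

def try_equalp(x, y):
    """Hypothetical helper: return equalp's result, or the exception name if it raises."""
    try:
        return equalp(x, y)
    except TypeError as exc:
        return type(exc).__name__

# Scalars work as intended
scalar_result = try_equalp(float("nan"), float("nan"))

# Two lists holding distinct nan objects compare unequal, then isnan rejects a list
list_result = try_equalp([float("nan")], [float("nan")])

# Unequal non-numbers also reach isnan, which rejects a str
string_result = try_equalp("a", "b")

print(scalar_result, list_result, string_result)  # True TypeError TypeError
```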

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare the dicts element-wise.

PPS. When I say "compare", I am thinking diff, not equalp.
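
For illustration only, the kind of nan-aware dict diff meant here could look roughly like the sketch below. It assumes Python 3, handles only dicts, lists, and tuples as containers, and the names is_nan, nan_equal, dict_diff, and _MISSING are all hypothetical. np.float64 passes the isinstance check because it subclasses float; other numeric types such as np.float32 would need extra handling.

```python
import math

_MISSING = object()  # hypothetical sentinel for a key present on one side only

def is_nan(value):
    """True only for float NaN; non-floats are never treated as NaN here."""
    return isinstance(value, float) and math.isnan(value)

def nan_equal(x, y):
    """Recursive equality that treats NaNs in the same position as equal."""
    if is_nan(x) and is_nan(y):
        return True
    if isinstance(x, dict) and isinstance(y, dict):
        return x.keys() == y.keys() and all(nan_equal(x[k], y[k]) for k in x)
    if isinstance(x, (list, tuple)) and isinstance(y, (list, tuple)):
        return len(x) == len(y) and all(nan_equal(a, b) for a, b in zip(x, y))
    return x == y

def dict_diff(d1, d2):
    """Map each differing key to its (left, right) pair of values."""
    diff = {}
    for key in d1.keys() | d2.keys():
        left = d1.get(key, _MISSING)
        right = d2.get(key, _MISSING)
        if left is _MISSING or right is _MISSING or not nan_equal(left, right):
            diff[key] = (left, right)
    return diff

nan = float("nan")
row_a = {"c0": nan, "c1": 6.0, "c2": [1.0, nan]}
row_b = {"c0": nan, "c1": 9.0, "c2": [1.0, nan]}
diff = dict_diff(row_a, row_b)
print(diff)  # {'c1': (6.0, 9.0)}
```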

Accepted answer by juanpa.arrivillaga

Suppose you have a data-frame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

And you want to compare rows, say, rows 0 and 8. Then just use fillna and do a vectorized comparison:

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
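
One caveat with fillna(0): if 0 is also a legitimate value in the data, a NaN on one side and a real 0 on the other will be reported as equal. A possible variant, sketched below on a small made-up frame (not the one above), compares the raw values and then excludes only the positions where both rows are NaN:

```python
import numpy as np
import pandas as pd

# Illustrative frame: c1 pairs a real 0.0 with a NaN, c3 is NaN in both rows
df = pd.DataFrame({
    "c0": [np.nan, 3.0],
    "c1": [0.0, np.nan],
    "c2": [14.0, 14.0],
    "c3": [np.nan, np.nan],
})
row_a = df.iloc[0, :]
row_b = df.iloc[1, :]

# fillna(0) approach: the 0.0-vs-NaN difference in c1 is hidden
differs_fillna = row_a.fillna(0) != row_b.fillna(0)
fillna_cols = list(df.columns[differs_fillna.to_numpy()])

# Mask approach: a difference counts unless both sides are NaN
both_nan = row_a.isna() & row_b.isna()
differs_mask = (row_a != row_b) & ~both_nan
mask_cols = list(df.columns[differs_mask.to_numpy()])

print(fillna_cols)  # ['c0']
print(mask_cols)    # ['c0', 'c1']
```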

Answer by ascripter

I assume you have array data or can at least convert it to a numpy array?

One way is to mask all the nans using a numpy.ma array and then compare the arrays. So your starting situation would be something like this:

import numpy as np
import numpy.ma as ma
arr1 = ma.array([3,4,6,np.nan,2])
arr2 = ma.array([3,4,6,np.nan,2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
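
As a possible shortcut (a sketch, not part of the original answer), numpy.ma.masked_invalid masks NaN and inf values in one call, and numpy.array_equal accepts equal_nan=True in NumPy 1.19 and later. The arrays a, b, and c below are illustrative:

```python
import numpy as np
import numpy.ma as ma

a = np.array([3, 4, 6, np.nan, 2])
b = np.array([3, 4, 6, np.nan, 2])
c = np.array([3, 4, 6, 5, 2])  # a real number where a has NaN

# Mask invalid entries (NaN, inf) on both sides, then compare
masked_equal = bool(ma.all(ma.masked_invalid(a) == ma.masked_invalid(b)))

# Plain arrays: NaNs must sit at the same positions to count as equal
plain_equal = bool(np.array_equal(a, b, equal_nan=True))
plain_equal_vs_c = bool(np.array_equal(a, c, equal_nan=True))

print(masked_equal, plain_equal, plain_equal_vs_c)  # True True False
```

Note that ma.all skips masked positions, so with the masking approach a NaN that appears in only one of the two arrays is ignored rather than reported as a difference, whereas array_equal with equal_nan=True requires the NaNs to line up.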