pandas: comparing Python objects while ignoring NaN

Question

Asked by sds

While nan == nan is always False, in many cases people want to treat NaNs as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.

Of course, I can write

import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this fails on containers like [float("nan")], and isnan barfs on non-numbers (so the complexity increases).
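
To make the two failure modes concrete, here is a small sketch that exercises equalp on scalars, on a list, and on strings. The variable names (scalar_ok, list_error, str_error) are only illustrative and not part of the original question.

```python
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

# Two NaN scalars: == is False, but both isnan checks pass.
scalar_ok = equalp(float("nan"), float("nan"))

# Two lists that each hold their own NaN compare unequal,
# so the check falls through to math.isnan, which rejects a list.
try:
    equalp([float("nan")], [float("nan")])
    list_error = None
except TypeError as exc:
    list_error = exc

# Unequal non-numbers hit the same problem.
try:
    equalp("a", "b")
    str_error = None
except TypeError as exc:
    str_error = exc

print(scalar_ok)
print(type(list_error).__name__)
print(type(str_error).__name__)
```

Both container and non-number inputs end in a TypeError rather than a clean False, which is why a scalar-only helper does not extend to nested objects.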

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare the dicts element-wise.

PPS. When I say "compare", I am thinking diff, not equalp.

Answer 1

Accepted answer by juanpa.arrivillaga

Suppose you have a DataFrame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

And you want to compare rows, say, rows 0 and 8. Then just use fillna and do a vectorized comparison:

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
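
One caveat with the fillna(0) approach: a real 0.0 in one row and a NaN in the other are reported as equal, because both become 0 after filling. A possible alternative, sketched below under the assumption that only positions where both rows are NaN should count as equal, builds the mask explicitly instead of using a fill value. The two small Series (row_a, row_b) are made-up stand-ins for df.iloc[i, :], not data from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df.iloc[0, :] and df.iloc[8, :].
cols = ["c0", "c1", "c2", "c3"]
row_a = pd.Series([np.nan, 6.0, 14.0, 0.0], index=cols)
row_b = pd.Series([np.nan, 9.0, 14.0, np.nan], index=cols)

# NaN != NaN is True, so cancel out positions where both sides are NaN.
both_nan = row_a.isna() & row_b.isna()
differs = (row_a != row_b) & ~both_nan

# Columns that differ under the NaN-aware rule.
changed = row_a.index[differs.to_numpy()].tolist()

# Same question answered with the fillna(0) sentinel, for comparison.
filled = row_a.fillna(0) != row_b.fillna(0)
fillna_changed = row_a.index[filled.to_numpy()].tolist()

print(changed)
print(fillna_changed)
```

Here c3 (0.0 versus NaN) shows up as a difference with the explicit mask but is missed by the sentinel version. If the data can never contain the fill value, the two approaches agree.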

Answer 2

Answer by ascripter

I assume you have array-data or can at least convert to a numpy array?

One way is to mask all the NaNs using a numpy.ma array and then compare the arrays. Your starting situation would look something like this:

import numpy as np
import numpy.ma as ma
arr1 = ma.array([3,4,6,np.nan,2])
arr2 = ma.array([3,4,6,np.nan,2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
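
As a possibly shorter variant of the same idea, numpy.ma.masked_invalid builds the masked array and masks the NaNs (and infs) in one step. The sketch below also shows np.array_equal with equal_nan=True, which is a different technique from masking and assumes NumPy 1.19 or newer. The third array, arr3, is added here only to show a failing comparison.

```python
import numpy as np
import numpy.ma as ma

# masked_invalid masks NaN and inf entries while building the array.
arr1 = ma.masked_invalid([3, 4, 6, np.nan, 2])
arr2 = ma.masked_invalid([3, 4, 6, np.nan, 2])
arr3 = ma.masked_invalid([3, 4, 7, np.nan, 2])  # differs at index 2

# Masked positions are skipped by ma.all.
same = bool(ma.all(arr1 == arr2))
different = bool(ma.all(arr1 == arr3))

# Alternative without masking: treat NaNs in the same position as equal.
plain_equal = np.array_equal(
    np.array([3, 4, 6, np.nan, 2]),
    np.array([3, 4, 6, np.nan, 2]),
    equal_nan=True,
)

print(same, different, plain_equal)
```

Note that masking skips a position whenever either side is masked, so a NaN facing a real number is ignored rather than reported as a difference, whereas equal_nan=True only equates NaNs that sit in the same position.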
