pandas: Why does max() sometimes return nan and sometimes ignore it?

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/47788361/

Date: 2020-09-14 04:54:04  Source: igfitidea

Why does max() sometimes return nan and sometimes ignore it?

Tags: python, pandas, replace, missing-data

Asked by Cleb

This question is motivated by an answer I gave a while ago.

Let's say I have a dataframe like this

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c':[np.nan, 5, 34]})

     a     b     c
0  1.0   3.0   NaN
1  2.0   NaN   5.0
2  NaN  10.0  34.0

and I want to replace each NaN with the maximum of its row, I can do

df.apply(lambda row: row.fillna(row.max()), axis=1)

which gives me the desired output

      a     b     c
0   1.0   3.0   3.0
1   2.0   5.0   5.0
2  34.0  10.0  34.0

When I, however, use

df.apply(lambda row: row.fillna(max(row)), axis=1)

for some reason it is replaced correctly only in two of three cases:

     a     b     c
0  1.0   3.0   3.0
1  2.0   5.0   5.0
2  NaN  10.0  34.0

Indeed, if I check by hand

max(df.iloc[0, :])
max(df.iloc[1, :])
max(df.iloc[2, :])

Then it prints

3.0
5.0
nan

When doing

df.iloc[0, :].max()
df.iloc[1, :].max()
df.iloc[2, :].max()

it prints the expected values

3.0
5.0
34.0

My question is why max() fails in one of the three cases but not in all three. Why is NaN sometimes ignored and sometimes not?

Answered by BrenBarn

The reason is that max works by taking the first value as the "max seen so far", and then checking every other value to see if it is bigger than the max seen so far. But nan is defined so that ordering comparisons with it always return False; that is, nan > 1 is false but 1 > nan is also false.

So if you start with nan as the first value in the array, every subsequent comparison will check whether some_other_value > nan. This will always be false, so nan will retain its position as "max seen so far". On the other hand, if nan is not the first value, then when it is reached, the comparison nan > max_so_far will again be false. But in this case that means the current "max seen so far" (which is not nan) will remain the max seen so far, so the nan will always be discarded.
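The scan described above can be sketched in pure Python. This is a simplified model for illustration, not CPython's actual implementation, and naive_max is a hypothetical name introduced here.

```python
import math

def naive_max(values):
    """Simplified model of a running-maximum scan (hypothetical helper)."""
    iterator = iter(values)
    max_so_far = next(iterator)      # the first value seeds the running maximum
    for value in iterator:
        if value > max_so_far:       # always False when either side is nan
            max_so_far = value
    return max_so_far

nan = float("nan")
nan_last = naive_max([1.0, 3.0, nan])      # nan is reached late and discarded
nan_first = naive_max([nan, 10.0, 34.0])   # nan seeds the scan and is never replaced
print(nan_last, nan_first)
```

The two calls mirror rows 0 and 2 of the dataframe in the question: only the row that starts with nan keeps it.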

Answered by James Elderfield

In the first case you are using the max method of the pandas row (a Series), which is aware of how to handle numpy.nan.

In the second case you are using the builtin max function from Python. This is not aware of how to handle numpy.nan. Presumably this effect is due to the fact that any comparison (>, <, == etc.) of numpy.nan with a float leads to False. An obvious way to implement max would be to iterate over the iterable (the row in this case), check whether each value is larger than the maximum recorded so far, and store it as the new maximum if so. Since this larger-than comparison will always be False when one of the compared values is numpy.nan, whether the recorded maximum is the number you want or numpy.nan depends entirely on whether the first value is numpy.nan or not.
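A quick check of the comparison behaviour mentioned above, as a minimal sketch. It uses float("nan") from the standard library, which is the same kind of value as numpy.nan (a plain Python float); the variable names are only for illustration.

```python
nan = float("nan")

greater = nan > 1.0      # False
less = nan < 1.0         # False
equal = nan == nan       # False, nan is not even equal to itself
not_equal = nan != nan   # True, the one comparison that succeeds
print(greater, less, equal, not_equal)
```

Because both > and < are False, a scan that only asks "is this value larger?" can neither promote nan nor replace it.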

Answered by Thomas Kühn

This is due to the ordering of the elements in the list. First off, if you type

max([1, 2, np.nan])

The result is 2, while

max([np.nan, 2, 3])

gives np.nan. The reason for this is that the max function goes through the values in the list one by one with a comparison like this:

if a > b

Now, if we look at what we get when comparing to nan, np.nan > 2 and 1 > np.nan both give False, so in one case nan stays the running maximum and in the other it never becomes the running maximum.
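Since the result of the built-in max depends on where nan sits, one possible workaround is to drop nan values before calling it. The sketch below assumes the input contains only floats; max_ignoring_nan is a hypothetical helper name, not part of Python or pandas.

```python
import math

def max_ignoring_nan(values):
    """Built-in max over the non-nan values; nan if nothing is left (hypothetical helper)."""
    return max((v for v in values if not math.isnan(v)), default=float("nan"))

nan = float("nan")
front = max_ignoring_nan([nan, 2.0, 3.0])   # nan first
back = max_ignoring_nan([1.0, 2.0, nan])    # nan last
empty = max_ignoring_nan([nan, nan])        # nothing left after filtering
print(front, back, empty)
```

With the filter in place the position of nan no longer matters, and an all-nan input falls back to nan instead of raising ValueError.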

Answered by zyun

The two are different: max() vs. df.max().

max(): Python built-in function; its argument must be a non-empty iterable. Check here: https://docs.python.org/2/library/functions.html#max

df.max(skipna=..): the pandas DataFrame method has a parameter called skipna. Its default value is True, which means that NA/null values are excluded. Check here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html
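The effect of skipna can be seen on the third row of the dataframe from the question. This is a small sketch that assumes pandas and numpy are installed; the variable names are only for illustration.

```python
import math

import numpy as np
import pandas as pd

# Third row of the dataframe in the question
row = pd.Series([np.nan, 10.0, 34.0], index=["a", "b", "c"])

skipped = row.max()                  # skipna=True is the default, nan is excluded
propagated = row.max(skipna=False)   # nan is kept, so the result is nan
builtin = max(row)                   # built-in scan, seeded with the leading nan
print(skipped, propagated, builtin)
```

So row.max() and max(row) disagree only because the pandas method skips nan by default, while the built-in function has no such option.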