Original question: http://stackoverflow.com/questions/47788361/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Why does max() sometimes return nan and sometimes ignore it?
Asked by Cleb
This question is motivated by an answer I gave a while ago.
Let's say I have a dataframe like this
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c':[np.nan, 5, 34]})
     a     b     c
0  1.0   3.0   NaN
1  2.0   NaN   5.0
2  NaN  10.0  34.0
and I want to replace each NaN with the maximum of its row, I can do
df.apply(lambda row: row.fillna(row.max()), axis=1)
which gives me the desired output
      a     b     c
0   1.0   3.0   3.0
1   2.0   5.0   5.0
2  34.0  10.0  34.0
However, when I use
df.apply(lambda row: row.fillna(max(row)), axis=1)
for some reason the NaN is replaced correctly in only two of the three rows:
     a     b     c
0  1.0   3.0   3.0
1  2.0   5.0   5.0
2  NaN  10.0  34.0
Indeed, if I check by hand
max(df.iloc[0, :])
max(df.iloc[1, :])
max(df.iloc[2, :])
it prints
3.0
5.0
nan
When doing
df.iloc[0, :].max()
df.iloc[1, :].max()
df.iloc[2, :].max()
it prints the expected values
3.0
5.0
34.0
My question is why max() fails in one of the three cases but not in all three. Why are the NaN values sometimes ignored and sometimes not?
Answered by BrenBarn
The reason is that max works by taking the first value as the "max seen so far", and then checking each other value to see if it is bigger than the max seen so far. But nan is defined so that ordering comparisons with it always return False; that is, nan > 1 is false but 1 > nan is also false.
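The comparison behaviour described above can be checked directly. This is a small sketch using a plain Python float nan, which behaves like numpy.nan in comparisons; the dictionary name comparisons is only for illustration.

```python
nan = float("nan")  # a plain float NaN; numpy.nan compares the same way

# Evaluate a handful of comparisons involving nan.
comparisons = {
    "nan > 1": nan > 1,
    "1 > nan": 1 > nan,
    "nan < 1": nan < 1,
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
}

for expression, result in comparisons.items():
    print(f"{expression}: {result}")
```

Every ordering and equality test is False; only the inequality test nan != nan is True.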
So if you start with nan as the first value in the array, every subsequent comparison checks whether some_other_value > nan. This is always false, so nan retains its position as "max seen so far". On the other hand, if nan is not the first value, then when it is reached, the comparison nan > max_so_far is again false. In this case, that means the current "max seen so far" (which is not nan) remains the max seen so far, so the nan is always discarded.
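The running-maximum logic described above can be sketched in pure Python. The function naive_max is a hypothetical stand-in written for illustration, not the actual built-in (which is implemented in C), but it follows the same "replace only if item > best" rule.

```python
import math


def naive_max(values):
    """Illustrative running maximum: keep the first value, replace it only when item > best."""
    iterator = iter(values)
    try:
        best = next(iterator)  # the first value starts as the "max seen so far"
    except StopIteration:
        raise ValueError("naive_max() arg is an empty sequence")
    for item in iterator:
        if item > best:  # always False when either side is nan
            best = item
    return best


nan = float("nan")
leading = naive_max([nan, 10.0, 34.0])   # nan comes first and is never replaced
middle = naive_max([2.0, nan, 5.0])      # nan is skipped, 5.0 wins
trailing = naive_max([1.0, 3.0, nan])    # nan never replaces 3.0

print(leading, middle, trailing)
```

The three inputs mirror the three rows of the dataframe in the question, and only the row that starts with nan yields nan.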
Answered by James Elderfield
In the first case you are using the pandas Series.max method (row.max()), which is aware of how to handle numpy.nan and skips it by default.
In the second case you are using the builtin max function from Python, which is not aware of how to handle numpy.nan. Presumably this effect is due to the fact that any comparison (>, <, == etc.) of numpy.nan with a float evaluates to False. An obvious way to implement max would be to iterate over the iterable (the row in this case), check whether each value is larger than the maximum recorded so far, and store it as the new maximum if so. Since this greater-than comparison is always False when one of the compared values is numpy.nan, whether the recorded maximum is the number you want or numpy.nan depends entirely on whether the first value is numpy.nan or not.
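To make the difference between the two cases concrete, here is a short comparison on the problematic third row. It assumes NumPy and pandas are installed. Note that row.max() in the question is the pandas Series.max method; NumPy's own np.max propagates NaN, and np.nanmax is the NumPy function that ignores it.

```python
import numpy as np
import pandas as pd

# The third row of the dataframe from the question: NaN comes first.
row = pd.Series([np.nan, 10.0, 34.0], index=["a", "b", "c"])

series_max = row.max()                    # pandas method, skips NaN by default
numpy_max = np.max(row.to_numpy())        # NumPy propagates NaN
numpy_nanmax = np.nanmax(row.to_numpy())  # NumPy variant that ignores NaN
builtin_max = max(row)                    # Python builtin, order-dependent

print(series_max, numpy_max, numpy_nanmax, builtin_max)
```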
Answered by Thomas Kühn
This is due to the ordering of the elements in the list. First off, if you type
max([1, 2, np.nan])
the result is 2, while
max([np.nan, 2, 3])
gives np.nan. The reason for this is that the max function goes through the values in the list one by one with a comparison like this:
if a > b
Now, if we look at what we get when comparing to nan, both np.nan > 2 and 1 > np.nan give False. So a nan that appears later in the list never replaces the running maximum, while a nan in the first position becomes the running maximum and is never replaced.
Answered by zyun
The two are different: max() vs. df.max().
max(): Python built-in function; its argument must be a non-empty iterable. See: https://docs.python.org/2/library/functions.html#max
df.max(skipna=...): pandas DataFrame method; it has a parameter called skipna whose default value is True, which means NA/null values are excluded. See: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html
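As a rough illustration of the skipna parameter, the sketch below computes row-wise maxima on the dataframe from the question in three ways. It assumes NumPy and pandas are installed; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c': [np.nan, 5, 34]})

# Default behaviour: skipna=True, so NaN is excluded from each row's maximum.
row_max_skip = df.max(axis=1)

# With skipna=False, any NaN in a row makes that row's maximum NaN.
row_max_keep = df.max(axis=1, skipna=False)

# The Python builtin applied row by row: the result depends on where NaN sits.
builtin_row_max = [max(row) for _, row in df.iterrows()]

print(row_max_skip.tolist())
print(row_max_keep.tolist())
print(builtin_row_max)
```

Because every row here contains one NaN, skipna=False yields NaN for all three rows, whereas the builtin yields NaN only for the row that starts with NaN.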