Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/43882699/

Date: 2020-08-19 23:29:12

Differences between null and NaN in Spark? How to deal with them?

Tags: python, apache-spark, null, pyspark, nan

Asked by Ivan Lee

In my DataFrame, some columns contain null values and others contain NaN values, for example:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Is there any difference between the two? How should each be handled?

Answered by Shaido - Reinstate Monica

A null value represents "no value" or "nothing"; it is not even an empty string or zero. It can be used to indicate that nothing useful exists.

NaN stands for "Not a Number". It is usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.

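As a rough illustration of how NaN behaves, here is a minimal plain-Python sketch (no Spark needed). Plain Python raises ZeroDivisionError for 0.0/0.0, so the sketch assumes inf - inf as the meaningless operation instead; the variable names are only illustrative.

```python
import math

# inf - inf has no meaningful numeric result, so float arithmetic yields NaN.
result = math.inf - math.inf
print(result)                    # nan

# NaN is still a float, i.e. it belongs to a numeric type.
print(type(result).__name__)     # float

# NaN compares unequal to everything, including itself,
# so an equality check cannot be used to find it.
print(result == result)          # False

# A dedicated predicate is needed instead.
print(math.isnan(result))        # True
```

This is why filtering on NaN generally calls for a dedicated check such as isnan rather than an equality comparison.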

One possible way to handle null values is to drop the rows that contain them (in Spark, this also drops rows with NaN in float or double columns):

df.na.drop()

Alternatively, you can replace them with an actual value (0 in this example):

df.na.fill(0)

Another way is to select the rows where a specific column is null (or is not null) for further processing:

from pyspark.sql.functions import col
df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent method:

from pyspark.sql.functions import col, isnan
df.where(isnan(col("a")))

Answered by Damián Rafael Lattenero

You can identify NaN values with the isnan function, as in this example:

>>> from pyspark.sql.functions import isnan
>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))
>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference lies in the type of the object that produces the value. NaN ("not a number") is an old-fashioned way to represent "no value" for a number. Think of the set of all numbers (..., -2, -1, 0, 1, 2, ...): there is a need for one extra value to cover error cases such as 0.0/0.0. I want 0.0/0.0 to give me a number, but which number? Since no number fits, a new value called NaN was created, and it is itself of a numeric type.

None (shown as null in Spark) stands for the void, the absence of an element. It is even more abstract, because a numeric type contains the None value in addition to the NaN value. None is a possible value in every type.

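To make the distinction concrete, the following plain-Python sketch labels each cell of the rows from the question as null, NaN, or an ordinary value. The classify helper is hypothetical and exists only for this illustration; it is not part of PySpark.

```python
import math

def classify(value):
    """Hypothetical helper: label a cell as 'null', 'NaN', or 'value'."""
    if value is None:
        # None means the value is absent, whatever the column type is.
        return "null"
    if isinstance(value, float) and math.isnan(value):
        # NaN is a float that exists but does not represent a number.
        return "NaN"
    return "value"

# Same rows as in the question: (a, b)
rows = [(1, float("nan")), (None, 1.0)]
labels = [tuple(classify(cell) for cell in row) for row in rows]
print(labels)  # [('value', 'NaN'), ('null', 'value')]
```

The None check comes first because None can appear in a column of any type, whereas NaN can only appear where the value is a float.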