Difference between === null and isNull in Spark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/41533290/


Difference between === null and isNull in Spark DataFrame

Tags: sql, scala, apache-spark, dataframe, apache-spark-sql

Asked by John

I am a bit confused about the difference when we are using

 df.filter(col("c1") === null) and df.filter(col("c1").isNull) 

On the same dataframe I am getting counts with === null but zero counts with isNull. Please help me to understand the difference. Thanks

Answered by zero323

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
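
The same three-valued logic is visible in the DataFrame DSL. A minimal sketch (the data here is illustrative, and it assumes an active SparkSession named spark with its implicits imported):

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Illustrative two-row DataFrame with one null value
val df = Seq(Some("a"), Option.empty[String]).toDF("c1")

// Comparing against a null literal yields NULL for every row
// (the result column name may vary slightly between Spark versions)
df.select(col("c1") === lit(null)).show()
// +-----------+
// |(c1 = NULL)|
// +-----------+
// |       null|
// |       null|
// +-----------+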

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
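
For example, a small sketch with illustrative data (again assuming spark and its implicits are in scope):

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(Some("a"), Option.empty[String]).toDF("c1")

df.filter(col("c1").isNull).count()    // 1 -- keeps only the null row
df.filter(col("c1").isNotNull).count() // 1 -- keeps only the "a" row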

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and the DataFrame DSL respectively.
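
For completeness, a sketch of the negated (IS DISTINCT FROM) form in the DSL, built from not(_ <=> _); the printed column name is approximate:

import org.apache.spark.sql.functions.not
import spark.implicits._

spark.sql("SELECT NULL AS col1, TRUE AS col2")
  .select(not($"col1" <=> $"col2"))
  .show()
// +---------------------+
// |(NOT (col1 <=> col2))|
// +---------------------+
// |                 true|
// +---------------------+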

Related:

Including null values in an Apache Spark Join

Answered by mattinbits

Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()
  case class Record(c1: String, c2: String)
  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)
  import session.implicits._

  val df: DataFrame = rdd.toDF
  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]

This filter clause looks nonsensical. The && to null ensures this can never resolve to true: the c1 = NULL comparison has been folded to the literal null, and anything ANDed with null is never true, so no rows pass the filter.

The second explain plan looks like:

== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what we expect and want.
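
The practical takeaway, continuing the example above as a sketch (col and lit come from the functions._ import already in scope):

// Either of these keeps the row whose c1 is null
val byIsNull   = df.filter(col("c1").isNull)
val byNullSafe = df.filter(col("c1") <=> lit(null)) // NULL-safe equality
println(byIsNull.count())   // 1
println(byNullSafe.count()) // 1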