Difference between === null and isNull in Spark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/41533290/


Difference between === null and isNull in Spark DataFrame

Tags: sql, scala, apache-spark, dataframe, apache-spark-sql

Asked by John

I am a bit confused about the difference when we are using

 df.filter(col("c1") === null) and df.filter(col("c1").isNull) 

On the same dataframe I am getting counts with === null but zero counts with isNull. Please help me to understand the difference. Thanks

Answered by zero323

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
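
The same three-valued logic is visible in the DataFrame DSL. A minimal sketch (the data here is illustrative, and it assumes an active SparkSession named spark with its implicits imported):

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Illustrative two-row DataFrame with one null value
val df = Seq(Some("a"), Option.empty[String]).toDF("c1")

// Comparing against a null literal yields NULL for every row
// (the result column name may vary slightly between Spark versions)
df.select(col("c1") === lit(null)).show()
// +-----------+
// |(c1 = NULL)|
// +-----------+
// |       null|
// |       null|
// +-----------+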

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
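
For example, a small sketch with illustrative data (again assuming spark and its implicits are in scope):

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(Some("a"), Option.empty[String]).toDF("c1")

df.filter(col("c1").isNull).count()    // 1 -- keeps only the null row
df.filter(col("c1").isNotNull).count() // 1 -- keeps only the "a" row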

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and the DataFrame DSL respectively.
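
For completeness, a sketch of the negated (IS DISTINCT FROM) form in the DSL, built from not(_ <=> _); the printed column name is approximate:

import org.apache.spark.sql.functions.not
import spark.implicits._

spark.sql("SELECT NULL AS col1, TRUE AS col2")
  .select(not($"col1" <=> $"col2"))
  .show()
// +---------------------+
// |(NOT (col1 <=> col2))|
// +---------------------+
// |                 true|
// +---------------------+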

Related:

Including null values in an Apache Spark Join

Answered by mattinbits

Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()
  case class Record(c1: String, c2: String)
  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)
  import session.implicits._

  val df: DataFrame = rdd.toDF
  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]

This filter clause looks nonsensical. The && to null ensures this can never resolve to true: the c1 = NULL comparison has been folded to the literal null, and anything ANDed with null is never true, so no rows pass the filter.

The second explain plan looks like:

== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what we expect and want.
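
The practical takeaway, continuing the example above as a sketch (col and lit come from the functions._ import already in scope):

// Either of these keeps the row whose c1 is null
val byIsNull   = df.filter(col("c1").isNull)
val byNullSafe = df.filter(col("c1") <=> lit(null)) // NULL-safe equality
println(byIsNull.count())   // 1
println(byNullSafe.count()) // 1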