SQL Spark Dataframe 嵌套 Case When 语句

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/46640862/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 05:20:08  来源:igfitidea点击:

Spark Dataframe Nested Case When Statement

sqlapache-sparkdataframeapache-spark-sql

提问by RaAm

I need to implement the below SQL logic in Spark DataFrame

我需要在 Spark 中实现以下 SQL 逻辑 DataFrame

SELECT KEY,
    CASE WHEN tc in ('a','b') THEN 'Y'
         WHEN tc in ('a') AND amt > 0 THEN 'N'
         ELSE NULL END REASON,
FROM dataset1;

My input DataFrameis as below:

我的输入DataFrame如下:

val dataset1 = Seq((66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")).toDF("KEY", "tc", "amt")

dataset1.show()
+---+---+---+
|KEY| tc|amt|
+---+---+---+
| 66|  a|  4|
| 67|  a|  0|
| 70|  b|  4|
| 71|  d|  4|
+---+---+---+

I have implement the nested case when statement as:

我已经将嵌套 case when 语句实现为:

dataset1.withColumn("REASON", when(col("tc").isin("a", "b"), "Y")
  .otherwise(when(col("tc").equalTo("a") && col("amt").geq(0), "N")
    .otherwise(null))).show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     Y|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+

Readability of the above logic with "otherwise" statement is little messy if the nested when statements goes further.

如果嵌套的 when 语句更进一步,上述带有“otherwise”语句的逻辑的可读性就不会很混乱。

Is there any better way of implementing nested case when statements in Spark DataFrames?

有没有更好的方法在 Spark 中的语句时实现嵌套 case DataFrames

回答by zero323

There is no nesting here, therefore there is no need for otherwise. All you need is chained when:

这里没有嵌套,因此不需要otherwise. 所有你需要的是链接when

import spark.implicits._

when($"tc" isin ("a", "b"), "Y")
  .when($"tc" === "a" && $"amt" >= 0, "N")

ELSE NULLis implicit so you can omit it completely.

ELSE NULL是隐式的,所以你可以完全省略它。

Pattern you use, is more more applicable for foldingover a data structure:

您使用的模式更适用于folding数据结构:

val cases = Seq(
  ($"tc" isin ("a", "b"), "Y"),
  ($"tc" === "a" && $"amt" >= 0, "N")
)

where when- otherwisenaturally follows recursion pattern and nullprovides the base case.

where when-otherwise自然遵循递归模式并null提供基本情况。

cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}

Please note, that it is impossible to reach "N" outcome, with this chain of conditions. If tcis equal to "a" it will be captured by the first clause. If it is not, it will fail to satisfy both predicates and default to NULL. You should rather:

请注意,在这一系列条件下,不可能达到“N”结果。如果tc等于“a”,它将被第一个子句捕获。如果不是,它将无法同时满足谓词并默认为NULL。你应该:

when($"tc" === "a" && $"amt" >= 0, "N")
 .when($"tc" isin ("a", "b"), "Y")

回答by Raphael Roth

For more complex logic, I prefer to use UDFs for better readability:

对于更复杂的逻辑,我更喜欢使用 UDF 以获得更好的可读性:

val selectCase = udf((tc: String, amt: String) =>
  if (Seq("a", "b").contains(tc)) "Y"
  else if (tc == "a" && amt.toInt <= 0) "N"
  else null
)


dataset1.withColumn("REASON", selectCase(col("tc"), col("amt")))
  .show

回答by PoojanKothari

you can simply use selectExpr on your dataset

您可以简单地在数据集上使用 selectExpr

dataset1.selectExpr("*", "CASE WHEN tc in ('a') AND amt > 0 THEN 'N' WHEN tc in ('a','b') THEN 'Y' ELSE NULL END
REASON").show()

+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     N|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+

Second condition should be place before first one, as first condition is more generic one.

第二个条件应该放在第一个之前,因为第一个条件更通用。

WHEN tc in ('a') AND amt > 0 THEN 'N'

当 tc in ('a') AND amt > 0 THEN 'N'