SQL Spark Dataframe 嵌套 Case When 语句
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/46640862/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Spark Dataframe Nested Case When Statement
提问by RaAm
I need to implement the below SQL logic in Spark DataFrame
我需要在 Spark 中实现以下 SQL 逻辑 DataFrame
SELECT KEY,
CASE WHEN tc in ('a','b') THEN 'Y'
WHEN tc in ('a') AND amt > 0 THEN 'N'
ELSE NULL END REASON,
FROM dataset1;
My input DataFrame
is as below:
我的输入DataFrame
如下:
val dataset1 = Seq((66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")).toDF("KEY", "tc", "amt")
dataset1.show()
+---+---+---+
|KEY| tc|amt|
+---+---+---+
| 66| a| 4|
| 67| a| 0|
| 70| b| 4|
| 71| d| 4|
+---+---+---+
I have implement the nested case when statement as:
我已经将嵌套 case when 语句实现为:
dataset1.withColumn("REASON", when(col("tc").isin("a", "b"), "Y")
.otherwise(when(col("tc").equalTo("a") && col("amt").geq(0), "N")
.otherwise(null))).show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66| a| 4| Y|
| 67| a| 0| Y|
| 70| b| 4| Y|
| 71| d| 4| null|
+---+---+---+------+
Readability of the above logic with "otherwise" statement is little messy if the nested when statements goes further.
如果嵌套的 when 语句更进一步,上述带有“otherwise”语句的逻辑的可读性就不会很混乱。
Is there any better way of implementing nested case when statements in Spark DataFrames
?
有没有更好的方法在 Spark 中的语句时实现嵌套 case DataFrames
?
回答by zero323
There is no nesting here, therefore there is no need for otherwise
. All you need is chained when
:
这里没有嵌套,因此不需要otherwise
. 所有你需要的是链接when
:
import spark.implicits._
when($"tc" isin ("a", "b"), "Y")
.when($"tc" === "a" && $"amt" >= 0, "N")
ELSE NULL
is implicit so you can omit it completely.
ELSE NULL
是隐式的,所以你可以完全省略它。
Pattern you use, is more more applicable for folding
over a data structure:
您使用的模式更适用于folding
数据结构:
val cases = Seq(
($"tc" isin ("a", "b"), "Y"),
($"tc" === "a" && $"amt" >= 0, "N")
)
where when
- otherwise
naturally follows recursion pattern and null
provides the base case.
where when
-otherwise
自然遵循递归模式并null
提供基本情况。
cases.foldLeft(lit(null)) {
case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}
Please note, that it is impossible to reach "N" outcome, with this chain of conditions. If tc
is equal to "a" it will be captured by the first clause. If it is not, it will fail to satisfy both predicates and default to NULL
. You should rather:
请注意,在这一系列条件下,不可能达到“N”结果。如果tc
等于“a”,它将被第一个子句捕获。如果不是,它将无法同时满足谓词并默认为NULL
。你应该:
when($"tc" === "a" && $"amt" >= 0, "N")
.when($"tc" isin ("a", "b"), "Y")
回答by Raphael Roth
For more complex logic, I prefer to use UDFs for better readability:
对于更复杂的逻辑,我更喜欢使用 UDF 以获得更好的可读性:
val selectCase = udf((tc: String, amt: String) =>
if (Seq("a", "b").contains(tc)) "Y"
else if (tc == "a" && amt.toInt <= 0) "N"
else null
)
dataset1.withColumn("REASON", selectCase(col("tc"), col("amt")))
.show
回答by PoojanKothari
you can simply use selectExpr on your dataset
您可以简单地在数据集上使用 selectExpr
dataset1.selectExpr("*", "CASE WHEN tc in ('a') AND amt > 0 THEN 'N' WHEN tc in ('a','b') THEN 'Y' ELSE NULL END
REASON").show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66| a| 4| N|
| 67| a| 0| Y|
| 70| b| 4| Y|
| 71| d| 4| null|
+---+---+---+------+
Second condition should be place before first one, as first condition is more generic one.
第二个条件应该放在第一个之前,因为第一个条件更通用。
WHEN tc in ('a') AND amt > 0 THEN 'N'
当 tc in ('a') AND amt > 0 THEN 'N'