Scala Dataframe null check for columns
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40500732/
Scala Dataframe null check for columns
Asked by Subhod Lagade
val new_df = df.filter($"type_interne" !== "" || $"type_interne" !== "null")
gives me the error: value || is not a member of String.
When I use ===, the filter works well:
val new_df = df.filter($"type_interne" === "" || $"type_interne" === "null")
Answered by Raphael Roth
The problem seems to be operator precedence; try using parentheses:
val new_df = df.filter(($"type_interne" !== "") || ($"type_interne" !== null))
You can also write it like this:
val new_df = df.filter(($"type_interne" !== "") or $"type_interne".isNotNull)
Answered by Nikita
Though Raphael's answer was fully correct at the time of writing, Spark keeps evolving: the operator !== has been deprecated since version 2.0, but you can use =!=, which solves the precedence problem above without parentheses. See the corresponding comments in the source code:
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320
Detailed answer:
I'd also like to note something which was not obvious to me in the beginning.
There are the notions of DataFrame (DF) and Dataset (DS), which also divide their usage in the above context into:
1) string expressions interpreted by Catalyst (errors are caught only at run time), available for both DF and DS:
case class NullStrings(n: Int, s: String)
val df = spark.sparkContext.parallelize(Seq(
(1, "abc"),
(2, "ABC"),
(3, null),
(4, ""))
).toDF("n", "s")
df.filter("s is not null and s != ''").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
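To illustrate the run-time-only checking mentioned in 1), a sketch under the same assumed df: a typo inside the string expression compiles fine and only surfaces as an AnalysisException when Spark analyzes the query (the column name "x" below is deliberately wrong):

```scala
// Assumes the same `df` as above (columns "n" and "s") and an active SparkSession.
// The misspelled column "x" is not caught by the compiler:
try {
  df.filter("x is not null and x != ''").show()
} catch {
  case e: org.apache.spark.sql.AnalysisException =>
    println(s"caught only at run time: ${e.getMessage}")
}
```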
2) DataFrame syntax using the Column notion ($ with the spark.implicits._ import), which is partially compile-checked:
df.filter($"s" =!= "" || $"s" =!= null).show()
but in fact =!= ignores nulls (see <=> for null-safe comparison), hence the above is equivalent to:
df.filter($"s" =!= "").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
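Given that =!= follows SQL three-valued logic (a comparison with null yields null, which filter treats as false), a sketch of excluding both nulls and empty strings on the DataFrame side, assuming the same df, could combine isNotNull with =!=, or use the null-safe <=> operator:

```scala
// Assumes the same `df` as above. Both variants keep only rows where s is
// neither null nor the empty string (rows 1 and 2).
df.filter($"s".isNotNull && $"s" =!= "").show()

// Null-safe alternative: <=> never returns null, so both conditions are needed
// (null <=> "" is false, which plain negation would otherwise let through).
df.filter(!($"s" <=> "") && !($"s" <=> null)).show()
```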
3) dataset
val ds = df.as[NullStrings]
ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
Beware: if you use Option in the case class, you have to deal with the Option, not a plain String.
case class NullStringsOption(n: Int, s: Option[String])
val ds1 = df.as[NullStringsOption]
ds1.filter(_.s.exists(_.nonEmpty)).show()
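For completeness, the same Option-based filter can be written with an explicit pattern match, which spells out what exists does with the None case (a sketch assuming the ds1 defined above):

```scala
// Equivalent to ds1.filter(_.s.exists(_.nonEmpty)): a None (null) s is dropped,
// a Some("") is dropped, and only non-empty strings pass.
ds1.filter(_.s match {
  case Some(str) => str.nonEmpty
  case None      => false
}).show()
```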

