Scala Dataframe null check for columns

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40500732/

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Subhod Lagade

val new_df = df.filter($"type_interne" !== "" || $"type_interne" !== "null")

This gives me the error: value || is not a member of String.

When I use ===, the filter works well:

val new_df = df.filter($"type_interne" === "" || $"type_interne" === "null")

Answered by Raphael Roth

The problem seems to be operator precedence; try using parentheses:

val new_df = df.filter(($"type_interne" !== "") || ($"type_interne" !== null))

You can also write it like this:

val new_df = df.filter(($"type_interne" !== "") or $"type_interne".isNotNull)

Answered by Nikita

Though Raphael's answer was fully correct at the time of writing, Spark keeps evolving: the !== operator has been deprecated since version 2.0. You can use =!= instead, which solves the precedence problem above without parentheses. See the corresponding comments in the source code: https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320
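
For illustration, here is a minimal sketch of the question's filter rewritten with =!= (assuming Spark 2.0+ and the same df as in the question):

// Scala treats operators that end in '=' (and do not start with '=')
// as assignment-like, so !== gets the lowest precedence and the question's
// expression parsed as $"type_interne" !== ("" || ...), producing the
// "value || is not a member of String" error.
// =!= starts with '=', so it binds tighter than || and needs no parentheses:
val new_df = df.filter($"type_interne" =!= "" || $"type_interne".isNotNull)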

Detailed answer:
I'd also like to note something that was not obvious to me in the beginning. There are the notions of DataFrame (DF) and Dataset (DS), which divide their usage in the above context into:

1) string expressions interpreted by Catalyst (errors are caught only at run time) - this works for both DF and DS. The examples below use the case class:

case class NullStrings(n: Int, s: String)

import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
    (1, "abc"),
    (2, "ABC"),
    (3, null),
    (4, ""))
).toDF("n", "s")

df.filter("s is not null and s != ''").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

2) DataFrame syntax using the Column notion ($ with the spark.implicits._ import), which is partially compile-checked:

df.filter($"s" =!= "" || $"s" =!= null).show() 

but in fact =!= ignores nulls (see <=> for null-safe comparison), hence the filter above is equivalent to:

df.filter($"s" =!= "").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+
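
For completeness, here is a sketch of a null-safe version of the same filter using <=> (based on the df from example 1); unlike =!=, the <=> operator returns false rather than null when exactly one side is null:

// Keep rows where s is neither null nor empty, using null-safe equality;
// with the sample data this keeps (1, abc) and (2, ABC).
df.filter(!($"s" <=> "") && !($"s" <=> null)).show()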

3) Dataset syntax:

val ds = df.as[NullStrings]

ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

Beware: if you use Option in the case class, you have to deal with the Option, not a plain String.

case class NullStringsOption(n: Int, s: Option[String])

val ds1 = df.as[NullStringsOption]

ds1.filter(_.s.exists(_.nonEmpty)).show()
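
With the same data this should again print only the rows (1, abc) and (2, ABC); the Option wrapper moves the null handling into the type system, so a forgotten null check surfaces at compile time rather than at run time.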