Scala Dataframe null check for columns

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40500732/

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Subhod Lagade

val new_df = df.filter($"type_interne" !== "" || $"type_interne" !== "null")

This gives me the error: value || is not a member of String.

When I use ===, the filter works well:

val new_df = df.filter($"type_interne" === "" || $"type_interne" === "null")

Answered by Raphael Roth

The problem seems to be operator precedence; try using parentheses:

val new_df = df.filter(($"type_interne" !== "") || ($"type_interne" !== null))

You can also write it like this:

val new_df = df.filter(($"type_interne" !== "") or $"type_interne".isNotNull)

Answered by Nikita

Though Raphael's answer was fully correct at the time of writing, Spark keeps evolving: the !== operator has been deprecated since version 2.0. You can use =!= instead, which solves the precedence problem above without parentheses. See the corresponding comments in the source code: https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320
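
For illustration, here is a minimal sketch of the question's filter rewritten with =!= (assuming Spark 2.0+ and the same df as in the question):

// Scala treats operators that end in '=' (and do not start with '=')
// as assignment-like, so !== gets the lowest precedence and the question's
// expression parsed as $"type_interne" !== ("" || ...), producing the
// "value || is not a member of String" error.
// =!= starts with '=', so it binds tighter than || and needs no parentheses:
val new_df = df.filter($"type_interne" =!= "" || $"type_interne".isNotNull)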

Detailed answer:
I'd also like to note something that was not obvious to me in the beginning. There are the notions of DataFrame (DF) and Dataset (DS), which divide their usage in the above context into:

1) string expressions interpreted by Catalyst (errors are caught only at run time) - this works for both DF and DS. The examples below use the case class:

case class NullStrings(n: Int, s: String)

import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
    (1, "abc"),
    (2, "ABC"),
    (3, null),
    (4, ""))
).toDF("n", "s")

df.filter("s is not null and s != ''").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

2) DataFrame syntax using the Column notion ($ with the spark.implicits._ import), which is partially compile-checked:

df.filter($"s" =!= "" || $"s" =!= null).show() 

but in fact =!= ignores nulls (see <=> for null-safe comparison), hence the filter above is equivalent to:

df.filter($"s" =!= "").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+
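
For completeness, here is a sketch of a null-safe version of the same filter using <=> (based on the df from example 1); unlike =!=, the <=> operator returns false rather than null when exactly one side is null:

// Keep rows where s is neither null nor empty, using null-safe equality;
// with the sample data this keeps (1, abc) and (2, ABC).
df.filter(!($"s" <=> "") && !($"s" <=> null)).show()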

3) Dataset syntax:

val ds = df.as[NullStrings]

ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

Beware: if you use Option in the case class, you have to deal with the Option, not a plain String.

case class NullStringsOption(n: Int, s: Option[String])

val ds1 = df.as[NullStringsOption]

ds1.filter(_.s.exists(_.nonEmpty)).show()
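
With the same data this should again print only the rows (1, abc) and (2, ABC); the Option wrapper moves the null handling into the type system, so a forgotten null check surfaces at compile time rather than at run time.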