Note: this page is based on a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/31799099/

Replacing null values with 0 after spark dataframe left outer join

Tags: scala, join, apache-spark, spark-dataframe

Asked by Mihir Shinde

I have two DataFrames called left and right.

scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)

scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)

Then I join them with a left outer join to get the joined DataFrame. Anyone interested in the natjoin function can find it here: https://gist.github.com/anonymous/f02bd79528ac75f57ae8

scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")

scala> joinedData.printSchema
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)

Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.

scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])

I want to replace the null values in the real_labelVal column with 1.0.

Currently I do the following:

  1. I find the index of the real_labelVal column and use the spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
  2. Then I apply the schema of the joined DataFrame to that RDD to get the cleaned DataFrame.

The code is as follows:

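 import org.apache.spark.sql.Row

 // Position of real_labelVal in the joined schema
 val real_labelval_index = 3

 // Copy the row and overwrite the null label with 1.0
 def replaceNull(row: Row): Row = {
   val rowArray = row.toSeq.toArray
   rowArray(real_labelval_index) = 1.0
   Row.fromSeq(rowArray)
 }

 // Map over the underlying RDD[Row], then reapply the original schema
 val cleanRowRDD = joinedData.rdd.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
 val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)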

Is there a more elegant or efficient way to do this?

Googling hasn't helped much. Thanks in advance.

Answered by Justin Pihony

Have you tried using na? It exposes DataFrameNaFunctions, whose fill method replaces nulls in the columns you list:

joinedData.na.fill(1.0, Seq("real_labelval"))
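
To see the na.fill call in context, here is a minimal sketch, assuming Spark 2.x or later and a local SparkSession. The sample rows, the session settings, and the coalesce variant are illustrative assumptions and are not part of the original question or answer; the built-in join stands in for the natjoin helper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a throwaway local session; in spark-shell one already exists as `spark`.
val spark = SparkSession.builder().master("local[1]").appName("na-fill-sketch").getOrCreate()
import spark.implicits._

// Made-up rows that mirror the schemas in the question.
val left = Seq((1.0, 1.0, 0.9), (2.0, 0.0, 0.2)).toDF("user_uid", "labelVal", "probability_score")
val right = Seq((1.0, 0.0)).toDF("user_uid", "real_labelVal")

// Left outer join on the shared key; user_uid 2.0 has no match, so its real_labelVal is null.
val joined = left.join(right, Seq("user_uid"), "left_outer")

// Option 1: fill nulls with 1.0, restricted to the listed column.
val filled = joined.na.fill(1.0, Seq("real_labelVal"))

// Option 2: coalesce returns its first non-null argument, giving the same result.
val coalesced = joined.withColumn("real_labelVal", coalesce(col("real_labelVal"), lit(1.0)))

filled.show()
```

Both variants stay within the DataFrame API, so there is no need to drop down to RDD[Row] and reapply the schema by hand. na.fill is the shorter choice when a constant default is enough; coalesce may be preferable when the fallback should come from another column.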