Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/32357774/

Time: 2020-10-22 07:33:06  Source: igfitidea

Scala: How can I replace value in Dataframes using scala

Tags: scala, apache-spark, dataframe

Asked by Tong

For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

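+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+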

This is my DataFrame. I'm trying to change Tesla in the make column to S.

Accepted answer by ccheneson

Note: As mentioned by Olivier Girardot, this answer is not optimized; the withColumn solution is the one to use (see Azeroth2b's answer).

I cannot delete this answer because it has been accepted.

Here is my take on this one:

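import org.apache.spark.sql.{Row, SQLContext}

// sc is assumed to be an existing SparkContext (for example, the one spark-shell provides)
val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()

dataframe.foreach(println)

// Spark 1.x: DataFrame.map returns an RDD[Row]; on Spark 2.x and later, call dataframe.rdd.map instead
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  // case-insensitive match on the make column (index 1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]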

You can actually use map directly on the DataFrame.

So you basically check column 1 for the String tesla. If it's tesla, use the value S for make; otherwise use the current value of column 1.

Then build a Row with all the data from the original row, using the (zero-based) indexes (Row(row(0),make,row(2)) in my example).

There is probably a better way to do it. I am not that familiar with the Spark umbrella yet.

Answer by Azeroth2b

Spark 1.6.2, Java code (sorry). This changes every instance of Tesla in the make column to S without passing through an RDD:

// requires: import static org.apache.spark.sql.functions.*;
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add the "otherwise" clause from @marshall245, which ensures that non-Tesla values aren't converted to NULL.

Answer by marshall245

Building on the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.

import org.apache.spark.sql.functions._
val newsdf = sdf.withColumn("make", when(col("make") === "Tesla", "S")
                                   .otherwise(col("make"))
                           );

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
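
The original question also asked about replacing a numeric value (0.2 with 0). The same when/otherwise pattern should carry over; the following is only a sketch under some assumptions: it uses a local SparkSession (Spark 2.x or later) instead of the SQLContext used elsewhere on this page, and the column name "score", the sample values, and the helper name replaceValue are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object ReplaceNumericValue {
  // Hypothetical helper: rows where "score" equals 0.2 become 0.0,
  // every other row keeps its current value.
  def replaceValue(df: DataFrame): DataFrame =
    df.withColumn(
      "score",
      when(col("score") === 0.2, lit(0.0)).otherwise(col("score"))
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("replace-numeric")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with a single Double column.
    val df = Seq(0.2, 0.5, 0.2, 1.0).toDF("score")

    replaceValue(df).show()
    spark.stop()
  }
}
```

Note that this is an exact equality test on doubles, so it only matches values stored as exactly 0.2; values that are merely close to 0.2 after arithmetic would need a tolerance-based condition instead.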

Answer by Al M

This can be achieved on DataFrames with user-defined functions (UDFs).

import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
      """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
      """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
      """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
    )))

val makeSIfTesla = udf {(make: String) => 
  if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
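
The snippet above targets the Spark 1.x SQLContext API, and jsonRDD is not available in newer Spark releases. As a rough sketch of how the same UDF might look with a SparkSession (Spark 2.x or later), assuming a local session and building the sample rows with toDF rather than from JSON; the object name UdfReplace and the helper replaceMake are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object UdfReplace {
  // Same logic as the UDF above: "Tesla" becomes "S",
  // anything else (including null) is passed through unchanged.
  val makeSIfTesla = udf { (make: String) =>
    if (make == "Tesla") "S" else make
  }

  // Hypothetical helper that applies the UDF to the "make" column.
  def replaceMake(df: DataFrame): DataFrame =
    df.withColumn("make", makeSIfTesla(col("make")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-replace")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      (2012, "Tesla", "S"),
      (1997, "Ford", "E350"),
      (2015, "Chevy", "Volt")
    ).toDF("year", "make", "model")

    replaceMake(df1).show()
    spark.stop()
  }
}
```

A UDF is opaque to Spark's optimizer, so for a plain equality replacement like this one the when/otherwise form from the other answers is usually the simpler choice.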

Answer by Akshay Pandya

df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()

replace is defined in class DataFrameNaFunctions with the signature [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame

To run this function, you must have an active Spark object and a DataFrame with headers on.
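
The answer does not show how df2 was created, so here is a minimal self-contained sketch of na.replace, assuming a local SparkSession (Spark 2.x or later). The sample rows, the "score" column, and the helper names renameNames and zeroOut are hypothetical. The second helper applies the same call to the numeric case from the question (0.2 replaced by 0).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaReplaceExample {
  // String replacement on the "Name" column, as in the answer above.
  def renameNames(df: DataFrame): DataFrame =
    df.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita"))

  // Numeric replacement on a hypothetical "score" column: 0.2 becomes 0.0.
  def zeroOut(df: DataFrame): DataFrame =
    df.na.replace("score", Map(0.2 -> 0.0))

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("na-replace")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with named columns.
    val df2 = Seq(("John", 0.2), ("Cindy", 0.5), ("Maria", 0.2))
      .toDF("Name", "score")

    zeroOut(renameNames(df2)).show()
    spark.stop()
  }
}
```

Values that do not appear as keys in the map are left as they are, so no otherwise-style fallback is needed with this approach.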