Scala: How to replace values in DataFrames using Scala

Question

Asked by Tong

For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks

Edit:

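+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+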

This is my DataFrame. I'm trying to change Tesla in the make column to S.

Answer 1

Accepted answer by ccheneson

Note: As mentioned by Olivier Girardot, this answer is not optimized, and the withColumn solution is the one to use (see Azeroth2b's answer).

I cannot delete this answer because it has been accepted.

Here is my take on this one:

  import org.apache.spark.sql.{Row, SQLContext}

  val rdd = sc.parallelize(
    List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
  )
  val sqlContext = new SQLContext(sc)

  // This is used to implicitly convert an RDD to a DataFrame.
  import sqlContext.implicits._

  val dataframe = rdd.toDF()

  dataframe.foreach(println)

  // On Spark 1.x, DataFrame.map returns an RDD[Row].
  // Column 1 (make) becomes "S" when it equals "tesla", ignoring case.
  dataframe.map(row => {
    val row1 = row.getAs[String](1)
    val make = if (row1.toLowerCase == "tesla") "S" else row1
    Row(row(0), make, row(2))
  }).collect().foreach(println)

  // [2012,S,S]
  // [1997,Ford,E350]
  // [2015,Chevy,Volt]

You can actually call map directly on the DataFrame.

So you basically check column 1 for the String tesla. If it is tesla, use the value S for make; otherwise, use the current value of column 1.

Then build a Row with all the data from the original row, using zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it. I am not that familiar with the Spark ecosystem yet.
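As a point of comparison, the same row-by-row replacement can be sketched with a typed Dataset. This assumes Spark 2.x or later with a SparkSession, which the answer does not use; the Car case class and the object name are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring the three columns used in the answer.
case class Car(year: Int, make: String, model: String)

object ReplaceWithTypedMap {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession (Spark 2.x or later).
    val spark = SparkSession.builder()
      .appName("replace-with-typed-map")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val cars = Seq(
      Car(2012, "Tesla", "S"),
      Car(1997, "Ford", "E350"),
      Car(2015, "Chevy", "Volt")
    ).toDS()

    // Replace make with "S" when it equals "tesla" (case-insensitive);
    // every other row is returned unchanged.
    val replaced = cars.map { car =>
      if (car.make.equalsIgnoreCase("tesla")) car.copy(make = "S") else car
    }

    replaced.show()
    spark.stop()
  }
}
```

Using copy on the case class keeps the other fields intact without referring to columns by index.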

Answer 2

Answered by Azeroth2b

Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:

// Requires: import static org.apache.spark.sql.functions.*;
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add the "otherwise" clause suggested by @marshall245, so that non-Tesla values aren't converted to NULL.

Answer 3

Answered by marshall245

Building on the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.

import org.apache.spark.sql.functions._
val newsdf = sdf.withColumn("make",
  when(col("make") === "Tesla", "S").otherwise(col("make")))

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
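To replace more than one value in the same column, when clauses can be chained before the final otherwise. The following is a minimal sketch, assuming Spark 2.x or later with a SparkSession; the second replacement ("Ford" to "F") and the object name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object ReplaceSeveralValues {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession (Spark 2.x or later).
    val spark = SparkSession.builder()
      .appName("replace-several-values")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sdf = Seq(
      ("Tesla", "S"),
      ("Ford", "E350"),
      ("Chevy", "Volt")
    ).toDF("make", "model")

    // Each when handles one replacement; otherwise keeps the original
    // value so that unmatched rows do not become null.
    val newsdf = sdf.withColumn("make",
      when(col("make") === "Tesla", "S")
        .when(col("make") === "Ford", "F") // hypothetical second replacement
        .otherwise(col("make")))

    newsdf.show()
    spark.stop()
  }
}
```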

Answer 4

Answered by Al M

This can be achieved on DataFrames with user-defined functions (UDFs).

import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
      """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
      """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
      """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
    )))

val makeSIfTesla = udf {(make: String) => 
  if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
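The same UDF idea can be generalized to several replacements by looking values up in a Map. This is only a sketch under assumptions: it uses a SparkSession (Spark 2.x or later) instead of the SQLContext above, and the replacements map, including the "Chevy" entry, is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ReplaceWithLookupUdf {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession (Spark 2.x or later).
    val spark = SparkSession.builder()
      .appName("replace-with-lookup-udf")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (2012, "Tesla", "S"),
      (1997, "Ford", "E350"),
      (2015, "Chevy", "Volt")
    ).toDF("year", "make", "model")

    // Hypothetical lookup table: old value -> new value.
    val replacements = Map("Tesla" -> "S", "Chevy" -> "C")

    // Values missing from the map are returned unchanged.
    val replaceMake = udf((make: String) => replacements.getOrElse(make, make))

    df.withColumn("make", replaceMake(col("make"))).show()
    spark.stop()
  }
}
```

As the note on the accepted answer suggests, the built-in when/otherwise functions are generally the first thing to try; a UDF like this is mostly useful when the replacement logic does not fit built-in expressions.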

Answer 5

Answered by Akshay Pandya

df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()

Here, replace is a method of class DataFrameNaFunctions with the type [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame

To run this function, you need an active Spark session and a DataFrame loaded with headers on, so that the column can be referred to by name.
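The numeric case from the original question (replacing 0.2 with 0) can be sketched the same way. This assumes Spark 2.x or later with a SparkSession; the DataFrame and its column names (id, value) are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object ReplaceNumericValue {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession (Spark 2.x or later).
    val spark = SparkSession.builder()
      .appName("replace-numeric-value")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data with a Double column named "value".
    val df = Seq(("a", 0.2), ("b", 0.5), ("c", 0.2)).toDF("id", "value")

    // Option 1: na.replace with a Map of old value -> new value.
    val viaNaReplace = df.na.replace("value", Map(0.2 -> 0.0))

    // Option 2: when/otherwise, as in the withColumn answers.
    val viaWhen = df.withColumn("value",
      when(col("value") === 0.2, 0.0).otherwise(col("value")))

    viaNaReplace.show()
    viaWhen.show()
    spark.stop()
  }
}
```

Both options compare floating-point values for exact equality, so a value that is only approximately 0.2 (for example, the result of an earlier computation) would not be matched.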
