Scala: How can I replace value in Dataframes using Scala
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/32357774/
Asked by Tong
For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.
Edit:
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
This is my DataFrame. I'm trying to change "Tesla" in the make column to "S".
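The answers below address the string case from the edit; for the numeric case originally asked (replacing 0.2 with 0), a minimal sketch of the same when/otherwise idiom (the column name `rate`, the sample values, and the Spark 2.x `SparkSession` setup are assumptions, not from the original question):

```scala
// Sketch only: assumes Spark 2.x+ and a hypothetical "rate" column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().appName("replace").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(0.2, 0.5, 0.2).toDF("rate")
// Replace exact matches of 0.2 with 0.0; otherwise(...) keeps all other values intact.
df.withColumn("rate", when(col("rate") === 0.2, 0.0).otherwise(col("rate"))).show()
```

Note that exact equality on doubles is fragile; for values produced by arithmetic, a tolerance check may be safer.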
Accepted answer by ccheneson
Note:
As mentioned by Olivier Girardot, this answer is not optimized; the withColumn solution (Azeroth2b's answer) is the one to use.
Can not delete this answer as it has been accepted.
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._

val dataframe = rdd.toDF()
dataframe.foreach(println)

dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.
So you basically check column 1 for the String tesla.
If it's tesla, use the value S for make; otherwise, keep the current value of column 1.
Then build a tuple with all the data from the row, using the (zero-based) indexes (Row(row(0), make, row(2)) in my example).
There is probably a better way to do it. I am not that familiar yet with the Spark umbrella.
Answered by Azeroth2b
Spark 1.6.2, Java code (sorry). This will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
    .otherwise(col("make")));
Edited to add @marshall245's otherwise(...) to ensure non-Tesla values aren't converted to NULL.
Answered by marshall245
Building off of the solution from @Azeroth2b. If you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._

val newsdf = sdf.withColumn("make", when(col("make") === "Tesla", "S")
  .otherwise(col("make")))
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
Answered by Al M
This can be achieved in dataframes with user-defined functions (udf).
import org.apache.spark.sql.functions._

val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Answered by Akshay Pandya
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
replace, in class DataFrameNaFunctions, has the signature [T](col: String, replacement: Map[T,T]): org.apache.spark.sql.DataFrame
To run this function you must have an active Spark object and a DataFrame with headers on.
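As an end-to-end sketch of how the na.replace call above might be used (the column name Name matches the snippet, but the sample values and the Spark 2.x SparkSession setup are assumptions added for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("naReplace").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical single-column DataFrame; values not in the Map pass through unchanged.
val df2 = Seq("John", "Cindy", "Bob").toDF("Name")
df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()
```

Unlike the when/otherwise approach, na.replace takes a whole Map of replacements at once, which is convenient when several distinct values need rewriting in the same column.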

