Original question: http://stackoverflow.com/questions/46087420/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Scala add new column to dataframe by expression
Asked by Robin Wang
I am going to add a new column to a dataframe with an expression. For example, I have a dataframe of
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with the expression "C2/C3+C4", assuming there are several new columns that need to be added, and the expressions may be different and come from a database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.
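For reference, a minimal sketch of that ToolBox-based evaluation (assuming the scala-compiler/scala-reflect jars are on the classpath; this only evaluates plain Scala expression strings, not dataframe columns):

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Parse and evaluate an arithmetic expression string at runtime
val toolbox = currentMirror.mkToolBox()
val result = toolbox.eval(toolbox.parse("2+3*4")) // returns 14 as Any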
And normally I am using df.withColumn to add a new column.
It seems I need to create a UDF, but how can I pass the column values as parameters to the UDF? Especially since multiple expressions may need different columns for their calculations.
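For context, a minimal sketch of what such a UDF call could look like (assuming the dataframe above is named df, that C2, C3, C4 are integer columns, and that the UDF name is only illustrative; column values are passed simply by listing the columns as arguments when applying the UDF):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical UDF: the column values arrive as ordinary Scala parameters
val ratioPlusC4 = udf((c2: Int, c3: Int, c4: Int) => c2.toDouble / c3 + c4)

df.withColumn("C5", ratioPlusC4(col("C2"), col("C3"), col("C4")))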
Answered by Raphael Roth
This can be done using expr to create a Column from an expression:
import org.apache.spark.sql.functions.expr
import spark.implicits._ // for .toDF (assumes a SparkSession named spark)

val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
df.withColumn("z", expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
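For the questioner's case of several expression strings coming from a database, the same expr call can simply be folded over the dataframe; a sketch, where the (columnName, expressionString) pairs are illustrative placeholders for whatever the database returns:

val expressionsFromDb = Seq("z" -> "x+y", "w" -> "x*y") // placeholder values

val enriched = expressionsFromDb.foldLeft(df) { case (acc, (name, e)) =>
  acc.withColumn(name, expr(e))
}
enriched.show()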
Answered by Rajesh Gupta
Two approaches:
import spark.implicits._ // so that you can use .toDF
import org.apache.spark.sql.functions._

val df = Seq(
  ("steak", 1, 1, 150),
  ("steak", 2, 2, 180),
  ("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")

// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()

// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
Answered by Vidura Mudalige
In Spark 2.x, you can create a new column C5 with the expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._:
import spark.implicits._ // for .toDF (assumes a SparkSession named spark)
import org.apache.spark.sql.functions._

val currentDf = Seq(
  ("steak", 1, 1, 150),
  ("steak", 2, 2, 180),
  ("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")

val requiredDf = currentDf
  .withColumn("C5", col("C2") / col("C3") + col("C4"))
Also, you can do the same using org.apache.spark.sql.Column as well. (But the space complexity is a bit higher in this approach than with org.apache.spark.sql.functions._, due to the Column object creation.)
import org.apache.spark.sql.Column

val requiredDf = currentDf
  .withColumn("C5", new Column("C2") / new Column("C3") + new Column("C4"))
This worked perfectly for me. I am using Spark 2.0.2.

