Scala Spark: Add column to dataframe conditionally
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34908448/
Spark: Add column to dataframe conditionally
Asked by mcmcmc
I am trying to take my input data:
A    B     C
--------------
4    blah  2
2          3
56   foo   3
And add a column to the end based on whether B is empty or not:
A    B     C    D
--------------------
4    blah  2    1
2          3    0
56   foo   3    1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
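For context, a minimal sketch of the temp-table approach mentioned above (assuming a SQLContext named sqlContext and a DataFrame df with columns A, B, C; the table name "input" is made up for illustration):

// register the DataFrame as a temp table, then express the condition in SQL
df.registerTempTable("input")
val withD = sqlContext.sql(
  "SELECT A, B, C, CASE WHEN B IS NULL OR B = '' THEN 0 ELSE 1 END AS D FROM input")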
I've tried .withColumn, but I can't get that to do what I want.
Answered by emeth
Try withColumn with the function when, as follows:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._ // for `when`

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
  .toDF("A", "B", "C")

val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows:
+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+
I added the (100, null, 5) row to test the isNull case.
I tried this code with Spark 1.6.0, but as noted in the code of when, it works on versions from 1.4.0 onward.
Answered by Roberto Congiu
My bad, I had missed one part of the question.
The best, cleanest way is to use a UDF. The explanation is within the code.
// create some example data as a DataFrame
// note: the third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
  .map { x => Stuff(x._1, x._2) }).toDF

// now the good stuff
import org.apache.spark.sql.functions.udf

// function that returns 0 if the string is empty, 1 otherwise
val func = udf((s: String) => if (s.isEmpty) 0 else 1)

// create a new dataframe with an added column named "notempty"
val r = d.select($"a", $"b", func($"a").as("notempty"))
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
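One caveat worth noting (this concerns hypothetical null inputs, not the sample data above): the UDF receives the raw String, so a null in column a would make s.isEmpty throw a NullPointerException. A null-safe sketch of the same function:

// hypothetical null-safe variant of func; treats null like an empty string
val safeFunc = udf((s: String) => if (s == null || s.isEmpty) 0 else 1)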
Answered by Justin Pihony
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
  case Array() => df
  case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.
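A hedged variant on the same idea (an assumption about the desired output, not part of the original answer): if the 0/1 integers from the question are wanted rather than a boolean, the emptiness test can be negated and cast:

import org.apache.spark.sql.functions.not

// hypothetical variant: negate the emptiness test and cast the boolean to 0/1
val newDF2 = df.withColumn("D", not($"B".isNull or $"B" === "").cast("int"))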

