Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34908448/

Spark: Add column to dataframe conditionally

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by mcmcmc

I am trying to take my input data:

A    B       C
--------------
4    blah    2
2            3
56   foo     3

And add a column to the end based on whether B is empty or not:

A    B       C     D
--------------------
4    blah    2     1
2            3     0
56   foo     3     1

I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.

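For reference, a minimal sketch of that temp-table route (assuming the Spark 1.x registerTempTable API, a sqlContext in scope, and a made-up table name "input") might look like:

df.registerTempTable("input")
val withD = sqlContext.sql(
  "SELECT A, B, C, CASE WHEN B IS NULL OR B = '' THEN 0 ELSE 1 END AS D FROM input")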

But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.

I've tried .withColumn, but I can't get that to do what I want.

回答by emeth

Try withColumn with the function when, as follows:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and the $"" column syntax
import org.apache.spark.sql.functions._ // for `when`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
    .toDF("A", "B", "C")

val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))

newDf.show() shows

+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+

I added the (100, null, 5) row for testing the isNull case.

I tried this code with Spark 1.6.0, but as noted in the source of when, it works on versions after 1.4.0.

Answered by Roberto Congiu

My bad, I had missed one part of the question.

The best, cleanest way is to use a UDF. The explanation is within the code.

// create some example data as a DataFrame
// note: the third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
  .map { x => Stuff(x._1, x._2) }).toDF

// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty, 1 otherwise
val func = udf((s: String) => if (s.isEmpty) 0 else 1)
// create a new dataframe with an added column named "notempty"
val r = d.select($"a", $"b", func($"a").as("notempty"))

scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
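
Note that this UDF assumes the column never contains null: calling s.isEmpty on a null String throws a NullPointerException at runtime. A null-safe variant (my own sketch, not part of the original answer) treats null like the empty string:

// hypothetical null-safe variant: map both null and "" to 0
val funcSafe = udf((s: String) => if (s == null || s.isEmpty) 0 else 1)
val rSafe = d.select($"a", $"b", funcSafe($"a").as("notempty"))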

Answered by Justin Pihony

How about something like this?

// needs: import org.apache.spark.sql.functions.when
val newDF = df.filter($"B" === "").take(1) match {
  case Array() => df // no empty strings in B, so leave df unchanged
  case _ => df.withColumn("D", when($"B" === "", 0).otherwise(1))
}

Using take(1) should have a minimal performance hit, since Spark only needs to find a single matching row.

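For completeness, the same existence check can be written as a plain boolean; limit and count are standard DataFrame methods, but treat this as an untested variant rather than the author's code:

// hypothetical variant: stop scanning after the first empty B
val hasEmptyB = df.filter($"B" === "").limit(1).count() > 0
val newDF2 = if (hasEmptyB) df.withColumn("D", when($"B" === "", 0).otherwise(1)) else df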