Scala Apache Spark: add a "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/30783517/

Date: 2020-10-22 07:14:57  Source: igfitidea

Apache Spark, add an "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Leonardo Biagioli

I'm trying to add a "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame, using the Scala API. Starting dataframe:


color
Red
Green
Blue

Desired dataframe (SQL syntax: CASE WHEN color = 'Green' THEN 1 ELSE 0 END AS bool):


color bool
Red   0
Green 1
Blue  0

How should I implement this logic?


Answered by Herman

In the upcoming Spark 1.4.0 release (due out in the next couple of days), you can use the when/otherwise syntax:


// Imports needed outside spark-shell: the when function, plus implicits for $ and toDF
import org.apache.spark.sql.functions.when
import sqlContext.implicits._

// Create the dataframe
val df = Seq("Red", "Green", "Blue").map(Tuple1.apply).toDF("color")

// Use when/otherwise syntax
val df1 = df.withColumn("Green_Ind", when($"color" === "Green", 1).otherwise(0))

If you are using Spark 1.3.0 you can choose to use a UDF:


// Define the UDF (requires the udf function from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions.udf

val isGreen = udf((color: String) => {
  if (color == "Green") 1
  else 0
})
val df2 = df.withColumn("Green_Ind", isGreen($"color"))
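Note that when can also be chained for a multi-branch CASE WHEN. A sketch, assuming the df and $ syntax from the answer above (the dfChained name and the Blue branch are illustrative, not from the original question):

// CASE WHEN color = 'Green' THEN 1 WHEN color = 'Blue' THEN 2 ELSE 0 END
val dfChained = df.withColumn("color_code",
  when($"color" === "Green", 1)
    .when($"color" === "Blue", 2)
    .otherwise(0))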

Answered by Robert Chevallier

In Spark 1.5.0 you can also use the expr function with SQL syntax:


// String comparison is case-sensitive, so match the data's 'Green' exactly
import org.apache.spark.sql.functions.expr

val df3 = df.withColumn("Green_Ind", expr("case when color = 'Green' then 1 else 0 end"))

or plain spark-sql


df.registerTempTable("data")
val df4 = sqlContext.sql(""" select *, case when color = 'Green' then 1 else 0 end as Green_ind from data """)
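In Spark 2.0+, registerTempTable was deprecated in favor of createOrReplaceTempView, and queries go through the SparkSession. A rough equivalent sketch, assuming a session named spark:

// Register the view and run the same CASE WHEN query via the SparkSession
df.createOrReplaceTempView("data")
val df4 = spark.sql("select *, case when color = 'Green' then 1 else 0 end as Green_ind from data")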

Answered by ozma

I found this:


https://issues.apache.org/jira/browse/SPARK-3813


This worked for me on Spark 2.1.0:


// Record is a simple case class (not shown in the original answer)
case class Record(key: Int, value: String)

import spark.implicits._  // for toDF on the RDD; in Spark 2.x an RDD must become a DataFrame first
val rdd = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i")))
rdd.toDF().createOrReplaceTempView("records")
println("Result of SELECT *:")
sql("SELECT case key when '93' then 'ravi' else key end FROM records").collect()

Answered by Ehud Lev

I was looking for this for a long time, so here is a Spark 2.1 Java example with groupBy, for other Java users.


import static org.apache.spark.sql.functions.*;
// ...
// Reusable boolean conditions (basicEventDataset and group_field are assumed
// to be defined elsewhere in the caller's code)
Column uniqTrue = col("uniq").equalTo(true);
Column uniqFalse = col("uniq").equalTo(false);

Column testModeFalse = col("testMode").equalTo(false);
Column testModeTrue = col("testMode").equalTo(true);

// Count rows per group for each (testMode, uniq) combination
Dataset<Row> x = basicEventDataset
        .groupBy(col(group_field))
        .agg(
                sum(when(testModeTrue.and(uniqTrue), 1).otherwise(0)).as("tt"),
                sum(when(testModeFalse.and(uniqTrue), 1).otherwise(0)).as("ft"),
                sum(when(testModeTrue.and(uniqFalse), 1).otherwise(0)).as("tf"),
                sum(when(testModeFalse.and(uniqFalse), 1).otherwise(0)).as("ff"));
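For Scala users, a rough equivalent of the grouped conditional counts might look like the sketch below (basicEventDataset and the group_field column name are assumptions carried over from the Java snippet):

import org.apache.spark.sql.functions.{col, sum, when}

// Count rows per group for each combination of testMode and uniq
val counts = basicEventDataset
  .groupBy(col("group_field"))
  .agg(
    sum(when(col("testMode") === true && col("uniq") === true, 1).otherwise(0)).as("tt"),
    sum(when(col("testMode") === false && col("uniq") === true, 1).otherwise(0)).as("ft"),
    sum(when(col("testMode") === true && col("uniq") === false, 1).otherwise(0)).as("tf"),
    sum(when(col("testMode") === false && col("uniq") === false, 1).otherwise(0)).as("ff"))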