Scala: How to use a constant value in a UDF of Spark SQL (DataFrame)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29406913/

How to use constant value in UDF of Spark SQL(DataFrame)

scala apache-spark apache-spark-sql

Asked by emeth

I have a DataFrame that includes a timestamp column. To aggregate by time (minute, hour, or day), I have tried:

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // period = 1 hour
})

val df: DataFrame // the dataframe
df.groupBy(toSegment($"timestamp")).count()

This works fine.

My question is how to generalize the UDF toSegment as:

val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

I have tried the following, but it doesn't work:

df.groupBy(toSegmentGeneralized($"timestamp", $"3600000")).count()

It seems to look for a column named 3600000, since $"3600000" is interpreted as a column reference.

A possible solution is to use a constant column, but I couldn't find how to create one.

Answered by Spiro Michaylov

You can use org.apache.spark.sql.functions.lit() to create the constant column:

import org.apache.spark.sql.functions._

df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()
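
For reference, here is a minimal self-contained sketch that puts the pieces together. It assumes a SparkSession named spark already exists (e.g. in spark-shell); the column name "timestamp" and the sample values are made up for illustration.

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._   // enables the $"..." column syntax and .toDF

// Generalized UDF: truncate a millisecond timestamp to the start of a period-ms segment
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

// Hypothetical sample data standing in for the real DataFrame
val df = Seq("1428316200000", "1428318000000", "1428319800000").toDF("timestamp")

// The period is passed as a constant column via lit(), not as $"3600000"
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count().show()

Run in spark-shell, this groups the sample rows into hourly buckets; passing lit(60000) instead would give one-minute buckets.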