Scala: How to use a constant value in a UDF of Spark SQL (DataFrame)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/29406913/

How to use constant value in UDF of Spark SQL(DataFrame)

scala apache-spark apache-spark-sql

Asked by emeth

I have a DataFrame that includes a timestamp column. To aggregate by time (minute, hour, or day), I have tried:

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // period = 1 hour
})

val df: DataFrame // the dataframe
df.groupBy(toSegment($"timestamp")).count()

This works fine.

My question is how to generalize the UDF toSegment as:

val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

I have tried the following, but it doesn't work:

df.groupBy(toSegmentGeneralized($"timestamp", $"3600000")).count()

It seems to look for a column named 3600000, since $"3600000" is interpreted as a column reference.

A possible solution is to use a constant column, but I couldn't find how to create one.

Answered by Spiro Michaylov

You can use org.apache.spark.sql.functions.lit() to create the constant column:

import org.apache.spark.sql.functions._

df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()
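
For reference, here is a minimal self-contained sketch that puts the pieces together. It assumes a SparkSession named spark already exists (e.g. in spark-shell); the column name "timestamp" and the sample values are made up for illustration.

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._   // enables the $"..." column syntax and .toDF

// Generalized UDF: truncate a millisecond timestamp to the start of a period-ms segment
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

// Hypothetical sample data standing in for the real DataFrame
val df = Seq("1428316200000", "1428318000000", "1428319800000").toDF("timestamp")

// The period is passed as a constant column via lit(), not as $"3600000"
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count().show()

Run in spark-shell, this groups the sample rows into hourly buckets; passing lit(60000) instead would give one-minute buckets.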