scala - How to use a constant value in a UDF of Spark SQL (DataFrame)
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must keep the same CC BY-SA license, link the original URL, and attribute the content to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/29406913/
Asked by emeth
I have a DataFrame that includes a timestamp column. To aggregate by time (minute, hour, or day), I have tried:
import org.apache.spark.sql.functions.udf

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // period = 1 hour, in milliseconds
})
val df: DataFrame // the dataframe
df.groupBy(toSegment($"timestamp")).count()
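To see what the truncation does, here is a quick sanity check of the modulo arithmetic, using a made-up epoch-millisecond value (not from the original post):

// Truncating an epoch-millisecond timestamp to the start of its hour:
val ts = 1428000123456L
val segment = ts - ts % 3600000L // => 1427997600000L, the start of the enclosing hour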
This works fine.
My question is how to generalize the UDF toSegment as:
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})
I have tried the following, but it doesn't work:
df.groupBy(toSegment($"timestamp", $"3600000")).count()
It seems to look for a column named 3600000, because $"3600000" is parsed as a column reference rather than a literal value.
A possible solution would be to use a constant column, but I couldn't find how to create one.
Answered by Spiro Michaylov
You can use org.apache.spark.sql.functions.lit() to create the constant column:
import org.apache.spark.sql.functions._
df.groupBy(toSegment($"timestamp", lit(3600000))).count()
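For completeness, here is a minimal self-contained sketch of the whole solution. It assumes a recent Spark version with SparkSession; the object name, sample timestamps, and app name are illustrative, not from the original post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

object SegmentExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("constant-in-udf")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: epoch-millisecond timestamps stored as strings.
    val df = Seq("1428000123456", "1428001234567", "1428004321098").toDF("timestamp")

    // Generalized UDF: truncate a timestamp to the start of its period.
    val toSegmentGeneralized = udf((timestamp: String, period: Long) => {
      val asLong = timestamp.toLong
      asLong - asLong % period
    })

    // lit() wraps the constant 3600000 (1 hour in ms) in a Column,
    // which is what the UDF call expects.
    df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000L))).count().show()

    spark.stop()
  }
}

As a design note, lit() is needed because UDF arguments must be Columns; an alternative pattern is to capture the period in a Scala closure, e.g. def toSegment(period: Long) = udf((ts: String) => ts.toLong - ts.toLong % period), at the cost of building a separate UDF per period value.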

