scala - Find minimum for a timestamp through Spark groupBy dataframe
Original URL: http://stackoverflow.com/questions/36427212/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Find minimum for a timestamp through Spark groupBy dataframe
Asked by Jake Fund
When I try to group my dataframe on a column and then find the minimum for each group with groupbyDatafram.min('timestampCol'), it appears I cannot do it on non-numerical columns. How, then, can I properly filter for the minimum (earliest) date in the groupBy?
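For reference, here is a minimal sketch of the pattern described above (hypothetical df and column names), which is rejected because the grouped min helper only considers numeric columns:

import org.apache.spark.sql.DataFrame

// Hypothetical dataframe with an id column and a timestamp column ts.
val df: DataFrame = ???

// The grouped min helper only looks at numeric columns, so calling it on a
// timestamp column typically fails with an AnalysisException saying the
// column is not numeric.
df.groupBy("id").min("ts")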
I am streaming the dataframe from a postgresql S3 instance, so that data is already configured.
Answered by zero323
Just perform the aggregation directly instead of using the min helper:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???
import sqlContext.implicits._

// Sample data: one id with two timestamps, cast to a proper timestamp column.
val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

// Aggregate with the min function rather than the grouped min helper.
df.groupBy($"id").agg(min($"ts")).show
// +---+--------------------+
// | id| min(ts)|
// +---+--------------------+
// | 1|2014-01-01 15:10:...|
// +---+--------------------+
Unlike min, it will work on any Orderable type.
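For example, a small sketch reusing the df built above: the same agg(min(...)) pattern also works on a string column, since min as an aggregate expression only needs an ordering on the column type.

// Strings are ordered lexicographically, so min is well defined for them too.
df.groupBy($"id").agg(min($"ts".cast("string"))).show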

