Scala: find the minimum of a timestamp through a Spark groupBy on a dataframe

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/36427212/


Find minimum for a timestamp through Spark groupBy dataframe

sql, scala, apache-spark, apache-spark-sql

Asked by Jake Fund

When I group my dataframe on a column and then try to find the minimum for each group with groupbyDatafram.min('timestampCol'), it appears I cannot do it on non-numeric columns. How can I properly filter for the minimum (earliest) date in the groupBy?

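(For context: the min helper on a grouped dataframe only aggregates numeric columns, which is why the call fails. A minimal sketch of the failing pattern, with df standing in for the asker's dataframe and hypothetical column names:)

// Hypothetical sketch: the numeric-only min(colName) helper rejects timestamp columns.
df.groupBy("id").min("timestampCol")
// typically fails with an AnalysisException along the lines of:
// "timestampCol" is not a numeric column. Aggregation function can only be applied on a numeric column.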

I am streaming the dataframe from a postgresql S3 instance, so that data is already configured.

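(Not part of the original question, but for readers who want to reproduce the setup: a minimal sketch of loading such a table from PostgreSQL over JDBC with the SparkSession API. The host, database, table name, and credentials below are placeholders, not values from the original post.)

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("min-timestamp-example")
  .master("local[*]")
  .getOrCreate()

// Connection details are placeholders; substitute your own.
val df: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://<host>:5432/<database>")
  .option("dbtable", "events")
  .option("user", "<user>")
  .option("password", "<password>")
  .load()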

Answered by zero323

Just perform aggregation directly instead of using the min helper:


import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???  // your existing SQLContext

import sqlContext.implicits._

val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

df.groupBy($"id").agg(min($"ts")).show

// +---+--------------------+
// | id|             min(ts)|
// +---+--------------------+
// |  1|2014-01-01 15:10:...|
// +---+--------------------+
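(The answer targets the Spark 1.x SQLContext API. On Spark 2.x and later, the same aggregation can be written against a SparkSession; a sketch, not part of the original answer:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

// agg(min(...)) accepts timestamp columns, unlike the numeric-only min(colName) helper.
df.groupBy($"id").agg(min($"ts").alias("earliest_ts")).show(truncate = false)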

Unlike min, it will work on any Orderable type.

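(To illustrate the Orderable point, a small sketch that is not from the original answer: the same agg(min(...)) pattern on a string column, reusing the implicits imported above. Strings compare lexicographically.)

import org.apache.spark.sql.functions.min

val people = Seq((1L, "Charlie"), (1L, "Alice"), (2L, "Bob")).toDF("id", "name")

people.groupBy($"id").agg(min($"name")).show()

// +---+---------+
// | id|min(name)|
// +---+---------+
// |  1|    Alice|
// |  2|      Bob|
// +---+---------+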