Scala: find the minimum of a timestamp through a Spark groupBy on a dataframe

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/36427212/


Find minimum for a timestamp through Spark groupBy dataframe

sql, scala, apache-spark, apache-spark-sql

Asked by Jake Fund

When I group my dataframe on a column and then try to find the minimum for each group with groupbyDatafram.min('timestampCol'), it appears I cannot do it on non-numeric columns. How can I properly filter for the minimum (earliest) date in the groupBy?

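(For context: the min helper on a grouped dataframe only aggregates numeric columns, which is why the call fails. A minimal sketch of the failing pattern, with df standing in for the asker's dataframe and hypothetical column names:)

// Hypothetical sketch: the numeric-only min(colName) helper rejects timestamp columns.
df.groupBy("id").min("timestampCol")
// typically fails with an AnalysisException along the lines of:
// "timestampCol" is not a numeric column. Aggregation function can only be applied on a numeric column.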

I am streaming the dataframe from a postgresql S3 instance, so that data is already configured.

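(Not part of the original question, but for readers who want to reproduce the setup: a minimal sketch of loading such a table from PostgreSQL over JDBC with the SparkSession API. The host, database, table name, and credentials below are placeholders, not values from the original post.)

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("min-timestamp-example")
  .master("local[*]")
  .getOrCreate()

// Connection details are placeholders; substitute your own.
val df: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://<host>:5432/<database>")
  .option("dbtable", "events")
  .option("user", "<user>")
  .option("password", "<password>")
  .load()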

Answered by zero323

Just perform aggregation directly instead of using the min helper:


import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.min

val sqlContext: SQLContext = ???  // your existing SQLContext

import sqlContext.implicits._

val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

df.groupBy($"id").agg(min($"ts")).show

// +---+--------------------+
// | id|             min(ts)|
// +---+--------------------+
// |  1|2014-01-01 15:10:...|
// +---+--------------------+
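(The answer targets the Spark 1.x SQLContext API. On Spark 2.x and later, the same aggregation can be written against a SparkSession; a sketch, not part of the original answer:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "2016-04-05 15:10:00"), (1L, "2014-01-01 15:10:00"))
  .toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))

// agg(min(...)) accepts timestamp columns, unlike the numeric-only min(colName) helper.
df.groupBy($"id").agg(min($"ts").alias("earliest_ts")).show(truncate = false)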

Unlike min, it will work on any Orderable type.

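(To illustrate the Orderable point, a small sketch that is not from the original answer: the same agg(min(...)) pattern on a string column, reusing the implicits imported above. Strings compare lexicographically.)

import org.apache.spark.sql.functions.min

val people = Seq((1L, "Charlie"), (1L, "Alice"), (2L, "Bob")).toDF("id", "name")

people.groupBy($"id").agg(min($"name")).show()

// +---+---------+
// | id|min(name)|
// +---+---------+
// |  1|    Alice|
// |  2|      Bob|
// +---+---------+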