Spark Scala: DateDiff of two columns by hour or minute

Note: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37058016/

scala, apache-spark

Asked by mt88

I have two timestamp columns in a dataframe and I'd like to get the difference between them in minutes or, alternatively, in hours. Currently I'm able to get the day difference, with rounding, by doing:

import org.apache.spark.sql.functions.datediff
val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))

However, when I looked at the doc page https://issues.apache.org/jira/browse/SPARK-8185 I didn't see any extra parameters to change the unit. Is there a different function I should be using for this?

Answered by Daniel de Paula

You can get the difference in seconds by casting both columns to long (epoch seconds) and subtracting:

import org.apache.spark.sql.functions._

// Casting a timestamp to long yields whole epoch seconds
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")

Then you can do some math to get the unit you want. For example:

val df2 = df1
  .withColumn( "diff_secs", diff_secs_col )
  .withColumn( "diff_mins", diff_secs_col / 60D )
  .withColumn( "diff_hrs",  diff_secs_col / 3600D )
  .withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

Or, in pyspark:

from pyspark.sql.functions import col

diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")

df2 = df1 \
  .withColumn("diff_secs", diff_secs_col) \
  .withColumn("diff_mins", diff_secs_col / 60.0) \
  .withColumn("diff_hrs",  diff_secs_col / 3600.0) \
  .withColumn("diff_days", diff_secs_col / (24.0 * 3600.0))

Answered by Jeremy

The answer given by Daniel de Paula works, but that solution does not work in the case where the difference is needed for every row in your table. Here is a solution that will do that for each row:

val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2)) / 3600 AS diff_hrs")

This first converts the data in the columns to a Unix timestamp in seconds, subtracts the two, and then converts the difference to hours.
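The same hour difference can also be written with the column-based functions API instead of a SQL string (a sketch; the diff_hrs alias is an assumption for readability):

import org.apache.spark.sql.functions.{col, unix_timestamp}

val df2 = df1.select(
  ((unix_timestamp(col("ts1")) - unix_timestamp(col("ts2"))) / 3600D).alias("diff_hrs")
)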

A useful list of functions can be found at: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$
