Scala: how to get the difference in months and years between two dates in Spark SQL

Disclaimer: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46304245/


how to get months, years difference between two dates in sparksql

scala, apache-spark, apache-spark-sql

Asked by Kumar

I am getting the error:


org.apache.spark.sql.AnalysisException: cannot resolve 'year'

My input data:


1,2012-07-21,2014-04-09

My code:


val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class c(id: Int, start: String, end: String)

val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r => c(r(0).toInt, r(1).toString, r(2).toString))
val c3 = c2.toDF()
c3.registerTempTable("c4")
// This line throws the AnalysisException:
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")

What can I do to resolve the above error?


I have tried the following code, but it returns the output in days, and I need it in years:


val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")

Please advise whether I can use a function like to_date to get the difference in years.


Accepted answer by hagarwal

val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")

In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot resolve a "year" column. Unlike SQL Server's DATEDIFF, Spark SQL's datediff takes exactly two date arguments and always returns the difference in days; it has no unit parameter.


Using a Spark user-defined function (UDF) would be a more robust approach.

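For example, a minimal sketch of such a UDF (the name yearsBetween is hypothetical, and it assumes the start and end columns hold ISO yyyy-MM-dd strings, as in the sample data):

import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Register a UDF returning the number of whole years between two date strings,
// then call it from SQL just like a built-in function.
sqlContext.udf.register("yearsBetween", (start: String, end: String) =>
  ChronoUnit.YEARS.between(LocalDate.parse(start), LocalDate.parse(end)))

val r = sqlContext.sql("select id, yearsBetween(start, end) AS years from c4")

For the sample row (2012-07-21 to 2014-04-09) this would return 1, since only one full year has elapsed between the two dates.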

Answer by Rishikesh Teke

Another simple way is to cast the strings to DateType in Spark SQL and apply the SQL date and time functions on the columns, as follows:


import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Cast the string columns to DateType
val c4 = c3.select(col("id"), col("start").cast(DateType), col("end").cast(DateType))

c4.withColumn("dateDifference", datediff(col("end"), col("start")))         // in days
  .withColumn("monthDifference", months_between(col("end"), col("start")))  // fractional months
  .withColumn("yearDifference", year(col("end")) - year(col("start")))      // calendar-year difference only
  .show()

Answer by Kumar

One of the above answers doesn't return the right year difference when the number of days between the two dates is less than 365. The example below gives the correct year value and rounds the month and year differences to 2 decimal places.


Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()

Outputs:


+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate|   endDate|datesDiff|  montsDiff|montsDiff_round|           yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24|       23| 0.74193548|           0.74| 0.06182795666666666|           0.06|
|2019-06-24|2019-07-24|       30|        1.0|            1.0| 0.08333333333333333|           0.08|
|2019-08-24|2019-07-24|      -31|       -1.0|           -1.0|-0.08333333333333333|          -0.08|
|2018-12-23|2019-07-24|      213| 7.03225806|           7.03|         0.586021505|           0.59|
|2018-07-20|2019-07-24|      369|12.12903226|          12.13|  1.0107526883333333|           1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+

You can find a complete working example at the URL below:


https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/


Hope this helps.


Happy Learning!!


Answer by Franzi

Since datediff only returns the difference in days, I prefer to use my own UDF.


import java.sql.Timestamp
import java.time.temporal.ChronoUnit

import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
    // Compare as LocalDateTime rather than Instant: Instant only supports
    // time-based units up to ChronoUnit.DAYS, so date-based units such as
    // MONTHS or YEARS would throw an UnsupportedTemporalTypeException.
    chronoUnit.between(dateA.toLocalDateTime, dateB.toLocalDateTime)
}

def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
    val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
    df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}

Then I call it as a DataFrame transformation:


df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
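
Applied back to the original question, a hypothetical adaptation (assuming the string columns are first cast to timestamps) might look like this:

// Cast the original string columns to timestamps, then reuse the transformation.
val withDates = c3
  .withColumn("start", col("start").cast("timestamp"))
  .withColumn("end", col("end").cast("timestamp"))

withDates.transform(withTimeDiff("start", "end", "years", ChronoUnit.YEARS)).show()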