scala - how to get the difference in months and years between two dates in Spark SQL
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46304245/
How to get the months/years difference between two dates in Spark SQL
Asked by Kumar
I am getting the error:
org.apache.spark.sql.AnalysisException: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do to resolve the above error?
I have tried the following code, but it returns the output in days and I need it in years:
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise whether I can use a function like to_date to get the difference in years.
Accepted answer by hagarwal
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in table "c4". That is why the AnalysisException is thrown: the query cannot resolve a column named "year". Unlike SQL Server's DATEDIFF, Spark SQL's datediff takes only two date arguments and always returns the difference in days; it does not accept a unit argument such as year.
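For reference, a minimal sketch of how the original query could be rewritten without the invalid year argument, assuming Spark 1.5+ where months_between is available as a SQL function (dividing the month difference by 12 and casting to int gives whole years; the yearDiff alias is illustrative):

val r = sqlContext.sql(
  "select id, cast(months_between(to_date(end), to_date(start)) / 12 as int) AS yearDiff from c4")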
Using a Spark user-defined function (UDF) is a more robust approach, as sketched below.
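A minimal sketch of such a UDF, assuming Spark 2.x and Java 8's java.time API; the column names start and end and the yyyy-MM-dd format come from the question:

import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{col, udf}

// Whole years elapsed between two ISO dates: 2012-07-21 -> 2014-04-09 gives 1.
val yearsBetween = udf { (start: String, end: String) =>
  ChronoUnit.YEARS.between(LocalDate.parse(start), LocalDate.parse(end))
}

c3.withColumn("yearDiff", yearsBetween(col("start"), col("end"))).show()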
Answer by Rishikesh Teke
Another simple way is to cast the strings to DateType in Spark SQL and apply the SQL date and time functions on the columns, like the following:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Cast the string columns to DateType, then use the built-in date functions.
val c4 = c3.select(col("id"), col("start").cast(DateType), col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"), col("start")))
  .withColumn("monthDifference", months_between(col("end"), col("start")))
  .withColumn("yearDifference", year(col("end")) - year(col("start")))
  .show()
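Note that year(col("end")) - year(col("start")) counts calendar-year boundaries crossed rather than full elapsed years, so for example 2018-12-31 to 2019-01-01 yields 1; the next answer addresses this.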
Answer by Kumar
One of the above answers doesn't return the right year when the two dates are less than 365 days apart. The example below produces the correct year and rounds the month and year values to 2 decimal places.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
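Note that months_between returns a whole number when both dates fall on the same day of the month (or both on the last day of their months); otherwise the fractional part is computed assuming a 31-day month, which is why dividing by 12 gives a fractional number of years.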
You can find a complete working example at the URL below:
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy learning!
Answer by Franzi
Since datediff only returns the difference in days, I prefer to use my own UDF.
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

// Difference between two timestamps in the given unit. Note that Instant only
// supports time-based units up to ChronoUnit.DAYS; date-based units such as
// MONTHS or YEARS would throw an UnsupportedTemporalTypeException here.
def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long =
  chronoUnit.between(
    Instant.ofEpochMilli(dateA.getTime),
    Instant.ofEpochMilli(dateB.getTime)
  )

// Wraps timeDiff in a UDF and adds the result as a new column.
def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
  val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
  df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a DataFrame transformation:
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
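A hypothetical driver for the transformation above (the column names and values are illustrative, assuming a Spark 2.x SparkSession named spark):

import java.sql.Timestamp
import java.time.temporal.ChronoUnit
import spark.implicits._

val df = Seq(
  (Timestamp.valueOf("2019-07-23 23:00:00"), Timestamp.valueOf("2019-07-24 07:30:00"))
).toDF("sleepTime", "wakeupTime")

df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES)).show()
// the "minutes" column contains 510 (i.e. 8.5 hours)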

