scala - Spark SQL is not converting timezone correctly

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35761586/

Spark SQL is not converting timezone correctly

scala, apache-spark, hive, timezone

Asked by Gaurav Shah

Using Scala 2.10.4 with Spark 1.5.1 and Spark 1.6:

sqlContext.sql(
  """
    |select id,
    |to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')),
    |from_utc_timestamp(from_unixtime(at), 'US/Pacific'),
    |from_unixtime(at),
    |to_date(from_unixtime(at)),
    | at
    |from events
    | limit 100
  """.stripMargin).collect().foreach(println)

Spark-submit options: --driver-java-options '-Duser.timezone=US/Pacific'

Result:

[56d2a9573bc4b5c38453eae7,2016-02-28,2016-02-27 16:01:27.0,2016-02-28 08:01:27,2016-02-28,1456646487]
[56d2aa1bfd2460183a571762,2016-02-28,2016-02-27 16:04:43.0,2016-02-28 08:04:43,2016-02-28,1456646683]
[56d2aaa9eb63bbb63456d5b5,2016-02-28,2016-02-27 16:07:05.0,2016-02-28 08:07:05,2016-02-28,1456646825]
[56d2aab15a21fa5f4c4f42a7,2016-02-28,2016-02-27 16:07:13.0,2016-02-28 08:07:13,2016-02-28,1456646833]
[56d2aac8aeeee48b74531af0,2016-02-28,2016-02-27 16:07:36.0,2016-02-28 08:07:36,2016-02-28,1456646856]
[56d2ab1d87fd3f4f72567788,2016-02-28,2016-02-27 16:09:01.0,2016-02-28 08:09:01,2016-02-28,1456646941]

The time in US/Pacific should be 2016-02-28 00:01:27, etc., but somehow it subtracts 8 hours twice.
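
For reference, here is a quick check outside Spark (plain Scala with java.time) of what the epoch value from the first result row should render as; the values are taken from the output above:

import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val at = 1456646487L  // epoch seconds from the first result row
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// 2016-02-28 08:01:27 when rendered in UTC ...
println(fmt.format(Instant.ofEpochSecond(at).atZone(ZoneId.of("UTC"))))
// ... and 2016-02-28 00:01:27 when rendered in US/Pacific (PST, UTC-8).
println(fmt.format(Instant.ofEpochSecond(at).atZone(ZoneId.of("US/Pacific"))))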

Answered by Gaurav Shah

After reading around for some time, the conclusions are as follows:

  • Spark SQL does not support date-time, nor time zones
  • Using timestamps is the only solution
  • from_unixtime(at) parses the epoch time correctly; it is only printing it as a string that changes it because of the time zone. It is safe to assume that from_unixtime will convert it correctly (although printing it might show different results)
  • from_utc_timestamp will shift (not just convert) the timestamp to that time zone; in this case it subtracts 8 hours from the time, since the offset is (-08:00)
  • printing SQL results messes up the times with respect to the timezone parameter (see the sketch after this list)
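
A minimal sketch (plain Scala with java.time, no Spark involved) of how the last two points combine to produce the doubly-shifted values in the output above; the 8-hour figures assume the PST offset of -08:00 described in the answer:

import java.time.{Duration, Instant, ZoneId}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val instant = Instant.ofEpochSecond(1456646487L)  // 2016-02-28 08:01:27 UTC

// Shift of the underlying instant, as from_utc_timestamp does: minus 8 hours.
val shifted = instant.minus(Duration.ofHours(8))

// Printing at a driver running with -Duser.timezone=US/Pacific then renders the
// already-shifted instant in UTC-8 again, reproducing the value in the output.
println(fmt.format(shifted.atZone(ZoneId.of("US/Pacific"))))  // 2016-02-27 16:01:27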

Answered by Michel Lemay

For the record, here is how we convert Long values like that, using a UDF.

For our purposes, we are only interested in the date-string representation of the timestamp (in milliseconds since the epoch, in UTC).

val udfToDateUTC = udf((epochMilliUTC: Long) => {
  // Format the epoch-millisecond value as a yyyy-MM-dd string, explicitly in UTC.
  val dateFormatter = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(java.time.ZoneId.of("UTC"))
  dateFormatter.format(java.time.Instant.ofEpochMilli(epochMilliUTC))
})

This way, we control both the parsing and the rendering of the dates.
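
For illustration only, a hedged usage sketch of the UDF above on a DataFrame; the DataFrame name (events) and the epoch-millisecond column name (eventAtMillis) are assumptions, not part of the original answer:

import org.apache.spark.sql.functions.col

// Hypothetical: `events` is a DataFrame with a Long column of epoch milliseconds.
val withDate = events.withColumn("eventDateUtc", udfToDateUTC(col("eventAtMillis")))
withDate.select("eventAtMillis", "eventDateUtc").show(5, truncate = false)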