Original source: http://stackoverflow.com/questions/44020818/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to calculate date difference in pyspark?
Asked by dlwlrma
I have data like this:
df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))
I want to calculate the date difference between the low column and 2017-05-02, and replace the low column with that difference. I've tried related solutions on Stack Overflow, but none of them works.
Answered by mtoto
You need to cast the low column to a date, and then you can use datediff() in combination with lit(). Using Spark 2.2:
from pyspark.sql.functions import datediff, to_date, lit
df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date("low", "yyyy/MM/dd"))).show()
+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+
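As a sanity check on the values in the test column, the same differences can be reproduced in plain Python without a Spark session. This is only a sketch using the standard datetime module; the helper name days_between is made up for illustration and is not part of PySpark. Note that the Python format string "%Y/%m/%d" plays the role of the Spark pattern "yyyy/MM/dd".

```python
from datetime import date, datetime


def days_between(end, start, fmt="%Y/%m/%d"):
    """Hypothetical helper: whole days from `start` (a string in `fmt`) to `end` (a date)."""
    start_date = datetime.strptime(start, fmt).date()
    # Same argument order as datediff(end, start): positive when end is later.
    return (end - start_date).days


reference = date(2017, 5, 2)
for low in ("1986/10/15", "1986/10/16"):
    print(low, days_between(reference, low))
```

Like datediff(), the helper takes the end date first, so swapping the arguments would flip the sign of the result.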
With Spark versions earlier than 2.2, we need to convert the low column to a timestamp first:
from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp
df.withColumn("test",
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp('low', "yyyy/MM/dd").cast("timestamp")))).show()
Answered by Artem Zaika
Alternatively, here is how to find the number of days between a user's consecutive actions using PySpark:
import pyspark.sql.functions as funcs
from pyspark.sql.window import Window
# Partition by user and order by date so that lag() sees each user's previous action
window = Window.partitionBy('user_id').orderBy('action_date')
df = df.withColumn(
    "days_passed",
    funcs.datediff(df.action_date, funcs.lag(df.action_date, 1).over(window)))
+----------+-----------+-----------+
|   user_id|action_date|days_passed|
+----------+-----------+-----------+
|       623| 2015-10-21|       null|
|       623| 2015-11-19|         29|
|       623| 2016-01-13|         55|
|       623| 2016-01-21|          8|
|       623| 2016-03-24|         63|
+----------+-----------+-----------+
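To make clear what the window with lag() computes, here is a rough plain-Python equivalent: group the rows by user, sort each group by date, and subtract the previous date from the current one. This is a sketch only; the function name days_since_previous and the sample rows are invented for illustration and do not come from the answer above.

```python
from collections import defaultdict
from datetime import date


def days_since_previous(rows):
    """Hypothetical helper mirroring partitionBy(user).orderBy(date) + lag(date, 1).

    rows: iterable of (user_id, action_date) tuples.
    Returns a list of (user_id, action_date, days_passed) tuples.
    """
    by_user = defaultdict(list)
    for user_id, action_date in rows:
        by_user[user_id].append(action_date)

    result = []
    for user_id in sorted(by_user):
        previous = None
        for current in sorted(by_user[user_id]):
            # The first action of each user has no predecessor, like null in Spark.
            gap = None if previous is None else (current - previous).days
            result.append((user_id, current, gap))
            previous = current
    return result


# Made-up sample data, deliberately out of order.
sample_rows = [
    (1, date(2020, 3, 1)),
    (2, date(2020, 2, 28)),
    (1, date(2020, 1, 1)),
    (2, date(2020, 3, 1)),
    (1, date(2020, 1, 11)),
]
for row in days_since_previous(sample_rows):
    print(row)
```

The first row of every user gets None, which corresponds to the null that lag() produces when a partition has no earlier row.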