Python: How to calculate the date difference in PySpark?

Question

Asked by dlwlrma

I have data like this:

df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'), 
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))

I want to calculate the date difference between the low column and 2017-05-02, and replace the low column with that difference. I've tried related solutions on Stack Overflow, but none of them work.

Answer 1

Answered by mtoto

You need to cast the low column to a date, and then you can use datediff() in combination with lit(). With Spark 2.2 or later, to_date() accepts a format string:

from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date("low","yyyy/MM/dd"))).show()
+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+
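
As a sanity check on the day counts above, the same subtraction can be reproduced without Spark using only the standard library. This is a minimal sketch, not part of the original answer: the helper name days_until_reference is hypothetical, and the strptime pattern "%Y/%m/%d" is assumed to be the Python equivalent of the "yyyy/MM/dd" pattern passed to to_date().

```python
from datetime import date, datetime

# Reference date taken from the question.
REFERENCE = date(2017, 5, 2)


def days_until_reference(low: str, reference: date = REFERENCE) -> int:
    """Parse a 'yyyy/MM/dd' string and return (reference - low) in whole days.

    Hypothetical helper for illustration; it mirrors datediff(end, start),
    which counts days from start to end.
    """
    parsed = datetime.strptime(low, "%Y/%m/%d").date()
    return (reference - parsed).days


for low in ("1986/10/15", "1986/10/16"):
    print(low, days_until_reference(low))
# 1986/10/15 11157
# 1986/10/16 11156
```

Note the argument order: datediff(end, start) returns end minus start, so swapping the two arguments flips the sign of the result.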

With Spark versions before 2.2, we need to convert the low column to a timestamp first:

from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp('low', "yyyy/MM/dd").cast("timestamp")))).show()

Answer 2

Answered by Artem Zaika

Alternatively, here is how to find the number of days between a user's consecutive actions in PySpark:

import pyspark.sql.functions as funcs
from pyspark.sql.window import Window

window = Window.partitionBy('user_id').orderBy('action_date')

df = df.withColumn("days_passed", funcs.datediff(df.action_date, 
                                  funcs.lag(df.action_date, 1).over(window)))



+-------+-----------+-----------+
|user_id|action_date|days_passed|
+-------+-----------+-----------+
|    623| 2015-10-21|       null|
|    623| 2015-11-19|         29|
|    623| 2016-01-13|         55|
|    623| 2016-01-21|          8|
|    623| 2016-03-24|         63|
+-------+-----------+-----------+
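
The window logic above (partition by user, order by date, subtract the previous row's date) can be sketched in plain Python to make explicit what lag() and datediff() compute together. This is an illustrative sketch only: the function name days_passed and the tuple layout are assumptions, and the sample dates are copied from the output table.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def days_passed(rows):
    """Return (user_id, action_date, gap_in_days) for each row.

    rows is an iterable of (user_id, action_date) tuples holding date objects.
    The gap is None for a user's first action, matching the null that
    lag() produces for the first row of each partition.
    """
    result = []
    # Sorting by (user_id, action_date) plays the role of
    # Window.partitionBy('user_id').orderBy('action_date').
    ordered = sorted(rows, key=itemgetter(0, 1))
    for user_id, group in groupby(ordered, key=itemgetter(0)):
        previous = None
        for _, action_date in group:
            gap = None if previous is None else (action_date - previous).days
            result.append((user_id, action_date, gap))
            previous = action_date
    return result


actions = [
    (623, date(2016, 1, 21)),
    (623, date(2015, 10, 21)),
    (623, date(2016, 3, 24)),
    (623, date(2015, 11, 19)),
    (623, date(2016, 1, 13)),
]

for user_id, action_date, gap in days_passed(actions):
    print(user_id, action_date, gap)
```

The input is deliberately unsorted to show that the ordering step, not the input order, determines which row counts as the previous action.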
