Spark SQL Window Functions - rangeBetween dates

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33207164/

Spark Window Functions - rangeBetween dates

sql, apache-spark, pyspark, apache-spark-sql, window-functions

Asked by Nhor

I have a Spark SQL DataFrame with data, and I'm trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out I need to use a Window Function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

And here comes the problem. I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:

.rowsBetween(-sys.maxsize, 0)

but would like to achieve something like:

.rangeBetween("7 days", 0)

If anyone could help me with this one, I'd be very grateful. Thanks in advance!

Answered by zero323

Spark >= 2.3

Since Spark 2.3 it is possible to use interval objects through the SQL API, but DataFrame API support is still a work in progress.

df.createOrReplaceTempView("df")

spark.sql(
    """SELECT *, mean(some_value) OVER (
        PARTITION BY id 
        ORDER BY CAST(start AS timestamp) 
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS mean FROM df""").show()

## +---+----------+----------+------------------+       
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+
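
Until native DataFrame support lands, a minimal sketch of a workaround (my addition, not part of the original answer) is to embed the windowed aggregate as a SQL expression via expr, which goes through the same parser as the query above:

from pyspark.sql.functions import expr

# Same interval frame as the SQL query, expressed from the DataFrame side
df.withColumn("mean", expr("""
    mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    )
""")).show()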

Spark < 2.3

As far as I know it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming the start column contains date type:

from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
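
As a side sketch (my addition): with a SparkSession named spark in scope, the same DataFrame can be built directly, without the RDD detour:

from pyspark.sql.functions import col

# Equivalent setup via createDataFrame instead of sc.parallelize
df = spark.createDataFrame([
    (1, "2015-01-01", 20.0),
    (1, "2015-01-06", 10.0),
    (1, "2015-01-07", 25.0),
    (1, "2015-01-12", 30.0),
    (2, "2015-01-01", 5.0),
    (2, "2015-01-03", 30.0),
    (2, "2015-02-01", 20.0)
], ["id", "start", "some_value"]).withColumn("start", col("start").cast("date"))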

A small helper:

from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col


# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400 
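
A quick sanity check on the helper (my addition): a date cast to timestamp falls on midnight, so whole-day offsets expressed in seconds line up with calendar-day boundaries (daylight-saving shifts aside):

# 7 days expressed as the seconds used for the range frame below
days(7)
## 604800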

Finally, the window definition and the query:

w = (Window()
   .partitionBy(col("id"))
   .orderBy(col("start").cast("timestamp").cast("long"))
   .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+

Far from pretty but works.

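One closing note that is not in the original answer: since Spark 2.1 the frame bounds can be written with the named constants on Window, which avoids magic numbers such as the -sys.maxsize used in the question; a minimal sketch:

from pyspark.sql.window import Window

# Row-based frame from the first row of the partition up to the current row
w_all = (Window
    .partitionBy("id")
    .orderBy("start")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow))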

* Hive Language Manual, Types
