
Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43403903/

pyspark's "between" function: range search on timestamps is not inclusive

Tags: python, datetime, range, pyspark, between

Asked by Vinay Kolar

pyspark's 'between' function is not inclusive for timestamp input.

For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e., it omits the row with the '2017-04-14 00:00:00' timestamp.

However, the documentation seems to hint that it is inclusive (though it makes no mention of timestamps).

Of course, one way is to add a microsecond to the upper bound and pass that to the function (a rough sketch of this appears after the example below), but that is not a great fix. Is there a clean way of doing an inclusive search?

Example:

import pandas as pd
from pyspark.sql import functions as F
# ... sql_context creation ...
test_pd=pd.DataFrame([{"start":'2017-04-13 12:00:00', "value":1.0},{"start":'2017-04-14 00:00:00', "value":1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+
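
For reference, the microsecond workaround mentioned above might look roughly like this (a sketch only, against the test_df above; it pads the string upper bound so the boundary row is not dropped, and the exact behaviour of string bounds can vary between Spark versions):

test_df.filter(
    # upper bound padded by one microsecond so '2017-04-14 00:00:00' is kept
    F.col("start").between('2017-04-13', '2017-04-14 00:00:00.000001')
).show()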

Answered by Vinay Kolar

Found out the answer: pyspark's "between" function is inconsistent in how it handles timestamp inputs.

  1. If you provide the input in string format without a time, it performs an exclusive search (not what we expect from the documentation mentioned above).
  2. If you provide the input as a datetime object or with an exact time (e.g., '2017-04-14 00:00:00'), it performs an inclusive search.
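
Presumably (per point 2), plain datetime.datetime objects from the standard library behave the same way as the pandas timestamps used below; a minimal sketch against the test_df above:

from datetime import datetime

# datetime objects are sent to Spark as timestamp literals, so the comparison
# is done on timestamps and the '2017-04-14 00:00:00' boundary row is kept
test_df.filter(
    F.col("start").between(datetime(2017, 4, 13), datetime(2017, 4, 14))
).show()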

For the above example, here is the output for the inclusive search (using pd.to_datetime):

test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

Similarly, if we provide both the date AND time in string format, it seems to perform an inclusive search:

test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
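
If you would rather not depend on how "between" interprets its arguments at all, you can spell out the equivalent inclusive condition yourself with explicit timestamp casts; a sketch against the test_df above:

# cast the string bounds to timestamps explicitly so the comparison happens on
# timestamps rather than strings; '2017-04-14' casts to 2017-04-14 00:00:00 and
# the <= keeps the boundary row
test_df.filter(
    (F.col("start") >= F.lit('2017-04-13').cast('timestamp')) &
    (F.col("start") <= F.lit('2017-04-14').cast('timestamp'))
).show()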