pyspark's "between" function: range search on timestamps is not inclusive
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/43403903/
Asked by Vinay Kolar
pyspark's 'between' function is not inclusive for timestamp input.
For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e., it omits rows whose timestamp is '2017-04-14 00:00:00'.
However, the documentation seems to hint that it is inclusive (though it makes no mention of timestamps).
Of course, one way is to add a microsecond to the upper bound and pass it to the function. However, that is not a great fix. Is there a clean way of doing an inclusive search?
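The microsecond workaround mentioned above can be sketched in plain Python. This is a minimal illustration, not a recommended fix; `nudge_upper_bound` is a hypothetical helper name that does not appear in the original post.

```python
from datetime import datetime, timedelta


def nudge_upper_bound(upper: datetime) -> datetime:
    """Return the upper bound pushed forward by one microsecond.

    Hypothetical helper illustrating the workaround from the question.
    """
    return upper + timedelta(microseconds=1)


upper = nudge_upper_bound(datetime(2017, 4, 14))
print(upper)  # 2017-04-14 00:00:00.000001
```

Note that the nudged bound itself lies just after midnight, so a row stamped at exactly that instant could also match.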
Example:
import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd = pd.DataFrame([{"start": '2017-04-13 12:00:00', "value": 1.0}, {"start": '2017-04-14 00:00:00', "value": 1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()
+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()
+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+
Answered by Vinay Kolar
Found out the answer. pyspark's "between" function is inconsistent in handling timestamp inputs.
- If you provide the input in string format without a time, it performs an exclusive search (not what we expect from the documentation linked above).
- If you provide the input as a datetime object or with an exact time (e.g., '2017-04-14 00:00:00'), then it performs an inclusive search.
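Following the second case above, one option is to turn date-only strings into datetime objects before handing them to `between`. The sketch below is plain Python; `as_datetime_bound` is a hypothetical helper name, and the commented-out filter call assumes the `test_df` and `F` from the question.

```python
from datetime import datetime


def as_datetime_bound(value):
    """Turn a date-only or date-and-time string into a datetime object.

    Hypothetical helper; datetime objects pass through unchanged.
    """
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


lower = as_datetime_bound('2017-04-13')
upper = as_datetime_bound('2017-04-14')
print(lower, upper)

# Usage with the question's test_df (not run here, needs a Spark session):
# test_df.filter(F.col("start").between(lower, upper)).show()
```

A date-only string becomes midnight of that day, which matches the exact-time form that the answer reports as inclusive.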
For the above example, here is the output of the inclusive search (using pd.to_datetime):
test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()
+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:
test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()
+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
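One reading that would be consistent with the outputs above is that a string bound is compared to the timestamp as text rather than as a point in time. This is an assumption, not something the answer states, and the behavior may differ across Spark versions. The plain-Python model below (`between_as_text` is a hypothetical name) reproduces the two string cases under that assumption.

```python
def between_as_text(value: str, lower: str, upper: str) -> bool:
    """Inclusive range check using plain string (lexicographic) comparison.

    Assumed model of the observed behavior, not Spark's actual code.
    """
    return lower <= value <= upper


rows = ['2017-04-13 12:00:00', '2017-04-14 00:00:00']

# Date-only bounds: '2017-04-14 00:00:00' sorts after its own prefix '2017-04-14'.
date_only = [r for r in rows if between_as_text(r, '2017-04-13', '2017-04-14')]
print(date_only)  # ['2017-04-13 12:00:00']

# Bounds with an exact time: both rows fall inside the range.
full = [r for r in rows if between_as_text(r, '2017-04-13 12:00:00', '2017-04-14 00:00:00')]
print(full)  # ['2017-04-13 12:00:00', '2017-04-14 00:00:00']
```

Under this model the comparison is still inclusive; the midnight row drops out only because the longer string sorts after the date-only bound.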