
Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43403903/

pyspark's "between" function: range search on timestamps is not inclusive

Tags: python, datetime, range, pyspark, between

Asked by Vinay Kolar

pyspark's 'between' function is not inclusive for timestamp input.

For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e., it omits the row with the '2017-04-14 00:00:00' timestamp.

However, the documentation seems to hint that it is inclusive (though it makes no mention of timestamps).

Of course, one way is to add a microsecond to the upper bound and pass that to the function (a rough sketch of this appears after the example below), but that is not a great fix. Is there a clean way of doing an inclusive search?

Example:

import pandas as pd
from pyspark.sql import functions as F
# ... sql_context creation ...
test_pd=pd.DataFrame([{"start":'2017-04-13 12:00:00', "value":1.0},{"start":'2017-04-14 00:00:00', "value":1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+
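
For reference, the microsecond workaround mentioned above might look roughly like this (a sketch only, against the test_df above; it pads the string upper bound so the boundary row is not dropped, and the exact behaviour of string bounds can vary between Spark versions):

test_df.filter(
    # upper bound padded by one microsecond so '2017-04-14 00:00:00' is kept
    F.col("start").between('2017-04-13', '2017-04-14 00:00:00.000001')
).show()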

Answered by Vinay Kolar

Found out the answer: pyspark's "between" function is inconsistent in how it handles timestamp inputs.

  1. If you provide the input in string format without a time, it performs an exclusive search (not what we expect from the documentation mentioned above).
  2. If you provide the input as a datetime object or with an exact time (e.g., '2017-04-14 00:00:00'), it performs an inclusive search.
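
Presumably (per point 2), plain datetime.datetime objects from the standard library behave the same way as the pandas timestamps used below; a minimal sketch against the test_df above:

from datetime import datetime

# datetime objects are sent to Spark as timestamp literals, so the comparison
# is done on timestamps and the '2017-04-14 00:00:00' boundary row is kept
test_df.filter(
    F.col("start").between(datetime(2017, 4, 13), datetime(2017, 4, 14))
).show()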

For the above example, here is the output for the inclusive search (using pd.to_datetime):

test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

Similarly, if we provide both the date AND time in string format, it seems to perform an inclusive search:

test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
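
If you would rather not depend on how "between" interprets its arguments at all, you can spell out the equivalent inclusive condition yourself with explicit timestamp casts; a sketch against the test_df above:

# cast the string bounds to timestamps explicitly so the comparison happens on
# timestamps rather than strings; '2017-04-14' casts to 2017-04-14 00:00:00 and
# the <= keeps the boundary row
test_df.filter(
    (F.col("start") >= F.lit('2017-04-13').cast('timestamp')) &
    (F.col("start") <= F.lit('2017-04-14').cast('timestamp'))
).show()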