Python PySpark: Creating a Timestamp Column
Original question: http://stackoverflow.com/questions/45469438/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pyspark Creating timestamp column
Asked by Naveen Srikanth
I am using Spark 2.1.0 and I am not able to create a timestamp column in PySpark. I am using the code snippet below:
df=df.withColumn('Age',lit(datetime.now()))
I am getting:
AssertionError: col should be Column
Please help.
Answered by balalaika
I am not sure about 2.1.0, but on 2.2.1 at least you can simply do:
from pyspark.sql import functions as F
df.withColumn('Age', F.current_timestamp())
Hope it helps!
Answered by Ankush Singh
Assuming you have the DataFrame from your code snippet and you want the same timestamp for all rows.
Let me create a dummy DataFrame.
>>> data = [{'name': 'Alice', 'age': 1}, {'name': 'Again', 'age': 2}]
>>> df = spark.createDataFrame(data)
>>> import time
>>> import datetime
>>> timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
>>> type(timestamp)
<class 'str'>
>>> from pyspark.sql.functions import lit,unix_timestamp
>>> timestamp
'2017-08-02 16:16:14'
>>> new_df = df.withColumn('time',unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
>>> new_df.show(truncate = False)
+---+-----+---------------------+
|age|name |time                 |
+---+-----+---------------------+
|1  |Alice|2017-08-02 16:16:14.0|
|2  |Again|2017-08-02 16:16:14.0|
+---+-----+---------------------+
>>> new_df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- time: timestamp (nullable = true)
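The answer above formats the time in Python with '%Y-%m-%d %H:%M:%S' and then parses it in Spark with 'yyyy-MM-dd HH:mm:ss', so the two patterns have to describe the same layout. Below is a minimal pure-Python sketch of the Python side only. It uses a fixed datetime (the value shown in the output above) instead of the current time so that the result is reproducible. The names PY_FORMAT and fixed_time are assumptions for illustration.

```python
import datetime

# Python-side format; it mirrors the Spark pattern 'yyyy-MM-dd HH:mm:ss'
# that the answer passes to unix_timestamp.
PY_FORMAT = "%Y-%m-%d %H:%M:%S"

# Fixed instant (assumed for illustration) instead of the current time,
# so the output is stable from run to run.
fixed_time = datetime.datetime(2017, 8, 2, 16, 16, 14)

# Format to the string that the answer wraps in lit(...).
timestamp = fixed_time.strftime(PY_FORMAT)
print(timestamp)  # 2017-08-02 16:16:14

# Parsing the string back with the same format recovers the original value
# here because fixed_time has whole seconds; with datetime.datetime.now()
# the microseconds would be dropped by this format.
round_trip = datetime.datetime.strptime(timestamp, PY_FORMAT)
print(round_trip == fixed_time)  # True
```

Because the format has second precision, every row receives a timestamp truncated to the second, which matches the '.0' fractional part in the output above.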
Answered by Nikhil Gupta
Adding on to balalaika's answer: if someone, like me, just wants to add the date without the time, they can use the code below:
from pyspark.sql import functions as F
df.withColumn('Age', F.current_date())
Hope this helps.