Notice: this page is a side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/41184116/

Date: 2020-08-20 00:31:32  Source: igfitidea

Convert PySpark dataframe column type to string and replace the square brackets

Tags: python, pyspark, apache-spark-sql

Asked by ben

I need to convert a PySpark DataFrame column from array type to string and also remove the square brackets. This is the schema of the DataFrame. The columns that need to be processed are CurrencyCode and TicketAmount.

>>> plan_queryDF.printSchema()
root
 |-- event_type: string (nullable = true)
 |-- publishedDate: string (nullable = true)
 |-- plannedCustomerChoiceID: string (nullable = true)
 |-- assortedCustomerChoiceID: string (nullable = true)
 |-- CurrencyCode: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TicketAmount: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentPlan: boolean (nullable = true)
 |-- originalPlan: boolean (nullable = true)
 |-- globalId: string (nullable = true)
 |-- PlanJsonData: string (nullable = true)

Sample data from the DataFrame:

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|       [GBP]|         [0]|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|       [CNY]|       [329]|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|       [JPY]|      [3400]|       true|       false|000576058003|{"httpStatus":200...|

How can I do it? Currently I cast the columns to string and then remove the square brackets with regexp_replace, but this approach fails when I process a huge amount of data.

Is there any other way I can do it?

This is what I want.

+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|          event_type|       publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan|    globalId|        PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...|   087d1ff1-5f3a-496...|    2539cc4a-37e5-4f3...|         GBP|           0|      false|       false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...|   0a1af217-d1e8-4ab...|    61bc5fda-0160-484...|         CNY|         329|       true|       false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...|   1028b477-f93e-47f...|    c6d5b761-94f2-454...|         JPY|        3400|       true|       false|000576058003|{"httpStatus":200...|

Answered by Daniel de Paula

You can try getItem(0):

df \
    .withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
    .withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string")) 

The final cast to string is optional, since the schema above shows that the array elements are already strings.
