Python: How to extract an element from an array in PySpark

Question

Asked by AnmolDave

I have a data frame of the following form:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want my output to look like this:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to split its elements into separate columns. What needs to be done?

I saw many answers that use flatMap, but they add rows. I just want each element of the array placed in another column of the same row.

The following is my actual schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone please explain how to do this with both DataFrames and RDDs?

Answer 1

Answered by Psidom

Create sample data:

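from pyspark.sql import Row, SparkSession

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# One row whose col4 is an array column
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = spark.sparkContext.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()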
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract elements from the array column, as shown here. In your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#|  xx|  yy|  zz| 123| 234|
#+----+----+----+----+----+
