Python: How to extract an element from an array in PySpark

Note: This content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, attribute it to the original authors (not me), and cite the original URL: http://stackoverflow.com/questions/45254928/

Time: 2020-08-19 16:52:29  Source: igfitidea

How to extract an element from an array in PySpark

Tags: python, apache-spark, pyspark, rdd

Asked by AnmolDave

I have a DataFrame of the following form:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want my output to have the following form:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to split its elements into separate columns. What needs to be done?

I saw many answers that use flatMap, but they add rows. I just want the array elements placed in additional columns of the same row.

The following is my actual schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone please explain how to do this with both DataFrames and RDDs?

Answered by Psidom

Create sample data:

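from pyspark.sql import Row, SparkSession

# In the pyspark shell `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# One sample row whose col4 is an array column
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = spark.sparkContext.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()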
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract elements from the array column as follows. In your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#|  xx|  yy|  zz| 123| 234|
#+----+----+----+----+----+