
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/44395873/

Date: 2020-10-22 09:17:25 · Source: igfitidea

Apache Spark how to append new column from list/array to Spark dataframe

Tags: scala, apache-spark, dataframe, apache-spark-sql

Asked by Stefan Repcek

I am using the Apache Spark 2.0 Dataframe/Dataset API. I want to add a new column to my dataframe from a list of values. The list has the same number of values as the given dataframe.


val list = List(4,5,10,7,2)
val df   = List("a","b","c","d","e").toDF("row1")

I would like to do something like:


val appendedDF = df.withColumn("row2", somefunc(list))
appendedDF.show()
// +----+----+
// |row1|row2|
// +----+----+
// |   a|   4|
// |   b|   5|
// |   c|  10|
// |   d|   7|
// |   e|   2|
// +----+----+

I would be grateful for any ideas; my dataframe in reality contains more columns.


Answered by Psidom

You could do it like this:


import org.apache.spark.sql.Row
import org.apache.spark.sql.types._    

// create an RDD from the list; note that RDD.zip requires both RDDs to have
// the same number of partitions and the same number of elements per partition,
// so match the dataframe's partitioning
val rdd = sc.parallelize(List(4,5,10,7,2), df.rdd.getNumPartitions)
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28

// zip the data frame with rdd
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32

// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
|   a|      4|
|   b|      5|
|   c|     10|
|   d|      7|
|   e|      2|
+----+-------+
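If you would rather avoid the strict partition-alignment requirement of `RDD.zip`, an alternative often suggested for this problem is to index both sides by position with `zipWithIndex` and join on that index. The sketch below assumes the small example data from the question; the `idx` column name and the local `SparkSession` setup are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

val spark = SparkSession.builder.appName("append-list").master("local[*]").getOrCreate()
import spark.implicits._

val list = List(4, 5, 10, 7, 2)
val df   = List("a", "b", "c", "d", "e").toDF("row1")

// tag each dataframe row with its positional index
val dfWithIdx = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("idx", LongType))
)

// tag the list values the same way, then join on the shared position
val listDF = list.zipWithIndex.map { case (v, i) => (i.toLong, v) }.toDF("idx", "row2")

val appendedDF = dfWithIdx.join(listDF, "idx").orderBy("idx").drop("idx")
appendedDF.show()
```

Unlike `zip`, the join does not care how the two sides are partitioned, at the cost of a shuffle; for small data the difference is negligible.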

Answered by Tzach Zohar

Adding for completeness: the fact that the input list (which exists in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with - so you might consider collect()-ing it, zipping with list, and converting back into a DataFrame if needed:


df.collect()                      // bring all rows to the driver
  .map(_.getAs[String]("row1"))   // extract the existing column's values
  .zip(list).toList               // pair each value with its list element
  .toDF("row1", "row2")           // requires spark.implicits._ in scope

That won't be faster, but if the data is really small it might be negligible and the code is (arguably) clearer.
