How to create a DataFrame from multiple arrays in Spark Scala?

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original Stack Overflow URL: http://stackoverflow.com/questions/37153482/

Tags: arrays, scala, linear-regression, spark-dataframe

Asked by Sam

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

I have the two arrays above, and I need to build a DataFrame from them like the following:

Tvalues                Pvalues
1.866393526974307      0.064020056478447
2.864048126935307      0.004808399479386827
......                 .....

So far I have been trying with StringBuilder in Scala, which does not work as expected. Please help me with this.

Answered by elm

Try, for instance:

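// Outside spark-shell, toDF needs: import spark.implicits._ (Spark 2.x+) or import sqlContext.implicits._ (Spark 1.x)
val df = sc.parallelize(tvalues zip pvalues).toDF("Tvalues","Pvalues")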

and thus

scala> df.show
+------------------+--------------------+
|           Tvalues|             Pvalues|
+------------------+--------------------+
| 1.866393526974307|   0.064020056478447|
| 2.864048126935307|0.004808399479386827|
| 4.032486069215076|8.914865448939047E-5|
| 7.876169953355888|7.489564524121306...|
| 4.875333799256043|2.836379410675604...|
|14.316322626848278|                 0.0|
+------------------+--------------------+

zip pairs the two arrays element-wise into tuples (the first element of each tuple comes from the first array, the second element from the other array), and parallelize turns those tuples into an RDD. toDF then transforms the RDD into a DataFrame of rows, one row for each tuple.
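
To see what the pairing step yields before Spark gets involved, here is a minimal plain-Scala sketch. The object name ZipSketch and the shortened three-element arrays are illustrative assumptions, not part of the original answer. It also shows a caveat worth checking for: zip silently truncates to the shorter input.

```scala
// Plain Scala, no Spark required: inspect what `zip` produces before it is parallelized.
// ZipSketch and the shortened arrays are hypothetical, for illustration only.
object ZipSketch {
  val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076)
  val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5)

  // One (t, p) tuple per index; each tuple would later become one DataFrame row.
  val pairs: Array[(Double, Double)] = tvalues.zip(pvalues)

  // zip stops at the shorter input, so a length mismatch drops data without an error.
  val truncated: Array[(Double, Double)] = tvalues.zip(pvalues.take(2))

  def main(args: Array[String]): Unit = {
    pairs.foreach { case (t, p) => println(s"$t\t$p") }
    println(s"pairs: ${pairs.length}, truncated: ${truncated.length}")
  }
}
```

If the arrays might differ in length, comparing tvalues.length with pvalues.length before zipping avoids losing rows unnoticed.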

Update

To build a DataFrame from multiple arrays (all of the same size), for instance 4 arrays, consider:

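// arr1 ... arr4 are assumed to be Array[Double] values of equal length.
// Note: this case class shadows org.apache.spark.sql.Row if that is also imported.
case class Row(i: Double, j: Double, k: Double, m: Double)

val xs = Array(arr1, arr2, arr3, arr4).transpose
val rdd = sc.parallelize(xs).map(ys => Row(ys(0), ys(1), ys(2), ys(3)))
val df = rdd.toDF("i","j","k","m")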
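
When the number of arrays is not fixed, a case class per arity becomes awkward. The following is a rough sketch of one alternative, assuming Spark 2.x or later (SparkSession) and all-Double columns. The object ArraysToDataFrame, the helper toDataFrame, and the sample data are hypothetical names and values, not from the original answer. It builds the schema from the column names and uses Spark's own org.apache.spark.sql.Row instead of a custom case class.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical helper: builds a DataFrame from any number of named, equal-length Double arrays.
object ArraysToDataFrame {
  def toDataFrame(spark: SparkSession, columns: Seq[(String, Array[Double])]): DataFrame = {
    require(columns.map(_._2.length).distinct.size == 1, "all arrays must have the same length")

    // One non-nullable Double field per input array, in the given order.
    val schema = StructType(columns.map { case (name, _) =>
      StructField(name, DoubleType, nullable = false)
    })

    // transpose turns N columns of length M into M rows of length N.
    val rows: Seq[Row] = columns.map(_._2.toSeq).transpose.map(values => Row.fromSeq(values))

    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("arrays-to-df").getOrCreate()
    val df = toDataFrame(spark, Seq(
      "i" -> Array(1.0, 2.0, 3.0),
      "j" -> Array(4.0, 5.0, 6.0),
      "k" -> Array(7.0, 8.0, 9.0),
      "m" -> Array(10.0, 11.0, 12.0)
    ))
    df.show()
    spark.stop()
  }
}
```

The explicit length check is a design choice: it fails with a clear message before any rows are built, rather than relying on transpose to reject ragged input.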