
Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38063195/

Date: 2020-10-22 08:25:29 · Source: igfitidea

How to create DataFrame from Scala's List of Iterables?

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe

Asked by MTT

I have the following Scala value:


val values: List[Iterable[Any]] = Traces().evaluate(features).toList

and I want to convert it to a DataFrame.


When I try the following:


sqlContext.createDataFrame(values)

I got this error:


error: overloaded method value createDataFrame with alternatives:

[A <: Product](data: Seq[A])(implicit evidence: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame 
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
          sqlContext.createDataFrame(values)

Why?


Accepted answer by MTT

As zero323 mentioned, we first need to convert List[Iterable[Any]] to List[Row], then put the rows in an RDD and prepare a schema for the Spark data frame.

To convert List[Iterable[Any]] to List[Row], we can write:

import org.apache.spark.sql.Row

val rows = values.map { x => Row(x.toSeq: _*) } // Row takes Any*; toSeq is needed because the `: _*` splat requires a Seq

and then, with a schema defined for the data (a StructType value, here called schema), we can make the RDD:

val rdd = sparkContext.makeRDD[Row](rows)

and finally create the Spark data frame:

val df = sqlContext.createDataFrame(rdd, schema)
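The key Scala mechanism in the mapping step above is the `: _*` splat, which expands a collection into varargs. A minimal sketch with no Spark dependency, using a hypothetical FakeRow stand-in for Spark's Row factory:

```scala
// FakeRow is a hypothetical stand-in for Spark's Row(values: Any*)
// factory, used only to illustrate the varargs mechanics.
case class FakeRow(values: Any*)

val data: List[Iterable[Any]] = List(List(1, "a"), List(2, "b"))

// `x.toSeq: _*` expands each iterable's elements into individual
// varargs arguments, mirroring values.map { x => Row(x.toSeq: _*) }.
val rows = data.map(x => FakeRow(x.toSeq: _*))
```

Each FakeRow ends up holding the elements of one inner list, which is exactly the per-row shape that createDataFrame expects.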

Answered by sparker

That's what the Spark implicits object is for. It lets you convert common Scala collection types into a DataFrame / Dataset / RDD. Here is an example with Spark 2.0, but it exists in older versions too:

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

Edit: just realised you were after a 2-D list. Here is something I tried in spark-shell: I converted the 2-D list to a list of tuples and used the implicit conversion to DataFrame:

val values = List(List("1", "One"), List("2", "Two"), List("3", "Three"), List("4", "4")).map(x => (x(0), x(1)))
import spark.implicits._
val df = values.toDF
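The tuple conversion itself is plain Scala and can be checked without Spark; a small sketch (the x(0)/x(1) indexing assumes every inner list has at least two elements):

```scala
val values2d = List(List("1", "One"), List("2", "Two"), List("3", "Three"))

// Turn each inner list into a pair; tuples carry the Product shape
// that toDF needs for a two-column DataFrame.
val pairs: List[(String, String)] = values2d.map(x => (x(0), x(1)))
```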

Edit 2: MTT's original question was how to create a Spark dataframe from a 2-D Scala list, for which this is a correct answer. That revision is preserved at https://stackoverflow.com/revisions/38063195/1; the question was later changed to match the accepted answer. I'm adding this edit so that anyone looking for something similar to the original question can find it.

Answered by Josh Cason

Simplest approach:


val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
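The Tuple1 wrapping is what gives a flat list the one-column Product shape that createDataFrame/toDF expects; the wrapping step alone is plain Scala:

```scala
val yourList = List("a", "b", "c")

// Wrap each element in Tuple1 so every entry becomes a one-field row.
val newList = yourList.map(Tuple1(_))
```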

Answered by Viacheslav Shalamov

The most concise way I've found:


val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))

Answered by sun007

In Spark 2 we can use a Dataset by simply converting the list to a DS with the toDS API:

val ds = list.flatMap(_.split(",")).toDS() // Records split by comma 

or


val ds = list.toDS()

This is more convenient than going through an rdd or a df.
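The flatMap step that splits comma-delimited records before the Dataset conversion is itself plain Scala and can be sketched without Spark:

```scala
val list = List("a,b", "c", "d,e,f")

// split returns an Array[String]; flatMap flattens the arrays into
// one list of individual records.
val records = list.flatMap(_.split(","))
```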