Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must comply with the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38063195/
How to create DataFrame from Scala's List of Iterables?
Asked by MTT
I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I get this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
Accepted answer by MTT
As zero323 mentioned, we first need to convert the List[Iterable[Any]] to a List[Row], then put the rows in an RDD and prepare a schema for the Spark DataFrame.
To convert the List[Iterable[Any]] to a List[Row], we can write
val rows = values.map { x => Row(x.toSeq: _*) }
and then, given a schema (a StructType, here called schema) that matches the rows, we can make the RDD
val rdd = sparkContext.makeRDD(rows)
and finally create a Spark DataFrame:
val df = sqlContext.createDataFrame(rdd, schema)
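Putting the pieces together, a minimal end-to-end sketch of this answer might look like the following. The sample data, column names, and StringType fields are assumptions for illustration; adapt them to the actual shape of your data.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sample input with the same shape as the question's values.
val values: List[Iterable[Any]] = List(List("1", "One"), List("2", "Two"))

// Convert each Iterable into a Row (toSeq is needed for the varargs expansion).
val rows = values.map { x => Row(x.toSeq: _*) }

// A schema matching the two string columns above (the names are made up here).
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val rdd = sparkContext.makeRDD(rows)
val df = sqlContext.createDataFrame(rdd, schema)
```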
Answered by sparker
That's what Spark's implicits object is for. It lets you convert common Scala collection types into a DataFrame / Dataset / RDD. Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2d list. Here is something I tried on spark-shell. I converted the 2d List to a List of Tuples and used the implicit conversion to DataFrame:
val values = List(List("1", "One") ,List("2", "Two") ,List("3", "Three"),List("4","4")).map(x =>(x(0), x(1)))
import spark.implicits._
val df = values.toDF
Edit2: The original question by MTT was how to create a Spark DataFrame from a Scala list for a 2d list, for which this is a correct answer. The original revision is https://stackoverflow.com/revisions/38063195/1 . The question was later changed to match an accepted answer. Adding this edit so that anyone else looking for something similar to the original question can find it.
Answered by Josh Cason
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
Answered by Viacheslav Shalamov
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))
Answered by sun007
In Spark 2 we can use a Dataset by just converting the list to a DS with the toDS API:
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
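The flatMap(_.split(",")) step is plain Scala, so its effect can be seen without Spark; toDS then simply wraps the flattened list in a Dataset. A small sketch of just the splitting step (the sample list is an assumption for illustration):

```scala
// The split-and-flatten step on its own (no Spark needed):
val list = List("a,b", "c,d,e")
val flattened = list.flatMap(_.split(","))
// flattened == List("a", "b", "c", "d", "e")
```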
or
val ds = list.toDS()
This is more convenient than rdd or df.

