Scala: How to create a Spark Dataset from an RDD
Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37513667/
How to create a Spark Dataset from an RDD
Asked by javadba
I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note the newer spark.ml APIs require inputs in the Dataset format.
Answered by javadba
Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset using the desired object type - in this case a LabeledPoint:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // encoders needed for .as[LabeledPoint]
val pointsTrainDf = sqlContext.createDataFrame(training) // RDD[LabeledPoint] -> DataFrame
val pointsTrainDs = pointsTrainDf.as[LabeledPoint] // DataFrame -> Dataset[LabeledPoint]
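For context, here is a minimal sketch of how the training RDD referenced above might be built. The toy values are made up for illustration and are not part of the original answer:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical toy data; any RDD[LabeledPoint] converts the same way.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))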
Update: Ever heard of a SparkSession? (Neither had I until now...)
So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:
Spark 2.0.0+ approaches
Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // encoders needed for createDataset

val pointsTrainDs = sparkSession.createDataset(training) // RDD[LabeledPoint] -> Dataset[LabeledPoint]
val model = new LogisticRegression().fit(pointsTrainDs) // spark.ml estimators expose fit(), not train()
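As a quick usage sketch (my addition - transform and the label/prediction column names are standard spark.ml conventions; reusing the training set here is only to show the call shape):

val predictions = model.transform(pointsTrainDs) // appends prediction/probability columns
predictions.select("label", "prediction").show(5)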
Second way for Spark 2.0.0+ (credit to @zero323)
val spark: org.apache.spark.sql.SparkSession = ??? // your existing SparkSession
import spark.implicits._ // enables the implicit RDD-to-Dataset conversion
val trainDs = training.toDS() // RDD[LabeledPoint] -> Dataset[LabeledPoint]
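To sanity-check any of these conversions, the standard Dataset inspection methods work as expected (this snippet is my illustration, not from the original answer):

trainDs.printSchema() // e.g. label: double, features: vector
trainDs.show(2)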
Traditional Spark 1.X and earlier approach
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._ // enables .toDS() on RDDs

val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS() // RDD[LabeledPoint] -> Dataset[LabeledPoint]
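For completeness, a hedged sketch of where splits typically comes from - randomSplit is standard RDD API, while data and the 70/30 weights here are my placeholders:

val data: org.apache.spark.rdd.RDD[LabeledPoint] = ??? // your source RDD
val splits = data.randomSplit(Array(0.7, 0.3), seed = 11L) // 70/30 train/test split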
See also: How to store custom objects in Dataset? by the esteemed @zero323.

