scala - How to create a Spark Dataset from an RDD

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37513667/


How to create a Spark Dataset from an RDD

Tags: scala, apache-spark, dataset, apache-spark-dataset

Asked by javadba

I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.

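For concreteness, a minimal sketch of such an RDD is shown below; the variable name training is the one reused throughout the answer, while the values, the SparkContext sc, and the choice of the spark.ml LabeledPoint (on Spark 1.x use org.apache.spark.mllib.regression.LabeledPoint instead) are assumptions:

import org.apache.spark.ml.feature.LabeledPoint   // assumed: the Spark 2.x spark.ml variant
import org.apache.spark.ml.linalg.Vectors

val training = sc.parallelize(Seq(                // sc: an existing SparkContext (assumed)
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))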

Answered by javadba

Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type - in this case a LabeledPoint:


val sqlContext = new SQLContext(sc)
import sqlContext.implicits._                             // needed for the .as[LabeledPoint] encoder

val pointsTrainDf = sqlContext.createDataFrame(training)  // RDD -> DataFrame
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]        // DataFrame -> Dataset[LabeledPoint]
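
As a quick sanity check, the resulting Dataset should carry the LabeledPoint schema (a double label column and a vector features column):

pointsTrainDs.printSchema()
pointsTrainDs.show(2)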

Update: Ever heard of a SparkSession? (Neither had I until now..)


So apparently the SparkSession is the Preferred Way(TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:


Spark 2.0.0+ approaches


Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.


val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._                           // encoders for case classes such as LabeledPoint

val pointsTrainDs = sparkSession.createDataset(training)  // RDD -> Dataset directly, no DataFrame step
val model = new LogisticRegression().fit(pointsTrainDs)   // spark.ml estimators accept a Dataset
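
A plausible follow-up, to show why the Dataset was wanted in the first place; testDs is an assumed held-out Dataset[LabeledPoint] built the same way as pointsTrainDs:

val predictions = model.transform(testDs)         // adds prediction/probability columns
predictions.select("label", "prediction").show()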

Second way for Spark 2.0.0+ (credit to @zero323)


val spark: org.apache.spark.sql.SparkSession = ???  // obtain a SparkSession (elided in the original)
import spark.implicits._                            // enables .toDS() on RDDs of case classes

val trainDs = training.toDS()
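
The conversion is also cheap to undo: a Dataset exposes its underlying RDD, which is handy when mixing the two APIs:

val backToRdd = trainDs.rdd   // back to an RDD[LabeledPoint]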

Traditional Spark 1.X and earlier approach


val sqlContext = new SQLContext(sc)   // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()      // `splits` is assumed to come from a randomSplit (see the sketch below)
val test = splits(1)
val trainDs = training.toDS()
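
For completeness, a sketch of where the splits array above might come from; the file path, split ratios, and seed are assumptions:

import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // RDD[LabeledPoint]
val splits = data.randomSplit(Array(0.7, 0.3), seed = 11L)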

See also: How to store custom objects in Dataset? by the esteemed @zero323.
