scala - How to create a Spark Dataset from an RDD

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37513667/


How to create a Spark Dataset from an RDD

Tags: scala, apache-spark, dataset, apache-spark-dataset

Asked by javadba

I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.

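For concreteness, a minimal sketch of such an RDD is shown below; the variable name training is the one reused throughout the answer, while the values, the SparkContext sc, and the choice of the spark.ml LabeledPoint (on Spark 1.x use org.apache.spark.mllib.regression.LabeledPoint instead) are assumptions:

import org.apache.spark.ml.feature.LabeledPoint   // assumed: the Spark 2.x spark.ml variant
import org.apache.spark.ml.linalg.Vectors

val training = sc.parallelize(Seq(                // sc: an existing SparkContext (assumed)
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))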

Answered by javadba

Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type - in this case a LabeledPoint:


val sqlContext = new SQLContext(sc)
import sqlContext.implicits._                             // needed for the .as[LabeledPoint] encoder

val pointsTrainDf = sqlContext.createDataFrame(training)  // RDD -> DataFrame
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]        // DataFrame -> Dataset[LabeledPoint]
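
As a quick sanity check, the resulting Dataset should carry the LabeledPoint schema (a double label column and a vector features column):

pointsTrainDs.printSchema()
pointsTrainDs.show(2)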

Update: Ever heard of a SparkSession? (Neither had I until now..)


So apparently the SparkSession is the Preferred Way(TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:


Spark 2.0.0+ approaches


Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.


val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._                           // encoders for case classes such as LabeledPoint

val pointsTrainDs = sparkSession.createDataset(training)  // RDD -> Dataset directly, no DataFrame step
val model = new LogisticRegression().fit(pointsTrainDs)   // spark.ml estimators accept a Dataset
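
A plausible follow-up, to show why the Dataset was wanted in the first place; testDs is an assumed held-out Dataset[LabeledPoint] built the same way as pointsTrainDs:

val predictions = model.transform(testDs)         // adds prediction/probability columns
predictions.select("label", "prediction").show()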

Second way for Spark 2.0.0+ (credit to @zero323)


val spark: org.apache.spark.sql.SparkSession = ???  // obtain a SparkSession (elided in the original)
import spark.implicits._                            // enables .toDS() on RDDs of case classes

val trainDs = training.toDS()
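
The conversion is also cheap to undo: a Dataset exposes its underlying RDD, which is handy when mixing the two APIs:

val backToRdd = trainDs.rdd   // back to an RDD[LabeledPoint]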

Traditional Spark 1.X and earlier approach


val sqlContext = new SQLContext(sc)   // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()      // `splits` is assumed to come from a randomSplit (see the sketch below)
val test = splits(1)
val trainDs = training.toDS()
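
For completeness, a sketch of where the splits array above might come from; the file path, split ratios, and seed are assumptions:

import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // RDD[LabeledPoint]
val splits = data.randomSplit(Array(0.7, 0.3), seed = 11L)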

See also: How to store custom objects in Dataset? by the esteemed @zero323.
