Original URL: http://stackoverflow.com/questions/35966921/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to convert spark DataFrame to RDD mllib LabeledPoints?
Asked by Tianyi Wang
I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame, but I need mllib LabeledPoints to feed my RandomForest. How can I do that? My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
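(Note: as the error shows, every overload of the ml PCA's transform takes a DataFrame, so it cannot be applied to a single vector. A per-vector transform does exist on the RDD-based org.apache.spark.mllib.feature.PCA; a rough sketch of that alternative, assuming the variables defined above:)
// Sketch only: the RDD-based mllib PCA (distinct from the ml pipeline PCA
// above) is fitted on an RDD[Vector] and can transform one vector at a time
import org.apache.spark.mllib.feature.{PCA => MLlibPCA}
val rddPca = new MLlibPCA(100).fit(trainingData.map(_.features))
val pcaTrainingRdd = trainingData.map(p => p.copy(features = rddPca.transform(p.features)))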
And the second attempt:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?
Answered by Tzach Zohar
The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
1. The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector, which DOES.
2. The DataFrame returned from pca.transform() has 3 columns, and you tried to extract column number 4. For example, showing the first 4 rows:
+-----+--------------------+--------------------+
|label|            features|         pcaFeatures|
+-----+--------------------+--------------------+
|  5.0|(780,[152,153,154...|[880.071111851977...|
|  1.0|(780,[158,159,160...|[-41.473039034112...|
|  2.0|(780,[155,156,157...|[931.444898405036...|
|  1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
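As an aside on the first point, one way to rule out the name clash altogether (optional, just a sketch) is to alias the import so the compiler can never resolve "Vector" to the Scala collection class by mistake:
// Hypothetical alias: guarantees "MLVector" always means the mllib class
import org.apache.spark.mllib.linalg.{Vector => MLVector}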
So here's the transformation you need:
所以这是您需要的转换:
// Convert the PCA output back to an RDD[LabeledPoint], extracting the
// columns by name rather than by position
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
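From there, the rest of the original pipeline runs unchanged. For completeness, a sketch that reuses the variables from the question (labeled, testData, and the random forest parameters are assumed to be in scope; the PCA model fitted on the training split is applied to the test split as well):
// Convert the test split the same way, using the already-fitted PCA model
val labeledTest = pca.transform(testData.toDF()).rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
// Train the random forest on the converted RDD[LabeledPoint]
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)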