Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector] (Scala)
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/33048177/
Asked by Yeye
I am relatively new to Spark and Scala.
I am starting with the following DataFrame (a single column holding a dense vector of Doubles):
scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]
scala> scaledDataOnly_pruned.show(5)
+--------------------+
|            features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+
Converting it directly with .rdd yields an instance of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:
scala> val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd
scaledDataOnly_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[32] at rdd at <console>:66
Does anyone know how to convert this DF to an instance of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] instead? My various attempts have been unsuccessful so far.
Thank you in advance for any pointers!
Answered by Yeye
Just found out:
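// Requires: import org.apache.spark.sql.Row and import org.apache.spark.mllib.linalg.Vector
val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd.map { x: Row => x.getAs[Vector](0) }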
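For anyone who wants to try the one-liner end to end, here is a minimal sketch. It assumes Spark 2.x or later (for SparkSession) running in local mode. The object name RowToVectorSketch, the helper toVectorRdd, and the two sample vectors are made up for illustration and only stand in for scaledDataOnly_pruned.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowToVectorSketch {
  // Hypothetical helper: pull the mllib Vector out of the first column of every Row.
  def toVectorRdd(df: DataFrame): RDD[Vector] =
    df.rdd.map { row: Row => row.getAs[Vector](0) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-to-vector-sketch")
      .getOrCreate()

    // Made-up data: a single "features" column of dense mllib vectors.
    val df = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(1.0, 2.0)),
      Tuple1(Vectors.dense(3.0, 4.0))
    )).toDF("features")

    val vectors: RDD[Vector] = toVectorRdd(df)
    vectors.collect().foreach(println)

    spark.stop()
  }
}
```

Going through .rdd first keeps the map on the RDD API, so no Encoder for Vector is needed. Note that this assumes the column really holds org.apache.spark.mllib.linalg vectors; a column produced by the newer org.apache.spark.ml pipeline stages would need the org.apache.spark.ml.linalg.Vector type instead.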
Answered by andrew
EDIT: use a more sophisticated way to interpret the fields in a Row.
This worked for me:
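// Requires: import org.apache.spark.mllib.linalg.Vectors
// .rdd keeps this working on Spark 2.x+, where DataFrame.map needs an Encoder
val featureVectors = features.rdd.map(row => {
  Vectors.dense(row.toSeq.toArray.map({
    case s: String => s.toDouble
    case l: Long => l.toDouble
    case d: Double => d   // pass Doubles through unchanged
    case _ => 0.0         // any other type (or null) becomes 0.0
  }))
})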
Here, features is a Spark SQL DataFrame.
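The per-field matching above can be tried without a cluster. Below is a small plain-Scala sketch of the same idea. The names ToDoubleSketch, toDouble and rowValuesToDoubles are hypothetical, and the extra cases (Double, Float, Int, unparseable strings) are my own assumptions about what a mixed row might contain, not part of the original answer.

```scala
import scala.util.Try

object ToDoubleSketch {
  // Convert one loosely typed field to Double.
  // Unknown types, nulls and unparseable strings fall back to 0.0.
  def toDouble(value: Any): Double = value match {
    case d: Double => d
    case f: Float  => f.toDouble
    case l: Long   => l.toDouble
    case i: Int    => i.toDouble
    case s: String => Try(s.trim.toDouble).getOrElse(0.0)
    case _         => 0.0
  }

  // Convert all fields of one row (for example, the result of row.toSeq) to doubles.
  def rowValuesToDoubles(values: Seq[Any]): Array[Double] =
    values.map(toDouble).toArray
}
```

In a Spark job, the resulting array could be wrapped with Vectors.dense(ToDoubleSketch.rowValuesToDoubles(row.toSeq)). Silently mapping bad values to 0.0 is convenient but can hide data problems, so failing loudly may be the better choice for real pipelines.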
Answered by Santoshi M
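import org.apache.spark.mllib.linalg.Vectors

// Applies when "features" is an array<double> column;
// for a vector column, use row.getAs[Vector]("features") instead.
scaledDataOnly
  .rdd
  .map { row =>
    Vectors.dense(row.getAs[Seq[Double]]("features").toArray)
  }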

