Original source: http://stackoverflow.com/questions/30925819/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
From DataFrame to RDD[LabeledPoint]
Asked by Miguel
I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
import org.apache.spark.mllib.regression.LabeledPoint
val sql = new SQLContext(sc)
// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)
// Convert the RDD to a dataframe
val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)
// Tokenize
val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)
// TF-IDF
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
tf.cache
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(tf)
val tfidf = idfModel.transform(tf)
// Create labeled points
// Does not compile: row.get(4) is typed Any, but LabeledPoint expects a Vector
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))
I need to use dataframes to generate the tokens and create the TF-IDF features. The problem appears when I try to convert this dataframe to an RDD[LabeledPoint]. I map over the dataframe rows, but the get method of Row returns Any, not the type defined in the dataframe schema (Vector). Therefore, I cannot construct the RDD I need to train an ML model.
What is the best option to get an RDD[LabeledPoint] after calculating a TF-IDF?
Accepted answer by zzztimbo
Casting the object worked for me.
Try:
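// Create labeled points (Vector must be Spark's type, not scala.collection.immutable.Vector)
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector]))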
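To illustrate why a cast is needed at all, here is a small Spark-free sketch. SimpleRow is a hypothetical stand-in for Spark's Row, not its real implementation; it only mimics the fact that a row holds values of mixed types and therefore hands them back as Any. An Array[Double] stands in for the feature vector.

```scala
import scala.util.Try

// Hypothetical stand-in for org.apache.spark.sql.Row (not Spark's implementation):
// values of different types are stored together, so access is typed as Any.
final case class SimpleRow(values: Seq[Any]) {
  def get(i: Int): Any = values(i)
}

object CastSketch {
  val featureValues: Array[Double] = Array(0.5, 0.0, 2.0)

  // Column 0 plays the role of the label, column 1 the feature vector.
  val row = SimpleRow(Seq(1.0, featureValues))

  // row.get(1) has static type Any, so this would not compile:
  //   val features: Array[Double] = row.get(1)

  // The cast tells the compiler what the runtime value already is.
  val features: Array[Double] = row.get(1).asInstanceOf[Array[Double]]

  // The cast is only checked at runtime: asking for the wrong type
  // fails with a ClassCastException instead of a compile error.
  val wrongCast: Try[Int] = Try {
    val notAnArray: Array[Double] = row.get(0).asInstanceOf[Array[Double]]
    notAnArray.length
  }
}
```

The trade-off is that the compiler can no longer help: if the column index or the target type is wrong, the error only shows up when the job runs.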
Answer by Chris
You need to use getAs[T](i: Int): T
// Create labeled points
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.getAs[Vector](4)))
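As a possible variation, the columns can be read by name rather than by position, which avoids counting column indices. This is only a sketch under some assumptions: it expects the tfidf DataFrame from the question, Spark 1.4 or later (for getAs by field name), and a Spark 1.x setup where the features column holds an org.apache.spark.mllib.linalg.Vector. Note also that the question's schema declares class as StringType, so getDouble(0) would probably fail at runtime with a ClassCastException; the sketch parses the label instead, assuming every label is a numeric string such as "0" or "1".

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Assumes `tfidf` is the DataFrame built in the question, with a string
// column "class" and a vector column "features".
val labeled: RDD[LabeledPoint] = tfidf.rdd.map { row =>
  // Assumption: labels are numeric strings; toDouble throws on anything else.
  val label = row.getAs[String]("class").toDouble
  // Typed access by column name instead of by position.
  val features = row.getAs[Vector]("features")
  LabeledPoint(label, features)
}
```

Going through tfidf.rdd makes the result an RDD explicitly. If the class labels are not numeric (for example category names), they would first need to be mapped to doubles before building the LabeledPoint.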

