scala Spark 多类分类示例

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/32029314/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-22 07:27:51  来源:igfitidea点击:

Spark Multiclass Classification Example

scalaapache-sparkapache-spark-mllibrandom-forestapache-spark-ml

提问by deniswsrosa

Do you guys know where can I find examples of multiclass classification in Spark. I spent a lot of time searching in books and in the web, and so far I just know that it is possible since the latest version according the documentation.

你们知道我在哪里可以找到 Spark 中多类分类的例子吗?我花了很多时间在书籍和网络上搜索,到目前为止,我只知道根据文档的最新版本是可能的。

回答by zero323

ML

机器学习

(Recommended in Spark 2.0+)

推荐在 Spark 2.0+ 中

We'll use the same data as in the MLlib below. There are two basic options. If Estimatorsupports multilclass classification out-of-the-box (for example random forest) you can use it directly:

我们将使用与下面 MLlib 中相同的数据。有两个基本选项。如果Estimator支持开箱即用的多类分类(例如随机森林),您可以直接使用它:

val trainRawDf = trainRaw.toDF

import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.classification.RandomForestClassifier

val transformers = Array(
  new StringIndexer().setInputCol("group").setOutputCol("label"),
  new Tokenizer().setInputCol("text").setOutputCol("tokens"),
  new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)


val rf = new RandomForestClassifier() 
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)

model.transform(trainRawDf)

If model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifieryou can use one-vs-rest strategy:

如果模型仅支持二元分类(逻辑回归)并扩展o.a.s.ml.classification.Classifier,则可以使用一对一策略:

import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression() 
  .setLabelCol("label")
  .setFeaturesCol("features")

val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)

MLLib

机器学习库

According to the official documentationat this moment (MLlib 1.6.0) following methods support multiclass classification:

根据目前官方文档(MLlib 1.6.0)以下方法支持多类分类:

  • logistic regression,
  • decision trees,
  • random forests,
  • naive Bayes
  • 逻辑回归,
  • 决策树,
  • 随机森林,
  • 朴素贝叶斯

At least some of the examples use multiclass classification:

至少有一些示例使用多类分类:

General framework, ignoring method specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-processes your input to create either data frame with columns representing labeland features:

忽略方法特定参数的通用框架与 MLlib 中的所有其他方法几乎相同。您必须预处理您的输入以创建具有代表label和 的列的任一数据框features

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

or RDD[LabeledPoint].

RDD[LabeledPoint]

Spark provides broad range of useful tools designed to facilitate this process including Feature Extractorsand Feature Transformersand pipelines.

Spark 提供了广泛的有用工具,旨在促进这一过程,包括特征提取器特征转换器管道

You'll find a rather naive example of using Random Forest below.

你会在下面找到一个使用随机森林的相当幼稚的例子。

First lets import required packages and create dummy data:

首先让我们导入所需的包并创建虚拟数据:

import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer} 
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

case class LabeledRecord(group: String, text: String)

val trainRaw = sc.parallelize(
    LabeledRecord("foo", "foo v a y b  foo") ::
    LabeledRecord("bar", "x bar y bar v") ::
    LabeledRecord("bar", "x a y bar z") ::
    LabeledRecord("foobar", "foo v b bar z") ::
    LabeledRecord("foo", "foo x") ::
    LabeledRecord("foobar", "z y x foo a b bar v") ::
    Nil
)

Now let's define required transformers and process train Dataset:

现在让我们定义所需的变压器和工艺流程Dataset

// Tokenizer to process text fields
val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(10)

// Indexer to convert String labels to Double
val indexer = new StringIndexer()
    .setInputCol("group")
    .setOutputCol("label")
    .fit(trainRaw.toDF)


def transfom(rdd: RDD[LabeledRecord]) = {
    val tokenized = tokenizer.transform(rdd.toDF)
    val hashed = hashingTF.transform(tokenized)
    val indexed = indexer.transform(hashed)
    indexed
        .select($"label", $"features")
        .map{case Row(label: Double, features: Vector) =>
            LabeledPoint(label, features)}
}

val train: RDD[LabeledPoint] = transfom(trainRaw)

Please note that indexeris "fitted" on the train data. It simply means that categorical values used as the labels are converted to doubles. To use classifier on a new data you have to transform it first using this indexer.

请注意,它indexer是“拟合”在火车数据上的。它只是意味着用作标签的分类值被转换为doubles. 要在新数据上使用分类器,您必须先使用 this 对其进行转换indexer

Next we can train RF model:

接下来我们可以训练 RF 模型:

val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16

val model = RandomForest.trainClassifier(
    train, numClasses, categoricalFeaturesInfo, 
    numTrees, featureSubsetStrategy, impurity,
    maxDepth, maxBins
)

and finally test it:

最后测试一下:

val testRaw = sc.parallelize(
    LabeledRecord("foo", "foo  foo z z z") ::
    LabeledRecord("bar", "z bar y y v") ::
    LabeledRecord("bar", "a a  bar a z") ::
    LabeledRecord("foobar", "foo v b bar z") ::
    LabeledRecord("foobar", "a foo a bar") ::
    Nil
)

val test: RDD[LabeledPoint] = transfom(testRaw)

val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)

metrics.precision
metrics.recall

回答by chelBert

Are you using Spark 1.6 rather than Spark 2.1? I think the problem is that in spark 2.1 the transform method returns a dataset, which can be implicitly converted to a typed RDD, where as prior to that, it returns a data frame or row.

您使用的是 Spark 1.6 而不是 Spark 2.1?我认为问题在于,在 spark 2.1 中,transform 方法返回一个数据集,该数据集可以隐式转换为类型化 RDD,而在此之前,它返回一个数据框或行。

Try as a diagnostic specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.

尝试作为将转换函数的返回类型指定为 RDD[LabeledPoint] 的诊断,看看是否得到相同的错误。