scala Spark 多类分类示例

Question

提问by deniswsrosa

Do you guys know where can I find examples of multiclass classification in Spark. I spent a lot of time searching in books and in the web, and so far I just know that it is possible since the latest version according the documentation.

你们知道我在哪里可以找到 Spark 中多类分类的例子吗？我花了很多时间在书籍和网络上搜索，到目前为止，我只知道根据文档的最新版本是可能的。

Answer 1

回答by zero323

ML

机器学习

(Recommended in Spark 2.0+)

（推荐在 Spark 2.0+ 中）

We'll use the same data as in the MLlib below. There are two basic options. If Estimatorsupports multilclass classification out-of-the-box (for example random forest) you can use it directly:

我们将使用与下面 MLlib 中相同的数据。有两个基本选项。如果Estimator支持开箱即用的多类分类（例如随机森林），您可以直接使用它：

val trainRawDf = trainRaw.toDF

import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.classification.RandomForestClassifier

val transformers = Array(
  new StringIndexer().setInputCol("group").setOutputCol("label"),
  new Tokenizer().setInputCol("text").setOutputCol("tokens"),
  new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)


val rf = new RandomForestClassifier() 
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)

model.transform(trainRawDf)

If model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifieryou can use one-vs-rest strategy:

如果模型仅支持二元分类（逻辑回归）并扩展o.a.s.ml.classification.Classifier，则可以使用一对一策略：

import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression() 
  .setLabelCol("label")
  .setFeaturesCol("features")

val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)

MLLib

机器学习库

According to the official documentationat this moment (MLlib 1.6.0) following methods support multiclass classification:

根据目前官方文档（MLlib 1.6.0）以下方法支持多类分类：

logistic regression,
decision trees,
random forests,
naive Bayes

逻辑回归，
决策树，
随机森林，
朴素贝叶斯

At least some of the examples use multiclass classification:

至少有一些示例使用多类分类：

Naive Bayes example- 3 classes
Logistic regression- 10 classes for classifier although only 2 in the example data

朴素贝叶斯示例- 3 个类
逻辑回归- 分类器有 10 个类别，尽管示例数据中只有 2 个

General framework, ignoring method specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-processes your input to create either data frame with columns representing labeland features:

忽略方法特定参数的通用框架与 MLlib 中的所有其他方法几乎相同。您必须预处理您的输入以创建具有代表label和的列的任一数据框features：

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

or RDD[LabeledPoint].

或RDD[LabeledPoint]。

Spark provides broad range of useful tools designed to facilitate this process including Feature Extractorsand Feature Transformersand pipelines.

Spark 提供了广泛的有用工具，旨在促进这一过程，包括特征提取器和特征转换器和管道。

You'll find a rather naive example of using Random Forest below.

你会在下面找到一个使用随机森林的相当幼稚的例子。

First lets import required packages and create dummy data:

首先让我们导入所需的包并创建虚拟数据：

import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer} 
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

case class LabeledRecord(group: String, text: String)

val trainRaw = sc.parallelize(
    LabeledRecord("foo", "foo v a y b  foo") ::
    LabeledRecord("bar", "x bar y bar v") ::
    LabeledRecord("bar", "x a y bar z") ::
    LabeledRecord("foobar", "foo v b bar z") ::
    LabeledRecord("foo", "foo x") ::
    LabeledRecord("foobar", "z y x foo a b bar v") ::
    Nil
)

Now let's define required transformers and process train Dataset:

现在让我们定义所需的变压器和工艺流程Dataset：

// Tokenizer to process text fields
val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(10)

// Indexer to convert String labels to Double
val indexer = new StringIndexer()
    .setInputCol("group")
    .setOutputCol("label")
    .fit(trainRaw.toDF)


def transfom(rdd: RDD[LabeledRecord]) = {
    val tokenized = tokenizer.transform(rdd.toDF)
    val hashed = hashingTF.transform(tokenized)
    val indexed = indexer.transform(hashed)
    indexed
        .select($"label", $"features")
        .map{case Row(label: Double, features: Vector) =>
            LabeledPoint(label, features)}
}

val train: RDD[LabeledPoint] = transfom(trainRaw)

Please note that indexeris "fitted" on the train data. It simply means that categorical values used as the labels are converted to doubles. To use classifier on a new data you have to transform it first using this indexer.

请注意，它indexer是“拟合”在火车数据上的。它只是意味着用作标签的分类值被转换为doubles. 要在新数据上使用分类器，您必须先使用 this 对其进行转换indexer。

Next we can train RF model:

接下来我们可以训练 RF 模型：

val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16

val model = RandomForest.trainClassifier(
    train, numClasses, categoricalFeaturesInfo, 
    numTrees, featureSubsetStrategy, impurity,
    maxDepth, maxBins
)

and finally test it:

最后测试一下：

val testRaw = sc.parallelize(
    LabeledRecord("foo", "foo  foo z z z") ::
    LabeledRecord("bar", "z bar y y v") ::
    LabeledRecord("bar", "a a  bar a z") ::
    LabeledRecord("foobar", "foo v b bar z") ::
    LabeledRecord("foobar", "a foo a bar") ::
    Nil
)

val test: RDD[LabeledPoint] = transfom(testRaw)

val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)

metrics.precision
metrics.recall

Answer 2

回答by chelBert

Are you using Spark 1.6 rather than Spark 2.1? I think the problem is that in spark 2.1 the transform method returns a dataset, which can be implicitly converted to a typed RDD, where as prior to that, it returns a data frame or row.

您使用的是 Spark 1.6 而不是 Spark 2.1？我认为问题在于，在 spark 2.1 中，transform 方法返回一个数据集，该数据集可以隐式转换为类型化 RDD，而在此之前，它返回一个数据框或行。

Try as a diagnostic specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.

尝试作为将转换函数的返回类型指定为 RDD[LabeledPoint] 的诊断，看看是否得到相同的错误。

scala Spark 多类分类示例

提问by deniswsrosa

回答by zero323

回答by chelBert

相关推荐

最近更新

标签

scala Spark 多类分类示例

提问by deniswsrosa

回答by zero323

回答by chelBert

相关推荐

如何找到 scala.MatchError 的来源？

如何在 Scala 集合中使用多列进行分组

scala 如何构建高效的Kafka broker健康检查？

scala 如何使用指定的模式创建一个空的 DataFrame？

相关推荐

最近更新

标签