Original URL: http://stackoverflow.com/questions/25038294/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How do I run the Spark decision tree with a categorical feature set using Scala?
Asked by Climbs_lika_Spyder
I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data, and LabeledPoint requires (double, vector), where the vector requires doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification, categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far: tree config, decision tree, labeledpoint
Answered by lam
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, along with each feature's arity (the number of distinct categories of that feature), by specifying a map Map[Int, Int] from feature indices to their arities.
For example, if you have data such as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform the data into numerical format:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
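
One way to produce that numeric form is to collect the distinct values of each string column and replace each value with its index. A minimal sketch, assuming Spark 1.x MLlib and that the original rows sit in a file data.csv (the file name, master setting, and variable names are illustrative assumptions, not from the answer):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("prep").setMaster("local[*]"))

// Each parsed row looks like Array("1", "a", "add").
val rawData = sc.textFile("data.csv").map(_.split(","))

// Map every distinct string in columns 1 and 2 to a 0-based index.
val col1Index = rawData.map(_(1)).distinct().collect().zipWithIndex.toMap
val col2Index = rawData.map(_(2)).distinct().collect().zipWithIndex.toMap

// Rewrite each row numerically: the label stays, categories become indices.
val numeric = rawData.map { row =>
  Array(row(0).toDouble, col1Index(row(1)).toDouble, col2Index(row(2)).toDouble)
}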
Once the data is in that numeric format you can load it into Spark. Then, if you want to tell Spark that the second and third columns are categorical, you should create a map:
val categoricalFeaturesInfo = Map[Int, Int]((1, 3), (2, 5))
The map says that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. They will be treated as categorical when we build a decision tree model, passing the map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
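
Putting it together, a hedged end-to-end sketch (the impurity choice, maxBins value, and variable names are assumptions; only trainClassifier itself comes from the answer). One detail worth noting: categoricalFeaturesInfo indexes positions within the feature vector, so once the label column is stripped off, raw columns 1 and 2 become features 0 and 1:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Assumes numeric: RDD[Array[Double]] from the sketch above, with the class
// label in column 0. MLlib expects labels in {0, ..., numClasses - 1}, so the
// sample labels 1..3 are shifted down by one.
val trainingData = numeric.map(row => LabeledPoint(row(0) - 1, Vectors.dense(row.tail)))

val numClasses = 3
// Feature 0 (raw column 1) has arity 3; feature 1 (raw column 2) has arity 5.
val catInfo = Map[Int, Int](0 -> 3, 1 -> 5)
val impurity = "gini"
val maxDepth = 3
val maxBins = 32 // must be at least as large as the biggest arity

val model = DecisionTree.trainClassifier(trainingData, numClasses, catInfo,
  impurity, maxDepth, maxBins)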
Answered by dirceusemighini
Strings are not supported by LabeledPoint. One way to get your data into a LabeledPoint, given that your strings are categorical, is to split the data into multiple columns.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, turning each distinct string value into a new column:
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
Since you have 3 distinct string values, you convert the string column into 3 new columns, and each row's value is represented by an indicator in these new columns.
Now your dataset will be:
id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
You can now convert these values into Doubles and use them in your LabeledPoint.
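
A minimal sketch of that one-hot expansion, assuming the rows are available as an RDD[(Int, String, Int)] of (id, string, intValue) tuples (the variable names here are illustrative):

// Collect the distinct string values; sorting fixes the column order.
val categories = data.map(_._2).distinct().collect().sorted // Array("a", "b", "c")

// Expand the string column into one 0.0/1.0 indicator column per category.
val encoded = data.map { case (id, s, intValue) =>
  val oneHot = categories.map(c => if (c == s) 1.0 else 0.0)
  (id, oneHot, intValue)
}

From encoded you could then build each LabeledPoint with, for example, Vectors.dense(oneHot :+ intValue.toDouble), placing the label wherever your schema keeps it.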
Another way to convert your strings for a LabeledPoint is to create a distinct list of the values in each column and convert each string into its index in that list. This is not recommended, because in this dataset it would give:
a = 0
b = 1
c = 2
But in this case the algorithm would consider a to be closer to b than to c, an ordering that does not actually exist in the data.
Answered by yanbohappy
You need to check the type of array x. The error log says the items in array x are Strings, which are not supported here: current Spark Vectors can only be filled with Doubles.
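
Concretely, if every value in x.tail is numeric text (for example after the categorical values have been mapped to indices), the original line compiles once the Strings are converted. A hedged one-line fix, assuming each field parses as a number:

// Convert the String fields to Double before building the dense vector.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)), Vectors.dense(x.tail.map(_.toDouble))))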

