Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31749593/

Time: 2020-10-22 07:25:29  Source: igfitidea

How to extract best parameters from a CrossValidatorModel

scala apache-spark pipeline cross-validation apache-spark-mllib

Asked by Mohammad

I want to find the parameters from ParamGridBuilder that produce the best model in CrossValidator in Spark 1.4.x.

In the Pipeline Example in the Spark documentation, different parameters (numFeatures, regParam) are added to the Pipeline by using ParamGridBuilder. The best model is then produced by the following line of code:

val cvModel = crossval.fit(training.toDF)

Now I want to know which parameters (numFeatures, regParam) from ParamGridBuilder produce the best model.

I have already tried the following commands without success:

cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()

Any help?

Thanks in advance,

Answer by Adam Vogel

One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:

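import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidatorModel

implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  // Pair each candidate ParamMap with its average metric and keep the highest-scoring one
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
           .zip(cvModel.avgMetrics)
           .maxBy(_._2)
           ._1
  }
}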

When run on the CrossValidatorModel trained in the Pipeline Example you cited, this gives:

scala> println(cvModel.bestEstimatorParamMap)
{
   hashingTF_2b0b8ccaeeec-numFeatures: 100,
   logreg_950a13184247-regParam: 0.1
}
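
The zip-and-argmax idea above can be tried without a Spark cluster. The following is only a minimal sketch in plain Scala: a Map stands in for Spark's ParamMap, and the names BestParamSketch, bestParams, and largerIsBetter, as well as the grid and metric values, are made up for illustration. The largerIsBetter flag is an assumption added here to cover evaluators where a smaller metric is better; the answer above always takes the maximum.

```scala
// Plain-Scala stand-in for the argmax lookup; nothing here calls Spark.
object BestParamSketch {
  // A Map plays the role of Spark's ParamMap in this sketch.
  type Params = Map[String, Double]

  // Pair each candidate with its average metric and keep the best one.
  // largerIsBetter is a hypothetical flag: pass false for error-style metrics.
  def bestParams(grid: Seq[Params],
                 avgMetrics: Seq[Double],
                 largerIsBetter: Boolean): Params = {
    require(grid.nonEmpty && grid.length == avgMetrics.length,
      "grid and avgMetrics must be non-empty and of equal length")
    val scored = grid.zip(avgMetrics)
    val (best, _) =
      if (largerIsBetter) scored.maxBy(_._2) else scored.minBy(_._2)
    best
  }

  // Hypothetical grid and metrics, invented for this example only.
  val grid: Seq[Params] = Seq(
    Map("numFeatures" -> 10.0, "regParam" -> 0.1),
    Map("numFeatures" -> 100.0, "regParam" -> 0.1),
    Map("numFeatures" -> 100.0, "regParam" -> 0.01)
  )
  val avgMetrics: Seq[Double] = Seq(0.71, 0.93, 0.88)
}
```

With these made-up numbers, BestParamSketch.bestParams(BestParamSketch.grid, BestParamSketch.avgMetrics, largerIsBetter = true) picks the second entry, because its metric is the highest of the three.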

Answer by macfeliga

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.HashingTF

val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages

val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)

val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)

Answer by Algorithman

To print everything in paramMap, you actually don't have to call parent:

cvModel.bestModel.extractParamMap()

To answer the OP's question and get a single best parameter, for example regParam:

cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))

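Answer by Mazen Aly

This is how you get the chosen parameters when the estimator is a single model rather than a Pipeline:

// bestModel is typed as Model[_]; cast it to the concrete model class (assumed here to be LogisticRegressionModel)
val lrModel = cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
println(lrModel.getMaxIter)
println(lrModel.getRegParam)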

Answer by orangeHIX

This Java code should work: cvModel.bestModel().parent().extractParamMap(). You can translate it to Scala. The parent() method returns an estimator, from which you can then get the best params.

Answer by u6020995

This is the ParamGridBuilder():

paramGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()

There are 3 stages in the pipeline. It seems we can access the parameters as follows:

for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')

stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]

stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]

stage: LogisticRegression_451b8c8dbef84ecab7a9
[]

However, the last stage, LogisticRegression, lists no parameters.

We can also get the weights and intercept from the logistic regression stage, as follows:

cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])

Full exploration: http://kuanliang.github.io/2016-06-07-SparkML-pipeline/

Answer by François

I am working with Spark Scala 1.6.x, and here is a full example of how I can set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Instantiate a LogisticRegression object
val lr = new LogisticRegression()

// Instantiate a ParamGrid with different values for the 'regParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()

// Setting and fitting the CrossValidator on the training set, using 'MulticlassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)

// Getting the value of the 'regParam' used to get the best model
val bestModel = cvModel.bestModel                        // Getting the best model
val paramReference = bestModel.getParam("regParam")      // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.getOrDefault(paramReference)  // Getting the value of this parameter (get(...) would return it wrapped in an Option)
print(paramValue)                                        // In my case : 0.001

You can do the same for any parameter or any other type of model.

Answer by u10437407

In Java, the debugger shows that the following returns the best parameters:

bestModel.parent().extractParamMap()

Answer by Jorge M. Londoño P.

Building on @macfeliga's solution, here is a one-liner that works for pipelines:

cvModel.bestModel.asInstanceOf[PipelineModel]
    .stages.foreach(stage => println(stage.extractParamMap))

Answer by panc

This SO thread more or less answers the question.

In a nutshell, you need to cast each object to the class it is supposed to be.

For the case of CrossValidatorModel, the following is what I did:

import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)

// To get the parameters of the best model
(
    reloadedCvModel.bestModel
        .asInstanceOf[PipelineModel]
        .stages(1)
        .asInstanceOf[RandomForestRegressionModel]
        .extractParamMap()
)

In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
