
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/44459845/


GridSearchCV.best_score_ meaning when scoring set to 'accuracy' and CV

python, pandas, scikit-learn, cross-validation, grid-search

Asked by Taka

I'm trying to find the best neural network model for classifying breast cancer samples from the well-known Wisconsin Cancer dataset (569 samples, 31 features + target). I'm using sklearn 0.18.1. I'm not using normalization so far; I'll add it once this question is solved.


# some init code omitted
X_train, X_test, y_train, y_test = train_test_split(X, y)
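
For reference, here is a minimal sketch of the kind of init code omitted above (an assumption, not the asker's actual code), loading the Wisconsin dataset that ships with sklearn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# sklearn's built-in copy of the Wisconsin dataset has 569 samples and 30 features;
# the asker's file apparently carries one extra column (31 features + target)
data = load_breast_cancer()
X, y = data.data, data.target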

Define the NN params for GridSearchCV


tuned_params = [{'solver': ['sgd'], 'learning_rate': ['constant'], "learning_rate_init" : [0.001, 0.01, 0.05, 0.1]},
                {"learning_rate_init" : [0.001, 0.01, 0.05, 0.1]}]

CV method and model


cv_method = KFold(n_splits=4, shuffle=True)
model = MLPClassifier()

Apply grid


grid = GridSearchCV(estimator=model, param_grid=tuned_params, cv=cv_method, scoring='accuracy')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

And if I run:


print(grid.best_score_)
print(accuracy_score(y_test, y_pred))

The result is 0.746478873239 and 0.902097902098.


According to the doc, "best_score_ : float, Score of best_estimator on the left out data". I assume it is the best accuracy among those obtained by running the 8 different configurations specified in tuned_params, the number of times specified by KFold, on the left-out data defined by KFold. Am I right?


One more question: is there a method to find the optimal size of the test set used in train_test_split, which defaults to 0.25?


Thanks a lot



Answered by Vivek Kumar

The grid.best_score_ is the average of the scores over all CV folds for a single combination of the parameters you specify in tuned_params.

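As a quick sanity check (a sketch, not part of the original answer), this can be verified with the cv_results_ attribute described just below: the per-fold test scores for the best parameter combination, and their mean, are what grid.best_score_ reports.

i = grid.best_index_  # row of cv_results_ for the best parameter combination
# per-fold accuracies for the best params (4 folds, from KFold(n_splits=4) in the question)
print([grid.cv_results_['split%d_test_score' % k][i] for k in range(4)])
print(grid.cv_results_['mean_test_score'][i])  # this is the value reported as grid.best_score_
print(grid.best_score_)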

In order to access other relevant details about the grid searching process, you can look at the grid.cv_results_ attribute.


From the documentation of GridSearchCV:


cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, 
that can be imported into a pandas DataFrame


It contains keys like 'split0_test_score', 'split1_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'mean_train_score', etc., which give additional information about the whole execution.

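For example (a sketch, assuming the grid object from the question), cv_results_ can be loaded into a pandas DataFrame to compare all parameter combinations at a glance; the row with rank_test_score equal to 1 corresponds to grid.best_params_ and grid.best_score_:

import pandas as pd

results = pd.DataFrame(grid.cv_results_)
# one row per parameter combination, sorted so the best one comes first
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))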