Understanding python xgboost cv
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA terms, cite the original address and author information, and attribute it to the original authors (not me): StackOverflow
Original address: http://stackoverflow.com/questions/34469038/
understanding python xgboost cv
Asked by kilojoules
I would like to use the xgboost cv function to find the best parameters for my training data set. I am confused by the api. How do I find the best parameter? Is this similar to the sklearn grid_search cross-validation function? How can I find which of the options for the max_depth parameter ([2,4,6]) was determined optimal?
from sklearn.datasets import load_iris
import xgboost as xgb
iris = load_iris()
DTrain = xgb.DMatrix(iris.data, iris.target)
x_parameters = {"max_depth":[2,4,6]}
xgb.cv(x_parameters, DTrain)
...
Out[6]:
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
0 0.888435 0.059403 0.888052 0.022942
1 0.854170 0.053118 0.851958 0.017982
2 0.837200 0.046986 0.833532 0.015613
3 0.829001 0.041960 0.824270 0.014501
4 0.825132 0.038176 0.819654 0.013975
5 0.823357 0.035454 0.817363 0.013722
6 0.822580 0.033540 0.816229 0.013598
7 0.822265 0.032209 0.815667 0.013538
8 0.822158 0.031287 0.815390 0.013508
9 0.822140 0.030647 0.815252 0.013494
Accepted answer by Aske Doerge
Cross-validation is used for estimating the performance of one set of parameters on unseen data.
Grid-search evaluates a model with varying parameters to find the best possible combination of these.
The sklearn docs talk a lot about CV, and they can be used in combination, but they each have very different purposes.
You might be able to fit xgboost into sklearn's grid search functionality. Check out the sklearn interface to xgboost for the smoothest integration.
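To make the distinction concrete, here is a minimal sketch (not part of the original answer) of how you could compare the max_depth values from the question with xgb.cv itself: each cv call evaluates exactly one fixed parameter set, so the comparison loop over candidate values is yours to write. The nfold and num_boost_round values are arbitrary choices for illustration.

from sklearn.datasets import load_iris
import xgboost as xgb

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, iris.target)

# xgb.cv scores ONE fixed parameter set per call, so loop over the candidates yourself
results = {}
for depth in [2, 4, 6]:
    params = {"max_depth": depth}
    cv_res = xgb.cv(params, dtrain, num_boost_round=10, nfold=3)
    results[depth] = cv_res["test-rmse-mean"].iloc[-1]   # cross-validated rmse of the last round

best_depth = min(results, key=results.get)               # lower rmse is better
print(results, best_depth)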
Answered by Deepish
Sklearn GridSearchCV should be the way to go if you are looking for parameter tuning. You just need to pass the xgb classifier to GridSearchCV and evaluate on the best CV score.
Here is a nice tutorial which might help you get started with parameter tuning: http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Answered by Rohit
You can use GridSearchCV with xgboost through the xgboost sklearn API.
Define your classifier as follows:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older sklearn versions

xgb_model = XGBClassifier(**other_params)   # other_params: dict of any fixed parameters you want to keep
test_params = {
    'max_depth': [4, 8, 12]                 # the grid of values to search over
}
model = GridSearchCV(estimator=xgb_model, param_grid=test_params)
model.fit(train, target)                    # train/target: your feature matrix and labels
print(model.best_params_)
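Once fitted, the grid search object also exposes the best cross-validated score and a refit estimator. A short follow-up sketch, assuming the model fitted above and a hypothetical hold-out feature matrix X_test:

print(model.best_score_)                        # mean cross-validated score of the best parameter set
preds = model.best_estimator_.predict(X_test)   # X_test is a hypothetical hold-out set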
Answered by Eran Moshe
I would go with hyperOpt
https://github.com/hyperopt/hyperopt
It is open source and has worked great for me. If you do choose this and need help, I can elaborate.
When you ask to look over "max_depth": [2,4,6], you can naively solve this by running 3 models, each one with a max depth you want, and see which model yields better results.
But "max_depth" is not the only hyper parameter you should consider tune. There are a lot of other hyper parameters, such as: eta (learning rate), gamma, min_child_weight, subsample
and so on. Some are continues and some are discrete. (assuming you know your objective functions and evaluation metrics)
You can read about all of them here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
When you look at all those "parameters" and the size of the search space they create, it is huge. You cannot search it by hand (nor can an "expert" give you the best values for them).
Therefore, hyperOpt gives you a neat solution to this and builds you a search space which is neither purely random nor a grid. All you need to do is define the parameters and their ranges.
You can find a code example here: https://github.com/bamine/Kaggle-stuff/blob/master/otto/hyperopt_xgboost.py
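For a rough sense of what that looks like, here is a minimal sketch (not from the linked example). It assumes a DMatrix named dtrain, uses the cross-validated rmse from xgb.cv as the loss, and picks the ranges and max_evals arbitrarily for illustration.

import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    params["max_depth"] = int(params["max_depth"])           # quniform returns floats
    cv_res = xgb.cv(params, dtrain, num_boost_round=50, nfold=3)
    loss = cv_res["test-rmse-mean"].iloc[-1]                  # minimize cross-validated rmse
    return {"loss": loss, "status": STATUS_OK}

space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),          # discrete hyperparameter
    "eta": hp.loguniform("eta", -5, 0),                       # continuous learning rate, roughly 0.007 to 1
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)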
I can tell you from my own experience that it worked better than Bayesian Optimization on my models. Give it a few hours/days of trial and error, and contact me if you encounter issues you cannot solve.
Good luck!