How can I implement incremental training for xgboost in Python?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/38079853/
How can I implement incremental training for xgboost?
Asked by Marat Zakirov
The problem is that my training data cannot fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, computes the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2)
in some loop, it will not help, because in that case the whole model is simply rebuilt for each batch.
Answered by Alain
Disclaimer: I'm new to xgboost as well, but I think I figured this out.
Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
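A minimal sketch of that pattern (a sketch only: dtrain_batch1 and dtrain_batch2 stand for hypothetical DMatrix objects built from two successive batches of data):

import xgboost as xgb

params = {'objective': 'reg:linear'}
bst = xgb.train(params, dtrain_batch1, num_boost_round=10)
bst.save_model('model_1.model')  # persist the booster after the first batch
bst = xgb.train(params, dtrain_batch2, num_boost_round=10,
                xgb_model='model_1.model')  # continue boosting from the saved model

xgb_model accepts either a path to a saved model or a Booster object, so the handoff can also be done in-memory.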
Here's a small experiment that I ran to convince myself that it works:
First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model on the first half and get a score that will serve as a benchmark. Then fit two models on the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter made no difference, we would expect their scores to be similar. But, fortunately, the new model seems to perform much better than the first.
import xgboost as xgb
from sklearn.cross_validation import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
X = load_boston()['data']
y = load_boston()['target']
# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
y_train,
test_size=0.5,
random_state=0)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')
print(mse(model_1.predict(xg_test), y_test)) # benchmark
print(mse(model_2_v1.predict(xg_test), y_test)) # "before"
print(mse(model_2_v2.predict(xg_test), y_test)) # "after"
# 23.0475232194
# 39.6776876084
# 27.2053239482
Let me know if anything is unclear!
reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
Answered by paulperry
There is now (version 0.6?) a process_type parameter that might help. Here's an experiment with it:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target
X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)
# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # this looks silly
    pass
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]
xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
params.update({'process_type': 'update',
'updater' : 'refresh',
'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test)) # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test)) # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test)) # "after"
Output:
full train 17.8364309709
model 1 24.2542132108
model 2 25.6967017352
model 1+2 22.8846455135
model 1+update2 14.2816257268
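A note based on the xgboost documentation (not on the answer itself): with process_type='update' and updater='refresh', no new trees are added; the booster keeps the trees it inherits from xgb_model and only re-computes their statistics and leaf values on the new data, which is why num_boost_round is again 30, the number of trees in model_1. A quick, hypothetical sanity check on the boosters from the experiment above (Booster.get_dump returns one string per tree):

# expected tree counts; this check is not part of the original answer
print(len(model_1.get_dump()))             # 30 trees, trained on the first half
print(len(model_2_v2.get_dump()))          # 60 trees: the 30 from model_1 plus 30 new ones
print(len(model_2_v2_update.get_dump()))   # still 30 trees, with refreshed statistics/leaf values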
Answered by Shubham Chaudhary
I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the Boston dataset to train the model. I did 3 experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In incremental training, I passed the Boston data to the model in batches of size 50.
The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.
Here is the corresponding code for doing iterative incremental learning with xgboost.
batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'update': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)
        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))
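The snippet assumes that the imports and x_tr, y_tr, x_te, y_te already exist. A minimal setup consistent with the Boston-dataset description above might look like the following (the 25% test split and random_state are assumptions, not taken from the gist):

import sklearn.metrics
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
# hold out a test set once; the batched training loop above only ever sees x_tr / y_tr
x_tr, x_te, y_tr, y_te = train_test_split(boston.data, boston.target,
                                          test_size=0.25, random_state=0)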
XGBoost version: 0.6
Answered by Mobigital
It looks like you don't need anything other than calling your xgb.train(....) again, but providing the model result from the previous batch:
# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1
This is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html
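getBatchData is a placeholder in the snippet above. One possible, purely hypothetical implementation that keeps RAM usage low by reading the training data from a CSV on disk in chunks with pandas:

import pandas as pd
import xgboost as xgb

def getBatchData(ith_batch, path='train.csv', batch_size=100000):
    # read only the rows of this batch; the file path, batch size and the
    # 'target' column name are assumptions for illustration
    chunk = pd.read_csv(path,
                        skiprows=range(1, 1 + ith_batch * batch_size),  # keep the header row
                        nrows=batch_size)
    return xgb.DMatrix(chunk.drop(columns=['target']), label=chunk['target'])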
Answered by Alberto Castelo Becerra
If your problem is regarding the dataset size and you do not really need incremental learning (you are not dealing with a streaming app, for instance), then you should check out Spark or Flink.
These two frameworks can train on very large datasets with a small amount of RAM by leveraging disk storage. Both frameworks deal with memory issues internally. While Flink solved it first, Spark has caught up in recent releases.
Take a look at:
- "XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow": http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
- Spark Integration: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
Answered by Tao Cheng
Regarding paulperry's code: if one line is changed from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50", the "model 1+update2" score changes from 14.2816257268 to 45.60806270012028, and a lot of "leaf=0" entries appear in the dump file.
The updated model is not good when the update sample set is relatively small. For binary:logistic, the updated model is unusable when the update sample set contains only one class.
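To see the "leaf=0" values mentioned above, the refreshed booster can be written to a text dump (a sketch reusing model_2_v2_update from paulperry's experiment above):

# write every tree in text form; Tao Cheng reports many leaf=0 entries when the update set is small
model_2_v2_update.dump_model('dump.txt', with_stats=True)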