Original question: http://stackoverflow.com/questions/50486593/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Using sample weights for training xgboost (0.7) classifier
Asked by Ernie Halberg
I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.
However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.
Basically the fitting of the model does not seem to take the sample_weight parameter into account: it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data; we have constructed a very good model from it with sample weights in another tool, getting a good Gini.
The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout, we can see that the model objects are identical whether a weight/sample_weight is provided or not.
I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:
XGBClassifier.fit()
XGBClassifier.train()
Xgboost()
XGB.fit()
XGB.train()
Dmatrix()
XGBGridSearchCV()
I have also tried passing fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Toy dataset: GB_FLAG is the label, sample_weight the per-row weight, f1-f5 the features
df = pd.DataFrame(columns =
['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
df.loc[0] = [0,1,2046,10,625,8000,2072]
df.loc[1] = [0,0.86836,8000,10,705,8800,28]
df.loc[2] = [1,1,2303.62,19,674,3000,848]
df.loc[3] = [0,0,2754.8,2,570,16300,46]
df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
df.loc[5] = [1,0,1050.83,19,715,3000,-5]
df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
df.loc[12] = [0,0,251.81,17,806,3800,0]
df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
df.loc[14] = [0,1,1504.58,15,621,6800,461]
df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
df.loc[18] = [0,1,2692.96,22,651,12800,2035]
y = np.asarray(df['GB_FLAG'])
X = np.asarray(df.drop(['GB_FLAG'], axis=1))
X_traintest, X_valid, y_traintest, y_valid = train_test_split(X, y,
    train_size=0.7, stratify=y, random_state=1337)
# The first column of X holds the per-row sample weight; split it off
# so it is not used as a feature.
traintest_sample_weight = X_traintest[:,0]
valid_sample_weight = X_valid[:,0]
X_traintest = X_traintest[:,1:]
X_valid = X_valid[:,1:]
model = XGBClassifier()
eval_set = [(X_valid, y_valid)]
model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
          early_stopping_rounds=50, verbose=True,
          sample_weight=traintest_sample_weight)
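For reference, a minimal sketch of the "identical model objects" check mentioned above (model_a/model_b are illustrative names; get_booster() and random_state assume xgboost >= 0.7): fit the classifier twice with the same seed, with and without weights, and compare the dumped trees.
# Sketch: does sample_weight change the fitted trees at all?
model_a = XGBClassifier(random_state=1337)
model_a.fit(X_traintest, y_traintest)
model_b = XGBClassifier(random_state=1337)
model_b.fit(X_traintest, y_traintest, sample_weight=traintest_sample_weight)
# True means the weights had no effect on training
print(model_a.get_booster().get_dump() == model_b.get_booster().get_dump())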
How do I use sample weights when using xgboost for modeling?
Answered by Mischa Lisovyi
The problem is that, for evaluation datasets, weights are not propagated by the sklearn API.
So you seem to be doomed to use the native API. Just replace the lines starting with your model definition by the following code:
from xgboost import train, DMatrix
# Weights are attached directly to the DMatrix, for the training and the
# evaluation set alike.
trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)
booster = train({'eval_metric': 'auc'}, trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
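Prediction then also goes through the native API. A minimal sketch using the variables above (best_ntree_limit is set by xgboost 0.7x when early stopping is used; the outputs are raw model scores unless a binary objective is added to the params):
# Predict with the natively trained booster, honoring the early-stopping cutoff
valid_pred = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)
print(valid_pred[:5])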
UPD: The xgboost community is aware of it and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, this was never propagated to v0.71 for some reason.
UPD2: After pinging that issue, the relevant code update has been revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
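Since the fix shipped only with 0.72, it is worth confirming which release is installed before relying on eval-set weights in the sklearn API; a trivial check:
import xgboost
print(xgboost.__version__)  # eval-set weights in the sklearn API need >= 0.72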