Note: this content is taken from StackOverflow and is provided under the CC BY-SA 4.0 license; if you use or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/50486593/

Using sample weights for training xgboost (0.7) classifier

python, pandas, xgboost, sample

Asked by Ernie Halberg

I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.

However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.

Basically the fitting of the model does not seem to take the sample_weight parameter into account – it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data – we have constructed a very good model with sample weights using another tool, obtaining a good Gini.

The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout, we can see that the model objects are identical whether a weight/sample_weight is provided or not (see the comparison sketch after the code below).

I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:

XGBClassifier.fit()
XGBClassifier.train()
Xgboost()
XGB.fit()
XGB.train()
Dmatrix()
XGBGridSearchCV()

I have also tried fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.

Code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.DataFrame(columns = 
['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
df.loc[0] = [0,1,2046,10,625,8000,2072]
df.loc[1] = [0,0.86836,8000,10,705,8800,28]
df.loc[2] = [1,1,2303.62,19,674,3000,848]
df.loc[3] = [0,0,2754.8,2,570,16300,46]
df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
df.loc[5] = [1,0,1050.83,19,715,3000,-5]
df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
df.loc[12] = [0,0,251.81,17,806,3800,0]
df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
df.loc[14] = [0,1,1504.58,15,621,6800,461]
df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
df.loc[18] = [0,1,2692.96,22,651,12800,2035]

y = np.asarray(df['GB_FLAG'])
X = np.asarray(df.drop(['GB_FLAG'], axis=1))

X_traintest, X_valid, y_traintest, y_valid = train_test_split(X, y, 
train_size=0.7, stratify=y, random_state=1337)
traintest_sample_weight = X_traintest[:,0]
valid_sample_weight = X_valid[:,0]

X_traintest = X_traintest[:,1:]
X_valid = X_valid[:,1:]

model = XGBClassifier()
eval_set = [(X_valid, y_valid)]
model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
          early_stopping_rounds=50, verbose=True,
          sample_weight=traintest_sample_weight)
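
To illustrate the check described above, the model can be fitted twice under identical settings, with and without weights, and the dumped trees compared. This is a sketch of my own that continues the snippet above; it assumes an xgboost version where XGBClassifier.get_booster() is available, and that training is deterministic (true for the default exact tree method with no subsampling):

model_weighted = XGBClassifier()
model_weighted.fit(X_traintest, y_traintest, eval_set=eval_set,
                   eval_metric="auc", early_stopping_rounds=50, verbose=False,
                   sample_weight=traintest_sample_weight)

model_unweighted = XGBClassifier()
model_unweighted.fit(X_traintest, y_traintest, eval_set=eval_set,
                     eval_metric="auc", early_stopping_rounds=50, verbose=False)

# identical tree dumps mean sample_weight had no effect on the fitted model
print(model_weighted.get_booster().get_dump() ==
      model_unweighted.get_booster().get_dump())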

How do I use sample weights when using xgboost for modeling?

Answered by Mischa Lisovyi

The problem is that weights for the evaluation datasets are not propagated by the sklearn API.

So you seem to be doomed to use the native API. Just replace the lines starting with your model definition with the following code:

from xgboost import train, DMatrix

# DMatrix carries per-row weights for the training and evaluation sets alike
trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

# the native train() respects the weights of every DMatrix passed via evals
booster = train({'eval_metric': 'auc'}, trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
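
When early stopping fires, the returned booster records the best round, which is what you would normally use at prediction time. A short usage sketch of my own, using the 0.7-era attributes (best_score, best_iteration, and best_ntree_limit are only set when early stopping actually triggers; ntree_limit was superseded by iteration_range in much later releases):

# predict on the weighted validation set using only the best number of trees
valid_preds = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)
print('best round:', booster.best_iteration, 'best weighted AUC:', booster.best_score)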

UPD: The xgboost community is aware of it and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, this was never propagated to v0.71 for some reason.

UPD2: After pinging that issue, the relevant code update was revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
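
For completeness, on a release that contains that PR, the sklearn wrapper should accept evaluation-set weights directly. A minimal sketch, assuming the sample_weight_eval_set parameter that the merged PR introduced (one weight array per (X, y) pair in eval_set):

model = XGBClassifier()
model.fit(X_traintest, y_traintest,
          sample_weight=traintest_sample_weight,
          eval_set=[(X_valid, y_valid)],
          sample_weight_eval_set=[valid_sample_weight],
          eval_metric="auc", early_stopping_rounds=50)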
