Python: How to get feature importance in xgboost?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/37627923/


How to get feature importance in xgboost?

python, xgboost

Asked by modkzs

I'm using xgboost to build a model and trying to find the importance of each feature using get_fscore(), but it returns {}.

and my training code is:

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)                 # X: feature matrix, Y: labels
watchlist = [(dtrain, 'train')]                  # evaluation set monitored during training
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

So is there any mistake in my training code? How can I get feature importance in xgboost?

Answered by MLKing

In your code, you can get the feature importance for each feature in dict form:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: the train() API's get_score() method is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str, optional) – the name of the feature map file.
  • importance_type (str, default 'weight') – one of:
    • 'weight' - the number of times a feature is used to split the data across all trees.
    • 'gain' - the average gain across all splits the feature is used in.
    • 'cover' - the average coverage across all splits the feature is used in.
    • 'total_gain' - the total gain across all splits the feature is used in.
    • 'total_cover' - the total coverage across all splits the feature is used in.
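
As a quick illustration, here is a minimal sketch (using the bst booster from the question's code) that prints each importance type; note that 'total_gain' and 'total_cover' require a reasonably recent xgboost:

for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    # each call returns a dict mapping feature names to scores
    print(imp_type, bst.get_score(importance_type=imp_type))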

https://xgboost.readthedocs.io/en/latest/python/python_api.html

Answered by Sesquipedalism

Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
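
A minimal end-to-end sketch of that (the toy data and column names here are made up for illustration):

import pandas as pd
import xgboost as xgb

# Hypothetical toy data; the feature names come from the DataFrame columns.
X = pd.DataFrame({'age': [23, 45, 31, 50, 29, 60, 41, 33],
                  'income': [40, 80, 60, 120, 45, 150, 90, 55]})
y = [0, 1, 0, 1, 0, 1, 1, 0]

clf = xgb.XGBClassifier(n_estimators=10, max_depth=2)
clf.fit(X, y)  # X is a DataFrame, so the booster keeps the column names
print(clf.get_booster().get_score(importance_type="gain"))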

Answered by Roozbeh

For feature importance, try this:

Classification:

pd.DataFrame(list(bst.get_fscore().items()), columns=['feature', 'importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)

Answered by koalagreener

Try this:

fscore = clf.best_estimator_.booster().get_fscore()  # on newer xgboost versions, use .get_booster() instead of .booster()

Answered by BCR

For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert the data to dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.

fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)  # alg: an XGBRegressor; dtrain: a DataFrame here
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting the regressor, fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.
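
A quick way to pair those weights with the column names under that assumption:

# pair each weight with its column name (assumes the order matches ft_cols)
for name, weight in zip(ft_cols, fit.feature_importances_):
    print(name, weight)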

My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and scikit-learn 0.18.1.

Answered by Kirill Dolmatov

I'm not sure how to get the values directly, but there is a good way to plot feature importance:

import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

Answered by Steven Hu

Build the model with XGBoost first:

from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)

model.feature_importances_ is an array, so we can sort it in descending order:

import numpy as np
sorted_idx = np.argsort(model.feature_importances_)[::-1]  # indices from most to least important

Then print the sorted importances together with the column names (assuming the data was loaded with pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 


Furthermore, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot

plot_importance(model, max_num_features=15)
pyplot.show()

Use max_num_features in plot_importance to limit the number of features if you want.

Answered by Catbuilts

Get the table containing scores and feature names, and then plot it:

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.plot(kind='barh')

For example, this produces a horizontal bar chart of the scores per feature.


Answered by Ashish Barvaliya

import matplotlib.pyplot as plt

print(model.feature_importances_)  # array of importances from the fitted sklearn wrapper
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

Answered by Nicolás Fornasari

In case you are using XGBRegressor, try with: model.get_booster().get_score().

That returns results that you can visualize directly with the plot_importance command.
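
A short sketch putting the two together (assuming model is a fitted XGBRegressor and matplotlib is available):

import matplotlib.pyplot as plt
import xgboost as xgb

scores = model.get_booster().get_score(importance_type='weight')
print(scores)               # dict: feature name -> importance score
xgb.plot_importance(model)  # plots the same importances as a bar chart
plt.show()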