Python: How to get feature importance in xgboost?
Note: the content below is a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/37627923/
How to get feature importance in xgboost?
Asked by modkzs
I'm using xgboost to build a model and am trying to find the importance of each feature using get_fscore(), but it returns {}. My training code is:
dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there any mistake in my training code? How do I get feature importance in xgboost?
Answered by MLKing
In your code you can get the feature importance for each feature in dict form:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: the train() API's get_score() method is defined as:
get_score(fmap='', importance_type='weight')
- fmap (str, optional) – The name of the feature map file.
- importance_type
- 'weight' - the number of times a feature is used to split the data across all trees.
- 'gain' - the average gain across all splits the feature is used in.
- 'cover' - the average coverage across all splits the feature is used in.
- 'total_gain' - the total gain across all splits the feature is used in.
- 'total_cover' - the total coverage across all splits the feature is used in.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
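If you want to compare the importance types side by side, a minimal sketch (assuming bst is the Booster trained in the question and pandas is available) could be:
import pandas as pd

# Collect every importance type into one table; features never used in a split
# are omitted by get_score(), hence the fillna(0).
importance_types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
scores = {t: bst.get_score(importance_type=t) for t in importance_types}
print(pd.DataFrame(scores).fillna(0).sort_values('gain', ascending=False))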
Answered by Sesquipedalism
Using the sklearn API and XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
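A minimal, self-contained sketch of this (the data and column names below are made up purely for illustration) might look like:
import pandas as pd
from xgboost import XGBClassifier

# Hypothetical data; the column names are only for illustration.
X = pd.DataFrame({'age': [25, 32, 47, 51, 62, 23], 'income': [40, 60, 80, 120, 95, 30]})
y = [0, 0, 1, 1, 1, 0]

clf = XGBClassifier(n_estimators=20, max_depth=2)
clf.fit(X, y)

# Because X is a DataFrame, the scores are keyed by the column names.
print(clf.get_booster().get_score(importance_type='gain'))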
Answered by Roozbeh
For feature importance, try this:
Classification:
pd.DataFrame(bst.get_fscore().items(), columns=['feature','importance']).sort_values('importance', ascending=False)
Regression:
xgb.plot_importance(bst)
Answered by koalagreener
Try this:
# clf is presumably a fitted search object such as GridSearchCV; newer xgboost versions use get_booster() instead of booster()
fscore = clf.best_estimator_.booster().get_fscore()
Answered by BCR
For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert the data to dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.
# alg is an XGBRegressor; dtrain here is a pandas DataFrame whose target is in column 'y', and ft_cols is the list of feature column names
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)
After fitting the regressor, fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.
My current setup is Ubuntu 16.04, Anaconda distro, python 3.6, xgboost 0.6, and sklearn 18.1.
Answered by Kirill Dolmatov
I'm not sure how to get the values directly, but here is a good way to plot feature importance:
import xgboost as xgb
import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)  # params, d_train, watchlist as defined earlier
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
Answered by Steven Hu
Build the model with XGBoost first:
from xgboost import XGBClassifier, plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(train, label)
model.feature_importances_ is an array, so we can sort its indices in descending order:
sorted_idx = np.argsort(model.feature_importances_)[::-1]
Then, it is time to print all the sorted importances and the column names together as lists (I assume the data was loaded with Pandas):
for index in sorted_idx:
print([train.columns[index], model.feature_importances_[index]])
Furthermore, we can plot the importances with XGBoost's built-in function:
plot_importance(model, max_num_features = 15)
pyplot.show()
Use max_num_features in plot_importance to limit the number of features if you want.
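Put together, a self-contained sketch of the steps above (with a small made-up DataFrame; the column names are illustrative only) could look like this:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier, plot_importance
from matplotlib import pyplot

# Hypothetical training data, only for illustration.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, 4)), columns=['f0', 'f1', 'f2', 'f3'])
label = (train['f0'] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = XGBClassifier()
model.fit(train, label)

# Print feature names and importances, highest first.
sorted_idx = np.argsort(model.feature_importances_)[::-1]
for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]])

plot_importance(model, max_num_features=15)
pyplot.show()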
Answered by Catbuilts
Get the table containing scores and feature names, and then plot it:
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.plot(kind='barh')
For example: [a horizontal bar chart of the feature scores was shown here]
Answered by Ashish Barvaliya
import matplotlib.pyplot as plt  # model is a fitted sklearn-style XGBoost estimator
print(model.feature_importances_)
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()
Answered by Nicolás Fornasari
In case you are using XGBRegressor, try with: model.get_booster().get_score().
That returns the results that you can directly visualize through the plot_importance command.
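For completeness, a small sketch of this regressor variant (again with made-up data and names) might be:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb

# Hypothetical regression data; names are only for illustration.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = 2 * X['a'] + 0.1 * rng.normal(size=100)

model = xgb.XGBRegressor(n_estimators=50)
model.fit(X, y)

print(model.get_booster().get_score())  # weight-based scores keyed by column name
xgb.plot_importance(model)              # the same information as a bar chart
plt.show()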