Python: How to get feature importance in xgboost?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/37627923/


How to get feature importance in xgboost?

python, xgboost

Asked by modkzs

I'm using xgboost to build a model and trying to find the importance of each feature using get_fscore(), but it returns {}.

and my training code is:

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)                 # X: feature matrix, Y: labels
watchlist = [(dtrain, 'train')]                  # evaluation set monitored during training
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

So is there any mistake in my training code? How can I get feature importance in xgboost?

Answered by MLKing

In your code, you can get the feature importance for each feature in dict form:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: the train() API's get_score() method is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str, optional) – the name of the feature map file.
  • importance_type (str, default 'weight') – one of:
    • 'weight' - the number of times a feature is used to split the data across all trees.
    • 'gain' - the average gain across all splits the feature is used in.
    • 'cover' - the average coverage across all splits the feature is used in.
    • 'total_gain' - the total gain across all splits the feature is used in.
    • 'total_cover' - the total coverage across all splits the feature is used in.
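
As a quick illustration, here is a minimal sketch (using the bst booster from the question's code) that prints each importance type; note that 'total_gain' and 'total_cover' require a reasonably recent xgboost:

for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    # each call returns a dict mapping feature names to scores
    print(imp_type, bst.get_score(importance_type=imp_type))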

https://xgboost.readthedocs.io/en/latest/python/python_api.html

Answered by Sesquipedalism

Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
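
A minimal end-to-end sketch of that (the toy data and column names here are made up for illustration):

import pandas as pd
import xgboost as xgb

# Hypothetical toy data; the feature names come from the DataFrame columns.
X = pd.DataFrame({'age': [23, 45, 31, 50, 29, 60, 41, 33],
                  'income': [40, 80, 60, 120, 45, 150, 90, 55]})
y = [0, 1, 0, 1, 0, 1, 1, 0]

clf = xgb.XGBClassifier(n_estimators=10, max_depth=2)
clf.fit(X, y)  # X is a DataFrame, so the booster keeps the column names
print(clf.get_booster().get_score(importance_type="gain"))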

Answered by Roozbeh

For feature importance, try this:

Classification:

pd.DataFrame(list(bst.get_fscore().items()), columns=['feature', 'importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)

Answered by koalagreener

Try this:

fscore = clf.best_estimator_.booster().get_fscore()  # on newer xgboost versions, use .get_booster() instead of .booster()

Answered by BCR

For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert the data to dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.

fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)  # alg: an XGBRegressor; dtrain: a DataFrame here
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting the regressor, fit.feature_importances_ returns an array of weights which I'm assuming is in the same order as the feature columns of the pandas dataframe.
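
A quick way to pair those weights with the column names under that assumption:

# pair each weight with its column name (assumes the order matches ft_cols)
for name, weight in zip(ft_cols, fit.feature_importances_):
    print(name, weight)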

My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and scikit-learn 0.18.1.

Answered by Kirill Dolmatov

I'm not sure how to get the values directly, but there is a good way to plot feature importance:

import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

Answered by Steven Hu

Build the model with XGBoost first:

from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)

model.feature_importances_ is an array, so we can sort it in descending order:

import numpy as np
sorted_idx = np.argsort(model.feature_importances_)[::-1]  # indices from most to least important

Then print the sorted importances together with the column names (assuming the data was loaded with pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 


Furthermore, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot

plot_importance(model, max_num_features=15)
pyplot.show()

Use max_num_features in plot_importance to limit the number of features if you want.

Answered by Catbuilts

Get the table containing scores and feature names, and then plot it:

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.plot(kind='barh')

For example, this produces a horizontal bar chart of the scores per feature.


Answered by Ashish Barvaliya

import matplotlib.pyplot as plt

print(model.feature_importances_)  # array of importances from the fitted sklearn wrapper
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

Answered by Nicolás Fornasari

In case you are using XGBRegressor, try with: model.get_booster().get_score().

That returns results that you can visualize directly with the plot_importance command.
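
A short sketch putting the two together (assuming model is a fitted XGBRegressor and matplotlib is available):

import matplotlib.pyplot as plt
import xgboost as xgb

scores = model.get_booster().get_score(importance_type='weight')
print(scores)               # dict: feature name -> importance score
xgb.plot_importance(model)  # plots the same importances as a bar chart
plt.show()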