Random Forest Feature Importance Chart using Python

Question

Asked by user348547

I am working with RandomForestRegressor in Python and I want to create a chart that illustrates the ranking of feature importances. This is the code I used:

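import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error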

MT= pd.read_csv("MT_reduced.csv") 
df = MT.reset_index(drop = False)

columns2 = df.columns.tolist()

# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in["Violent_crime_rate","Change_Property_crime_rate","State","Year"]]

# Store the variable we'll be predicting on.
target = "Property_crime_rate"

# Let's randomly split our data with 80% as the train set and 20% as the test set:

# Generate the training set.  Set random_state to be able to replicate results.
train2 = df.sample(frac=0.8, random_state=1)

#exclude all obs with matching index
test2 = df.loc[~df.index.isin(train2.index)]

print(train2.shape) #need to have same number of features only difference should be obs
print(test2.shape)

# Initialize the model with some parameters.

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=8, random_state=1)

# n_estimators = number of trees in the forest
# min_samples_leaf = minimum number of samples required at each leaf


# Fit the model to the data.
model.fit(train2[columns2], train2[target])
# Make predictions.
predictions_rf = model.predict(test2[columns2])
# Compute the error.
mean_squared_error(test2[target], predictions_rf)  # 650.4928

Feature Importance


features=df.columns[[3,4,6,8,9,10]]
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')

This feature importance code was altered from an example found on http://www.agcross.com/2015/02/random-forests-in-python-with-scikit-learn/


I receive the following error when I attempt to replicate the code with my data:


  IndexError: index 6 is out of bounds for axis 1 with size 6

Also, only one feature shows up on my chart, with 100% importance and no labels.

Any help solving this issue so I can create this chart will be greatly appreciated.


Answer 1

Answered by spies006

Here is an example using the iris data set.

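>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier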
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, "=", importance)

sepal length (cm) = 0.112492250999
sepal width (cm) = 0.0231192882825
petal length (cm) = 0.441030464364
petal width (cm) = 0.423357996355

Plotting feature importance


>>> features = iris['feature_names']
>>> importances = rnd_clf.feature_importances_
>>> indices = np.argsort(importances)

>>> plt.title('Feature Importances')
>>> plt.barh(range(len(indices)), importances[indices], color='b', align='center')
>>> plt.yticks(range(len(indices)), [features[i] for i in indices])
>>> plt.xlabel('Relative Importance')
>>> plt.show()

Answer 2

Answered by fordy

Load the feature importances into a pandas Series indexed by your column names, then use its plot method. For example, for an sklearn RF classifier/regressor called model, trained using df (the index must list the same columns, in the same order, that the model was fitted on):

feat_importances = pd.Series(model.feature_importances_, index=df.columns)
feat_importances.nlargest(4).plot(kind='barh')
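
If df also holds the target or other columns that were not used for fitting, df.columns will not line up with model.feature_importances_. The sketch below shows one way to keep the two in sync by reusing the same column list for fitting and for the Series index. The synthetic data, the column names, and the feature_cols variable are assumptions made for illustration; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["target"] = 3 * df["a"] + df["b"] + rng.normal(scale=0.1, size=200)

# Use one list of feature columns for both fitting and labelling.
feature_cols = [c for c in df.columns if c != "target"]
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df[feature_cols], df["target"])

# Index the importances by the columns the model was actually fitted on.
feat_importances = pd.Series(model.feature_importances_, index=feature_cols)

# Sort ascending so the most important feature ends up at the top of the chart.
ax = feat_importances.sort_values().plot(kind="barh")
ax.set_xlabel("Relative Importance")
plt.tight_layout()
plt.savefig("feature_importances.png")
```

Remove the matplotlib.use("Agg") line and call plt.show() instead of plt.savefig(...) when working interactively.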

Answer 3

Answered by seralouk

A bar plot is a useful way to visualize the importance of the features.

Use this (example using the Iris dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create random forest classifier object
clf = RandomForestClassifier(random_state=0, n_jobs=-1)
# Train model
model = clf.fit(X, y)

# Calculate feature importances
importances = model.feature_importances_
# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]

# Barplot: Add bars
plt.bar(range(X.shape[1]), importances[indices])
# Add feature names as x-axis labels
plt.xticks(range(X.shape[1]), names, rotation=20, fontsize = 8)
# Create plot title
plt.title("Feature Importance")
# Show plot
plt.show()

Answer 4

Answered by Kuang Liang

The y-tick labels are not correct. The fix is:

plt.yticks(range(len(indices)), [features[i] for i in indices])
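
The same idea carries over when the model is fitted on DataFrame columns, as in the question: the labels should come from the exact list of columns passed to fit, so that labels and importances have the same length and order. A minimal sketch follows, assuming synthetic data and hypothetical column names (x0 to x4, y) rather than the asker's dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 5)),
                  columns=["x0", "x1", "x2", "x3", "x4"])
df["y"] = 2 * df["x1"] - df["x3"] + rng.normal(scale=0.1, size=200)

# The list of columns used for fitting is also the source of the labels.
columns = [c for c in df.columns if c != "y"]
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=8, random_state=1)
model.fit(df[columns], df["y"])

importances = model.feature_importances_
indices = np.argsort(importances)        # ascending order of importance
labels = [columns[i] for i in indices]   # labels reordered to match the bars

fig, ax = plt.subplots()
ax.barh(range(len(indices)), importances[indices], color="b", align="center")
ax.set_yticks(range(len(indices)))
ax.set_yticklabels(labels)
ax.set_xlabel("Relative Importance")
ax.set_title("Feature Importances")
fig.tight_layout()
fig.savefig("rf_feature_importances.png")
```

Because labels is built from columns, an index error like the one in the question cannot occur as long as the model was fitted on df[columns].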

Answer 5

Answered by Miguel Gutierrez

This line from spies006 doesn't work: plt.yticks(range(len(indices)), features[indices]). You have to change it to plt.yticks(range(len(indices)), features.columns[indices]) (this assumes features is a DataFrame).

Answer 6

Answered by Aanish

In the above code from spies006, "feature_names" didn't work for me. A generic solution is to use name_of_the_dataframe.columns.
