Original question: http://stackoverflow.com/questions/49345578/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to decide threshold value in SelectFromModel() for selecting features?
Asked by stone rock
I am using a random forest classifier for feature selection. I have 70 features in all and I want to select the most important ones. The code below fits the classifier and prints the features from most to least important.
Code:
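import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of `data` except the first
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
clf.fit(X_train, y_train)
# Rank the features from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))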
Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my dataset?
from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)
# Train the selector
sfm.fit(X_train, y_train)
When I try threshold=0.15 and then try to train my model, I get an error saying the data is too noisy or the selection is too strict.
But if I use threshold=0.015, I am able to train my model on the newly selected features. So how can I decide this threshold value?
Accepted answer by MaxU
I would try the following approach:
- start with a low threshold, for example: 1e-4
- reduce your features using SelectFromModel fit & transform
- compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features
- increase the threshold and repeat the reduce-and-evaluate steps
Using this approach you can estimate the best threshold for your particular data and your estimator.
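The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's own code: the dataset is a synthetic stand-in generated with make_classification (70 features, 10 of them informative), the threshold grid is arbitrary, and the names X_val, results and best_threshold are made up for the example. Accuracy on a held-out validation split is used as the metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: synthetic data standing in for the asker's 70-feature dataset.
X, y = make_classification(n_samples=300, n_features=70, n_informative=10,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Assumption: an arbitrary grid of thresholds, from permissive to strict.
thresholds = [1e-4, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2]

results = []  # (threshold, number of selected features, validation accuracy)
for threshold in thresholds:
    # Fit the selector; it fits its own copy of the forest internally.
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=threshold,
    )
    sfm.fit(X_train, y_train)

    n_selected = int(sfm.get_support().sum())
    if n_selected == 0:
        # Threshold is too strict: no feature survives, so stop here.
        break

    # Reduce both splits to the selected features.
    X_train_sel = sfm.transform(X_train)
    X_val_sel = sfm.transform(X_val)

    # Train a fresh classifier on the reduced data and score it.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train_sel, y_train)
    results.append((threshold, n_selected, clf.score(X_val_sel, y_val)))

# Pick the threshold with the best validation accuracy.
best_threshold, best_n, best_score = max(results, key=lambda r: r[2])
print("best threshold:", best_threshold, "features kept:", best_n)
```

When choosing the grid, it may help to remember that a random forest's feature_importances_ are normalized to sum to 1, so with 70 features the average importance is about 1/70 ≈ 0.014. That is probably why 0.15 leaves nothing to train on while 0.015 works. SelectFromModel also accepts string thresholds such as "mean", "median" or "1.25*mean", which scale with the data.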