pandas: How to decide the threshold value in SelectFromModel() for selecting features?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/49345578/

Time: 2020-09-14 05:20:39  Source: igfitidea

How to decide threshold value in SelectFromModel() for selecting features?

Tags: python, pandas, numpy, machine-learning, scikit-learn

Asked by stone rock

I am using a random forest classifier for feature selection. I have 70 features in total and want to select the most important ones. The code below prints the features ranked from most to least important.

Code:

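import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names (this slice skips the first column of `data`)
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Sort feature indices by importance, highest first
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print rank, feature name, and importance
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))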

Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my dataset?

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

When I use threshold=0.15 and then try to train my model, I get an error saying that the data is too noisy or the selection is too strict.

But if I use threshold=0.015, I am able to train my model on the selected features. So how should I decide this threshold value?

Accepted answer by MaxU

I would try the following approach:

  1. Start with a low threshold, for example 1e-4.
  2. Reduce your features using SelectFromModel (fit and transform).
  3. Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
  4. Increase the threshold and repeat from step 2.
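The steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual setup: the synthetic data from make_classification, the list of candidate thresholds, the 50-tree forests, and the single train/validation split are all assumptions chosen to keep the example self-contained. Thresholds that select no features are skipped, which is the situation that produced the error in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (70 features); an assumption.
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Candidate thresholds, from permissive to strict (hypothetical values).
thresholds = [1e-4, 0.005, 0.01, 0.015, 0.02, 0.05, 0.15]
results = []  # (threshold, number of selected features, validation accuracy)

for t in thresholds:
    # Step 2: fit the selector and get the mask of kept features.
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0), threshold=t)
    sfm.fit(X_train, y_train)
    mask = sfm.get_support()

    if not mask.any():
        # No importance reaches the threshold: the selection is too strict.
        print(f"threshold={t}: no features selected, skipping")
        continue

    # Step 3: train the estimator on the selected features and score it.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train[:, mask], y_train)
    acc = clf.score(X_val[:, mask], y_val)
    results.append((t, int(mask.sum()), acc))
    print(f"threshold={t}: {int(mask.sum())} features, accuracy={acc:.3f}")

# Keep the threshold with the best validation accuracy.
best_threshold, best_n, best_acc = max(results, key=lambda r: r[2])
print("best threshold:", best_threshold)
```

A single validation split gives a noisy estimate; scoring each threshold with cross-validation would likely give a more stable comparison.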

Using this approach you can estimate the best threshold for your particular data and your estimator.
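As background on why threshold=0.15 may have failed in the question: the feature importances of a scikit-learn random forest are normalized to sum to 1, so with 70 features the average importance is about 1/70 ≈ 0.014. A threshold of 0.15 is roughly ten times that average, and if no feature reaches it, nothing is selected. The sketch below checks this on synthetic data (an assumption, not the asker's dataset) and shows that SelectFromModel also accepts relative thresholds such as "mean" or "median".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the asker's data (70 features); an assumption.
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=8, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

total = importances.sum()              # importances are normalized
mean_importance = importances.mean()   # equals total / 70
max_importance = importances.max()
print(f"sum={total:.3f}, mean={mean_importance:.4f}, max={max_importance:.4f}")

# A relative threshold scales with the importances instead of being a
# hand-picked absolute number.
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean")
sfm.fit(X, y)
n_selected = int(sfm.get_support().sum())
print(f"threshold_={sfm.threshold_:.4f}, selected {n_selected} of 70 features")
```

Printing the maximum importance before choosing an absolute threshold is a quick way to see which values can select anything at all.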