Python: Feature selection using scikit-learn

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/25792012/

Date: 2020-08-18 23:39:32  Source: igfitidea

Feature selection using scikit-learn

Tags: python, machine-learning, scikit-learn, feature-selection, chi-squared

Asked by sara

I'm new to machine learning. I'm preparing my data for classification with a scikit-learn SVM. To select the best features, I used the following method:

from sklearn.feature_selection import SelectKBest, chi2
SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset contains negative values, I get the following error:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)
--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

Answered by Maxim

The error message "Input X must be non-negative" says it all: Pearson's chi-square test (goodness of fit) does not apply to negative values. That makes sense, because the chi-square test assumes the inputs are frequencies, and a frequency cannot be negative. Consequently, sklearn.feature_selection.chi2 checks that the input is non-negative and raises a ValueError otherwise.
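The check described above can be reproduced on a tiny array. This is a minimal sketch: the toy data and the helper name chi2_rejects are made up for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: 4 samples, 2 features, one negative entry.
X = np.array([[1.0, -0.5],
              [2.0, 0.3],
              [0.5, 1.2],
              [3.0, 0.1]])
y = np.array([0, 1, 0, 1])


def chi2_rejects(X, y):
    """Return True if chi2 raises ValueError for this input."""
    try:
        chi2(X, y)
    except ValueError:
        return True
    return False


print(chi2_rejects(X, y))          # input contains a negative entry
print(chi2_rejects(np.abs(X), y))  # all entries are non-negative
```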

You say that your features are "min, max, mean, median and FFT of accelerometer signal". In many cases it is quite safe to simply shift each feature so that all its values are positive, or even to normalize each feature to the [0, 1] interval, as suggested by EdChum.
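One way to apply the [0, 1] normalization before the chi-square selection might look like the sketch below. It assumes MinMaxScaler from sklearn.preprocessing as the normalizer (the answer does not name a specific tool) and uses random placeholder data in place of the asker's A1 and A2.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for the real features and labels:
# 100 samples, 20 features, with negative values present.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Rescale every feature (column) independently to the [0, 1] interval.
X_scaled = MinMaxScaler().fit_transform(X)

# chi2 now accepts the input, since no value is negative.
X_selected = SelectKBest(chi2, k=10).fit_transform(X_scaled, y)
print(X_selected.shape)
```

When the selection feeds a classifier that is evaluated on held-out data, it is usually safer to fit the scaler on the training split only and reuse it on the test split.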

If transforming the data is for some reason not possible (e.g. the sign of a value is an important factor), you should pick another statistic to score your features.
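As an illustration of swapping the statistic, the sketch below uses f_classif (ANOVA F-value) and mutual_info_classif from sklearn.feature_selection. These two are an assumption chosen for a classification task; the text above does not say which alternatives the answer had in mind. The data is again a random placeholder.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Placeholder data with negative values, left untransformed on purpose.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Both score functions accept negative feature values.
X_f = SelectKBest(f_classif, k=10).fit_transform(X, y)
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

print(X_f.shape, X_mi.shape)
```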

Since the whole point of this procedure is to prepare the features for another method, it matters little which statistic you pick; the end result is usually the same or very close.