Python sklearn logistic regression - important features

Disclaimer: the question and answers below are taken from a popular StackOverflow thread and are provided under the CC BY-SA 4.0 license. If you use or share them, you must do so under the same license and attribute them to the original authors (not me). Original source: http://stackoverflow.com/questions/24255723/

sklearn logistic regression - important features

python · scikit-learn · feature-selection

Asked by mel

I'm pretty sure it's been asked before, but I'm unable to find an answer

Running logistic regression using sklearn in Python, I'm able to transform my dataset to its most important features using the transform method:

from sklearn import linear_model

classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)           # fit on the training data
reduced_train = func.transform(Xtrain)      # keep only the most important features

How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?

Answered by BrenBarn

You can look at the coefficients in the coef_ attribute of the fitted model to see which features are most important. (For LogisticRegression, all transform is doing is looking at which coefficients are highest in absolute value.)

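For illustration, here is a minimal sketch (on made-up toy data, not the asker's dataset) of ranking features by the absolute value of their fitted coefficients:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)

# For a binary problem coef_ has shape (1, n_features)
importance = np.abs(clf.coef_[0])

# Feature indices, largest |coefficient| first
ranking = np.argsort(importance)[::-1]
print("feature ranking:", ranking)
print("|coefficients|:", importance[ranking])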

Most scikit-learn models do not provide a way to calculate p-values. Broadly speaking, these models are designed to be used to actually predict outputs, not to be inspected to glean understanding about how the prediction is done. If you're interested in p-values you could take a look at statsmodels, although it is somewhat less mature than sklearn.

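As a rough sketch of the statsmodels route (assuming statsmodels is installed; the data below is made up), sm.Logit reports a p-value for every column:

import numpy as np
import statsmodels.api as sm

# Toy data purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X @ [0.5, 1.0, 1.5] + rng.randn(100) > 0).astype(int)

# statsmodels does not add an intercept automatically
X_const = sm.add_constant(X)

result = sm.Logit(y, X_const).fit(disp=0)
print(result.summary())   # coefficients, standard errors, p-values
print(result.pvalues)     # one p-value per column, including the constant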

Answered by Fred Foo

LogisticRegression.transform takes a threshold value that determines which features to keep. Straight from the docstring:

threshold : string, float or None, optional (default=None) The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.

There is no object attribute threshold on LR estimators, so by default only those features whose coefficients have an absolute value (summed over the classes) greater than or equal to the mean are kept.

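Note that in newer scikit-learn versions this transform method has been removed from the estimators themselves; the equivalent selection is done with SelectFromModel, which accepts the same kind of threshold strings. A rough sketch of that and of the hand-rolled equivalent (toy data, for illustration only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = LogisticRegression().fit(X, y)

# threshold="mean" mirrors the default behaviour described above
selector = SelectFromModel(clf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print("kept features:", np.flatnonzero(selector.get_support()))

# The same rule by hand: keep features whose summed |coef| is at least the mean
importance = np.abs(clf.coef_).sum(axis=0)
print("kept features (manual):", np.flatnonzero(importance >= importance.mean()))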

Answered by Keith

As suggested in the comments above, you can (and should) scale your data prior to fitting, thus making the coefficients comparable. Below is a little code to show how this would work. I follow this format for comparison.

import numpy as np    
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)

# Give each feature a different weight in the target
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0

X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})

#Scale your data
scaler = StandardScaler()
scaler.fit(X) 
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)

clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)

# Feature importance = absolute value of each coefficient, rescaled to 0-100
feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')

plt.tight_layout()   
plt.show()