Python sklearn logistic regression - important features

Disclaimer: the question and answers below are taken from a popular StackOverflow thread and are provided under the CC BY-SA 4.0 license. If you use or share them, you must do so under the same license and attribute them to the original authors (not me). Original source: http://stackoverflow.com/questions/24255723/

sklearn logistic regression - important features

python · scikit-learn · feature-selection

Asked by mel

I'm pretty sure it's been asked before, but I'm unable to find an answer

Running logistic regression using sklearn in Python, I'm able to transform my dataset to its most important features using the transform method:

from sklearn import linear_model

classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)           # fit on the training data
reduced_train = func.transform(Xtrain)      # keep only the most important features

How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?

Answered by BrenBarn

You can look at the coefficients in the coef_ attribute of the fitted model to see which features are most important. (For LogisticRegression, all transform is doing is looking at which coefficients are highest in absolute value.)

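For illustration, here is a minimal sketch (on made-up toy data, not the asker's dataset) of ranking features by the absolute value of their fitted coefficients:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)

# For a binary problem coef_ has shape (1, n_features)
importance = np.abs(clf.coef_[0])

# Feature indices, largest |coefficient| first
ranking = np.argsort(importance)[::-1]
print("feature ranking:", ranking)
print("|coefficients|:", importance[ranking])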

Most scikit-learn models do not provide a way to calculate p-values. Broadly speaking, these models are designed to be used to actually predict outputs, not to be inspected to glean understanding about how the prediction is done. If you're interested in p-values you could take a look at statsmodels, although it is somewhat less mature than sklearn.

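As a rough sketch of the statsmodels route (assuming statsmodels is installed; the data below is made up), sm.Logit reports a p-value for every column:

import numpy as np
import statsmodels.api as sm

# Toy data purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X @ [0.5, 1.0, 1.5] + rng.randn(100) > 0).astype(int)

# statsmodels does not add an intercept automatically
X_const = sm.add_constant(X)

result = sm.Logit(y, X_const).fit(disp=0)
print(result.summary())   # coefficients, standard errors, p-values
print(result.pvalues)     # one p-value per column, including the constant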

Answered by Fred Foo

LogisticRegression.transform takes a threshold value that determines which features to keep. Straight from the docstring:

threshold : string, float or None, optional (default=None) The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.

There is no object attribute threshold on LR estimators, so by default only those features whose coefficients have an absolute value (summed over the classes) greater than or equal to the mean are kept.

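Note that in newer scikit-learn versions this transform method has been removed from the estimators themselves; the equivalent selection is done with SelectFromModel, which accepts the same kind of threshold strings. A rough sketch of that and of the hand-rolled equivalent (toy data, for illustration only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = LogisticRegression().fit(X, y)

# threshold="mean" mirrors the default behaviour described above
selector = SelectFromModel(clf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print("kept features:", np.flatnonzero(selector.get_support()))

# The same rule by hand: keep features whose summed |coef| is at least the mean
importance = np.abs(clf.coef_).sum(axis=0)
print("kept features (manual):", np.flatnonzero(importance >= importance.mean()))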

Answered by Keith

As suggested in the comments above, you can (and should) scale your data prior to fitting, thus making the coefficients comparable. Below is a little code to show how this would work. I follow this format for comparison.

import numpy as np    
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)

# Give each feature a different weight in the target
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0

X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})

#Scale your data
scaler = StandardScaler()
scaler.fit(X) 
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)

clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)

# Feature importance = absolute value of each coefficient, rescaled to 0-100
feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')

plt.tight_layout()   
plt.show()