Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34052115/

Date: 2020-08-19 14:24:42  Source: igfitidea

How to find the importance of the features for a logistic regression model?

python, machine-learning, scikit-learn, logistic-regression

Asked by mgokhanbakal

I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative class. I know there is a coef_ attribute that comes from the scikit-learn package, but I don't know whether it is enough to determine feature importance. Another thing is how I can evaluate the coef_ values in terms of their importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.

Let's say there are features like the size of a tumor, the weight of the tumor, etc., used to decide for a test case whether it is malignant or not. I want to know which of the features are more important for the malignant versus non-malignant prediction. Does that make sense?

Accepted answer by KT.

One of the simplest ways to get a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of them) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.

Consider this example:

import numpy as np    
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
# Per-sample noise (randn(100), not a single scalar) so each observation is perturbed
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)
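
As a quick sanity check (not part of the original answer; the data below is synthetic and simply mirrors the example above), one can verify that the two approaches rank the features the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Three features with standard deviations of roughly 1, 4, and 0.5
X = np.column_stack([rng.standard_normal(n),
                     4 * rng.standard_normal(n),
                     0.5 * rng.standard_normal(n)])
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(n)) > 0

m = LogisticRegression().fit(X, y)
scaled = np.std(X, 0) * m.coef_[0]        # approach 1: coefficient * feature std

m2 = LogisticRegression().fit(X / np.std(X, 0), y)
standardized = m2.coef_[0]                # approach 2: fit on standardized features

# Both approaches should single out the second feature (std ~4) as most influential
print(np.argsort(-np.abs(scaled)), np.argsort(-np.abs(standardized)))
```

The two sets of numbers are not exactly identical (the default L2 regularization in LogisticRegression interacts with feature scale), but the resulting feature ranking should agree.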

Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
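
As a minimal sketch of one of those alternatives (not from the original answer; the data is synthetic), bootstrap resampling gives a rough idea of how stable the scaled coefficients are:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([rng.standard_normal(n),
                     4 * rng.standard_normal(n),
                     0.5 * rng.standard_normal(n)])
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(n)) > 0

coefs = []
for _ in range(100):
    idx = rng.integers(0, n, n)                      # resample rows with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    coefs.append(np.std(X[idx], 0) * m.coef_[0])     # scale by feature std
coefs = np.array(coefs)

# Mean and spread of each feature's scaled coefficient across resamples:
# a large mean relative to the spread suggests a robustly influential feature
print(coefs.mean(0))
print(coefs.std(0))
```

Features whose scaled coefficients stay large across resamples can be treated as more reliably influential than those whose estimates fluctuate around zero.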

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.