Python sklearn - how to calculate p-values

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/22306341/

Time: 2020-08-19 00:44:12  Source: igfitidea

Tags: python, scikit-learn, p-value

Asked by user1096808

This is probably a simple question, but I am trying to calculate p-values for my features, using either classifiers for a classification problem or regressors for regression. Could someone suggest the best method for each case and provide sample code? I just want to see the p-value for each feature, rather than keeping the k best features or a percentile of features as explained in the documentation.

Thank you

Accepted answer by Fred Foo

Just run the significance test on X, y directly. Example using 20news and chi2:

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.feature_selection import chi2
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data.data, data.target
>>> scores, pvalues = chi2(X, y)
>>> pvalues
array([  4.10171798e-17,   4.34003018e-01,   9.99999996e-01, ...,
         9.99999995e-01,   9.99999869e-01,   9.99981414e-01])
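
For a regression target, or for continuous features in a classification problem, the same pattern should carry over to the other univariate tests in sklearn.feature_selection. The sketch below is not part of the original answer. It assumes f_regression and f_classif are suitable tests for your data, and it uses synthetic data from make_regression and make_classification as a stand-in for your own X, y. The sample sizes, feature counts, and seeds are arbitrary.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.feature_selection import f_classif, f_regression

# Synthetic regression data (hypothetical; replace with your own X, y).
X_reg, y_reg = make_regression(
    n_samples=200, n_features=5, n_informative=2, random_state=0
)
# f_regression returns one F-statistic and one p-value per feature.
f_scores_reg, reg_pvalues = f_regression(X_reg, y_reg)

# Synthetic classification data with continuous features.
X_clf, y_clf = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
# f_classif is the ANOVA F-test; chi2 instead requires non-negative features.
f_scores_clf, clf_pvalues = f_classif(X_clf, y_clf)

for i, (p_reg, p_clf) in enumerate(zip(reg_pvalues, clf_pvalues)):
    print(f"feature {i}: f_regression p={p_reg:.3g}, f_classif p={p_clf:.3g}")
```

These are univariate tests: each feature is tested against y on its own, so the p-values do not account for the other features.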

Answer by Lin Feng

You can use statsmodels

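import statsmodels.api as sm

# X_train, y_train: your training features and binary labels.
# Note: sm.Logit does not add an intercept; use sm.add_constant(X_train) if you need one.
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary())  # the P>|z| column holds the per-feature p-values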

The results would look something like this:

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:               406723
Model:                          Logit   Df Residuals:                   406710
Method:                           MLE   Df Model:                           12
Date:                Fri, 12 Apr 2019   Pseudo R-squ.:                0.001661
Time:                        16:48:45   Log-Likelihood:            -2.8145e+05
converged:                      False   LL-Null:                   -2.8192e+05
                                        LLR p-value:                8.758e-193
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0037      0.003     -1.078      0.281      -0.010       0.003