Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverFlow question: http://stackoverflow.com/questions/39163354/

Time: 2020-08-19 21:57:29  Source: igfitidea

Evaluating Logistic regression with cross validation

Tags: python, scikit-learn, logistic-regression, cross-validation

Asked by S.H

I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).

These concepts are totally new to me and I am not sure whether I am doing it right. I would be grateful if anyone could advise me on the right steps to take and point out where I have gone wrong. Part of my code is shown below.

Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?

Thank you

import pandas as pd 
Data=pd.read_csv ('C:\Dataset.csv',index_col='SNo')
feature_cols=['A','B','C','D','E']
X=Data[feature_cols]

y=Data['Status']
Y1=Data['Status1']  # predictions from elsewhere
Y2=Data['Status2'] # predictions from elsewhere

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn import metrics, cross_validation
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
metrics.accuracy_score(y, predicted) 

from sklearn.cross_validation import cross_val_score
accuracy = cross_val_score(logreg, X, y, cv=10,scoring='accuracy')
print (accuracy)
print (cross_val_score(logreg, X, y, cv=10,scoring='accuracy').mean())

from nltk import ConfusionMatrix 
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))

# sensitivity:
print (metrics.recall_score(y, predicted) )

import matplotlib.pyplot as plt 
probs = logreg.predict_proba(X)[:, 1] 
plt.hist(probs) 
plt.show()

# use 0.5 cutoff for predicting 'default' 
import numpy as np 
preds = np.where(probs > 0.5, 1, 0) 
print (ConfusionMatrix(list(y), list(preds)))

# check accuracy, sensitivity, specificity 
print (metrics.accuracy_score(y, predicted)) 

#ROC CURVES and AUC 
# plot ROC curve 
fpr, tpr, thresholds = metrics.roc_curve(y, probs) 
plt.plot(fpr, tpr) 
plt.xlim([0.0, 1.0]) 
plt.ylim([0.0, 1.0]) 
plt.xlabel('False Positive Rate') 
plt.ylabel('True Positive Rate')
plt.show()

# calculate AUC 
print (metrics.roc_auc_score(y, probs))

# use AUC as evaluation metric for cross-validation 
from sklearn.cross_validation import cross_val_score 
logreg = LogisticRegression() 
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean() 

Accepted answer by CentAu

You got it almost right. cross_validation.cross_val_predict gives you predictions for the entire dataset. You just need to remove the earlier logreg.fit call from your code. Specifically, it does the following: it divides your dataset into n folds, and in each iteration it holds one fold out as the test set and trains the model on the remaining n-1 folds. Every sample is predicted exactly once, by a model that was not trained on it, so in the end you get predictions for the entire dataset.
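
The fold-by-fold procedure described above can be sketched by hand. This is only an illustration under a few assumptions that are not part of the original answer: it uses the newer sklearn.model_selection module, the iris data as a stand-in dataset, explicit StratifiedKFold splits, and a raised max_iter to avoid convergence warnings. The names manual_pred and library_pred are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions roughly equal in every fold.
folds = StratifiedKFold(n_splits=10)

manual_pred = np.empty_like(y)
for train_idx, test_idx in folds.split(X, y):
    model = LogisticRegression(max_iter=1000)  # fresh, unfitted model per fold
    model.fit(X[train_idx], y[train_idx])      # train on the other 9 folds
    manual_pred[test_idx] = model.predict(X[test_idx])  # predict the held-out fold

# The library call performs the same loop internally on the same splits.
library_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=folds)

print((manual_pred == library_pred).all())
```

Because each index lands in exactly one test fold, manual_pred ends up with one out-of-fold prediction per sample, which is what makes it reasonable to score the result against the full y.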

Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y.

In [15]: iris['data'].shape
Out[15]: (150, 4)

To get predictions on the entire set with cross validation you can do the following:

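from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation is the older module name; from scikit-learn 0.20 on,
# cross_val_predict lives in sklearn.model_selection instead.
from sklearn import metrics, cross_validation
from sklearn import datasets
iris = datasets.load_iris()
# Out-of-fold predictions for all 150 samples, using 10-fold cross validation
predicted = cross_validation.cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print(metrics.accuracy_score(iris['target'], predicted))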

Out [1] : 0.9537

print(metrics.classification_report(iris['target'], predicted))

Out [2] :
                     precision    recall  f1-score   support

                0       1.00      1.00      1.00        50
                1       0.96      0.90      0.93        50
                2       0.91      0.96      0.93        50

      avg / total       0.95      0.95      0.95       150
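
For readers on a recent scikit-learn, the same iris example might look like the sketch below. The sklearn.cross_validation module was deprecated in version 0.18 and removed in 0.20, so the import from sklearn.model_selection is the substitution here. The max_iter=1000 setting is an assumption added only to avoid convergence warnings. The scores can differ somewhat from the output shown above, since LogisticRegression defaults have changed between versions.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# max_iter is raised only to avoid convergence warnings (an assumption,
# not part of the original answer).
predicted = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)

accuracy = metrics.accuracy_score(y, predicted)
print(accuracy)
print(metrics.classification_report(y, predicted))
```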

So, back to your code. All you need is this:

from sklearn import metrics, cross_validation
logreg=LogisticRegression()
predicted = cross_validation.cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))
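
The question also computes probabilities with logreg.predict_proba(X) for the histogram, the 0.5 cutoff, and the ROC curve. One possible way to keep those steps consistent with cross validation is to ask cross_val_predict for out-of-fold probabilities instead of class labels. The sketch below assumes a binary target and uses synthetic data from make_classification as a stand-in for the CSV file in the question, which is not available here. The module path sklearn.model_selection and max_iter=1000 are likewise assumptions for a recent scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the question's X (5 feature columns) and y ('Status').
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

logreg = LogisticRegression(max_iter=1000)

# Each row's probability comes from a model that never saw that row.
probs = cross_val_predict(logreg, X, y, cv=10, method="predict_proba")[:, 1]

preds = (probs > 0.5).astype(int)  # same 0.5 cutoff as in the question
cm = confusion_matrix(y, preds)
auc_score = roc_auc_score(y, probs)

print(cm)
print(auc_score)
```

With real data, X and y would be the DataFrame columns from the question, and probs could be passed to plt.hist or metrics.roc_curve in place of the in-sample probabilities.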

For plotting ROC curves in multi-class classification, you can follow this tutorial, which produces a plot of the ROC curves.
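
As a rough illustration of the multi-class case, the sketch below draws one one-vs-rest ROC curve per iris class on a single set of axes. It is not the code from the linked tutorial. It assumes a recent scikit-learn with sklearn.model_selection, out-of-fold probabilities from cross_val_predict, and max_iter=1000 to avoid convergence warnings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
classes = [0, 1, 2]

# Out-of-fold class probabilities, shape (n_samples, n_classes).
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=10, method="predict_proba")

# One indicator column per class, so each class becomes a binary problem.
y_bin = label_binarize(y, classes=classes)

fig, ax = plt.subplots()
roc_auc = {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], probs[:, i])
    roc_auc[cls] = auc(fpr, tpr)
    # Repeated plot calls on the same Axes overlay the curves on one graph.
    ax.plot(fpr, tpr, label="class %d (AUC = %.2f)" % (cls, roc_auc[cls]))

ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend(loc="lower right")
```

Calling plt.show() afterwards displays the figure in an interactive session. The same pattern of repeated ax.plot calls would likely also cover the question about putting several ROC curves on one graph: compute roc_curve once per set of scores and plot each result on the same axes.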

In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.
