Original question: http://stackoverflow.com/questions/39163354/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Evaluating Logistic regression with cross validation
Asked by S.H
I would like to use cross validation to test/train my dataset and evaluate the performance of the logistic regression model on the entire dataset and not only on the test set (e.g. 25%).
These concepts are totally new to me and I am not sure I am doing it right. I would be grateful if anyone could point out where I have gone wrong and advise me on the right steps to take. Part of my code is shown below.
Also, how can I plot ROCs for "y2" and "y3" on the same graph with the current one?
Thank you
import pandas as pd
Data = pd.read_csv(r'C:\Dataset.csv', index_col='SNo')  # raw string keeps the backslash literal
feature_cols = ['A', 'B', 'C', 'D', 'E']
X = Data[feature_cols]
y = Data['Status']  # target; the rest of the code refers to it as y
Y1 = Data['Status1']  # predictions from elsewhere
Y2 = Data['Status2']  # predictions from elsewhere
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# split first so that X_train and y_train exist before fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
# out-of-fold predictions for every row of X
predicted = cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
# per-fold accuracy and its mean
accuracy = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
print(accuracy)
print(accuracy.mean())
from nltk import ConfusionMatrix
print (ConfusionMatrix(list(y), list(predicted)))
#print (ConfusionMatrix(list(y), list(yexpert)))
# sensitivity:
print (metrics.recall_score(y, predicted) )
import matplotlib.pyplot as plt
probs = logreg.predict_proba(X)[:, 1]
plt.hist(probs)
plt.show()
# use 0.5 cutoff for predicting 'default'
import numpy as np
preds = np.where(probs > 0.5, 1, 0)
print (ConfusionMatrix(list(y), list(preds)))
# check accuracy, sensitivity, specificity
print (metrics.accuracy_score(y, predicted))
#ROC CURVES and AUC
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
# calculate AUC
print (metrics.roc_auc_score(y, probs))
# use AUC as evaluation metric for cross-validation
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
Accepted answer by CentAu
You got it almost right. cross_val_predict gives you predictions for the entire dataset. You just need to remove the logreg.fit call earlier in the code. Specifically, what it does is the following:
It divides your dataset into n folds, and in each iteration it leaves one of the folds out as the test set and trains the model on the rest of the folds (n-1 folds). So, in the end you will get predictions for the entire data.
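To make the fold mechanics concrete, here is a minimal sketch that repeats the same procedure by hand and compares it with cross_val_predict. It assumes a recent scikit-learn (the sklearn.model_selection module); StratifiedKFold is used because that is what an integer cv resolves to for classifiers, and max_iter=1000 is my own choice to avoid convergence warnings, not something from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10)

# Manual version: each sample is predicted by a model that was
# trained on the other 9 folds and never saw that sample.
manual = np.empty_like(y)
times_predicted = np.zeros(len(y), dtype=int)
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
    model.fit(X[train_idx], y[train_idx])
    manual[test_idx] = model.predict(X[test_idx])
    times_predicted[test_idx] += 1

# Library version, using the same splitter.
library = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

print('every sample predicted once:', bool((times_predicted == 1).all()))
print('agreement with cross_val_predict:', np.mean(manual == library))
```

Every row ends up in the test fold exactly once, which is why the result covers the whole dataset even though no single model was trained on all of it.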
Let's illustrate this with one of the built-in datasets in sklearn, iris. This dataset contains 150 training samples with 4 features. iris['data'] is X and iris['target'] is y.
In [15]: iris['data'].shape
Out[15]: (150, 4)
To get predictions on the entire set with cross validation you can do the following:
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
predicted = cross_val_predict(LogisticRegression(), iris['data'], iris['target'], cv=10)
print(metrics.accuracy_score(iris['target'], predicted))
Out [1] : 0.9537
print(metrics.classification_report(iris['target'], predicted))
Out [2] :
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        50
          1       0.96      0.90      0.93        50
          2       0.91      0.96      0.93        50
avg / total       0.95      0.95      0.95       150
So, back to your code. All you need is this:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
logreg = LogisticRegression()
predicted = cross_val_predict(logreg, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))
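One caveat: the question computes probs with logreg.predict_proba(X), which needs a fitted model, so it stops working once logreg.fit is removed. A possible way around that, sketched below, is to ask cross_val_predict for out-of-fold probabilities via method='predict_proba' and feed those to the ROC functions. The data here is synthetic (make_classification stands in for the asker's X and y, and a binary target is assumed); max_iter=1000 is also my assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the asker's data (assumption: binary target, 5 features).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000)  # max_iter is an assumption

# Out-of-fold probability of the positive class for every sample;
# no separate logreg.fit call is needed.
probs = cross_val_predict(logreg, X, y, cv=10, method='predict_proba')[:, 1]

# These can be used exactly like the probs in the question.
fpr, tpr, thresholds = roc_curve(y, probs)
auc_value = roc_auc_score(y, probs)
print('cross-validated AUC: %.3f' % auc_value)
```

Because each probability comes from a model that did not see that row, the resulting ROC curve reflects held-out performance over the entire dataset.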
For plotting ROC in multi-class classification, you can follow the scikit-learn tutorial on ROC curves, which plots one ROC curve per class.
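As a rough sketch of putting several ROC curves on one graph, the snippet below draws one-vs-rest curves for the three iris classes from cross-validated probabilities. This is my own illustration under stated assumptions, not code from the tutorial: label_binarize turns the labels into one 0/1 column per class, and max_iter=1000 is an arbitrary choice to avoid convergence warnings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
classes = [0, 1, 2]
y_bin = label_binarize(y, classes=classes)  # shape (150, 3), one column per class

# Out-of-fold class probabilities, columns in the same order as classes.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=10, method='predict_proba')

fig, ax = plt.subplots()
aucs = {}
for i, cls in enumerate(classes):
    # One-vs-rest: class cls against all other classes.
    fpr, tpr, _ = roc_curve(y_bin[:, i], probs[:, i])
    aucs[cls] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label='class %d (AUC = %.2f)' % (cls, aucs[cls]))

ax.plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance line
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.legend(loc='lower right')
# Call plt.show() or fig.savefig(...) to view the figure.
```

The same pattern applies to the question's extra columns: call roc_curve once per (labels, scores) pair and plot each result on the same axes before showing the figure. Note that if those columns hold hard 0/1 predictions rather than scores, each "curve" collapses to a single operating point.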
In general, sklearn has very good tutorials and documentation. I strongly recommend reading their tutorial on cross_validation.