Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/41458834/

Date: 2020-08-20 00:59:44  Source: igfitidea

How is scikit-learn cross_val_predict accuracy score calculated?

Tags: python, scikit-learn, cross-validation

Asked by thunder

Does cross_val_predict (see doc, v0.18), used with the k-fold method as shown in the code below, calculate the accuracy for each fold and then average them?

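from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

cv = KFold(n_splits=20)  # td: feature matrix, labels: true class labels
clf = SVC()
ypred = cross_val_predict(clf, td, labels, cv=cv)
accuracy = accuracy_score(labels, ypred)
print(accuracy)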

Answered by Omid

No, it does not!

According to the cross-validation doc page, cross_val_predict does not return any scores, only the labels predicted under a certain strategy, which is described here:

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).

And therefore, by calling accuracy_score(labels, ypred) you are just calculating the accuracy of the labels predicted by the aforementioned strategy compared to the true labels. This again is specified in the same documentation page:

These predictions can then be used to evaluate the classifier:

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10) 
metrics.accuracy_score(iris.target, predicted)

Note that the result of this computation may be slightly different from those obtained using cross_val_score as the elements are grouped in different ways.

If you need accuracy scores of different folds you should try:

>>> scores = cross_val_score(clf, X, y, cv=cv)
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

and then for the mean accuracy of all folds use scores.mean():

>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
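
To see how the two numbers relate, here is a minimal sketch, assuming the iris dataset, a default SVC, and a shuffled 4-fold split (all three are illustrative choices, not part of the question). It computes the mean of the per-fold accuracies from cross_val_score and the pooled accuracy from cross_val_predict on the same folds:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Assumed example data: 150 iris samples, so 4 folds have sizes 38, 38, 37, 37
X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=0)
clf = SVC()

# One accuracy value per fold (a classifier's default score is accuracy)
fold_scores = cross_val_score(clf, X, y, cv=cv)
mean_of_folds = fold_scores.mean()

# One prediction per sample, pooled over all folds, scored once
y_pred = cross_val_predict(clf, X, y, cv=cv)
pooled = accuracy_score(y, y_pred)

# Weighting each fold's accuracy by its test-set size reproduces the pooled value
fold_sizes = np.array([len(test_index) for _, test_index in cv.split(X)])
weighted = np.average(fold_scores, weights=fold_sizes)

print("mean of fold accuracies:", mean_of_folds)
print("pooled accuracy:", pooled)
print("size-weighted mean of fold accuracies:", weighted)
```

When the folds have unequal sizes the first two values can differ slightly, while the pooled accuracy equals the average of the fold accuracies weighted by fold size.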


How to calculate Cohen kappa coefficient and confusion matrix for each fold?

For calculating the Cohen kappa coefficient and the confusion matrix, I assume you mean the kappa coefficient and confusion matrix between the true labels and each fold's predicted labels:

from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# X: feature matrix, labels: true class labels (both NumPy arrays)
cv = KFold(n_splits=20)
clf = SVC()
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    # Per-fold metrics, computed on this fold's test samples only
    kappa_score = cohen_kappa_score(labels[test_index], ypred)
    # Named cm so the result does not shadow the confusion_matrix function
    cm = confusion_matrix(labels[test_index], ypred)
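
The loop above keeps only the last fold's values. A minimal runnable sketch that stores the result of every fold, assuming the iris dataset and a shuffled 5-fold split as stand-ins for the asker's data (the names kappa_scores and conf_matrices are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Assumed example data in place of the asker's X and labels
X, labels = load_iris(return_X_y=True)
classes = np.unique(labels)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC()

kappa_scores = []
conf_matrices = []
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    kappa_scores.append(cohen_kappa_score(labels[test_index], ypred))
    # Passing labels=classes keeps every matrix at the same 3x3 shape
    conf_matrices.append(
        confusion_matrix(labels[test_index], ypred, labels=classes)
    )

print("kappa per fold:", np.round(kappa_scores, 3))
print("summed confusion matrix:")
print(np.sum(conf_matrices, axis=0))
```

Summing the per-fold matrices gives one confusion matrix over all 150 samples, since each sample is in a test set exactly once.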


What does cross_val_predict return?

It uses KFold to split the data into k parts and then, for iterations i=1..k:

  • takes the i'th part as the test data and all other parts as the training data
  • trains the model on the training data (all parts except the i'th)
  • then uses this trained model to predict the labels of the i'th part (the test data)

In each iteration, the labels of the i'th part of the data get predicted. In the end, cross_val_predict merges all the partial predictions and returns them as the final result.

This code shows this process step by step:

import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC

X = np.array([[0], [1], [2], [3], [4], [5]])
labels = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

cv = KFold(n_splits=3)
clf = SVC()
# Array of empty strings that collects each fold's predictions
ypred_all = np.full(labels.shape, '', dtype=labels.dtype)
i = 1
for train_index, test_index in cv.split(X):
    print("iteration", i, ":")
    print("train indices:", train_index)
    print("train data:", X[train_index])
    print("test indices:", test_index)
    print("test data:", X[test_index])
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    print("predicted labels for data of indices", test_index, "are:", ypred)
    ypred_all[test_index] = ypred
    print("merged predicted labels:", ypred_all)
    i = i + 1
    print("=====================================")
y_cross_val_predict = cross_val_predict(clf, X, labels, cv=cv)
print("predicted labels by cross_val_predict:", y_cross_val_predict)

The result is:

iteration 1 :
train indices: [2 3 4 5]
train data: [[2] [3] [4] [5]]
test indices: [0 1]
test data: [[0] [1]]
predicted labels for data of indices [0 1] are: ['b' 'b']
merged predicted labels: ['b' 'b' '' '' '' '']
=====================================
iteration 2 :
train indices: [0 1 4 5]
train data: [[0] [1] [4] [5]]
test indices: [2 3]
test data: [[2] [3]]
predicted labels for data of indices [2 3] are: ['a' 'b']
merged predicted labels: ['b' 'b' 'a' 'b' '' '']
=====================================
iteration 3 :
train indices: [0 1 2 3]
train data: [[0] [1] [2] [3]]
test indices: [4 5]
test data: [[4] [5]]
predicted labels for data of indices [4 5] are: ['a' 'a']
merged predicted labels: ['b' 'b' 'a' 'b' 'a' 'a']
=====================================
predicted labels by cross_val_predict: ['b' 'b' 'a' 'b' 'a' 'a']

Answered by BloodyD

As you can see from the code of cross_val_predict on GitHub, the function computes the predictions for each fold and concatenates them. Each fold's predictions are made by a model learned from the other folds.

Here is a combination of your code and the example provided in the code:

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import accuracy_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:400]
y = diabetes.target[:400]
cv = KFold(n_splits=20)
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=cv)
accuracy = accuracy_score(y_pred.astype(int), y.astype(int))

print(accuracy)
# >>> 0.0075

Finally, to answer your question: "No, the accuracy is not computed for each fold and then averaged."

Answered by Enrico Damini

As it is written in the documentation of sklearn.model_selection.cross_val_predict:

It is not appropriate to pass these predictions into an evaluation metric. Use cross_validate to measure generalization error.
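
A minimal sketch of what that recommendation can look like, assuming the iris dataset, a default SVC, and a shuffled 5-fold split (illustrative choices, not part of the quoted documentation). It asks cross_validate for per-fold accuracy and Cohen kappa; the "accuracy" and "kappa" keys of the scoring dictionary are names chosen here:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVC

# Assumed example data and estimator
X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each metric is evaluated separately on every fold's test set
scoring = {
    "accuracy": "accuracy",
    "kappa": make_scorer(cohen_kappa_score),
}
results = cross_validate(SVC(), X, y, cv=cv, scoring=scoring)

# Results are stored under "test_<name>" as one value per fold
print("accuracy per fold:", results["test_accuracy"])
print("mean accuracy:", results["test_accuracy"].mean())
print("kappa per fold:", results["test_kappa"])
```

The returned dictionary also holds fit and score times, so the per-fold scores are read from the "test_" entries.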

Answered by Shlomo Koppel

I would like to add a quick and easy option on top of what the previous answerers contributed.

If you take the micro average of F1 in a single-label classification problem, you will essentially be getting the accuracy rate. So, for example, that would be:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support as score

# lm: a classifier, df: feature table, y: true class labels
y_pred = cross_val_predict(lm, df, y, cv=5)
precision, recall, fscore, support = score(y, y_pred, average='micro')
print(fscore)

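This works mathematically because the micro average pools the true positives, false positives, and false negatives over all classes of the confusion matrix. In single-label classification every misclassified sample counts once as a false positive and once as a false negative, so micro precision, recall, and F1 all equal the fraction of correct predictions.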
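
A small check of that equality, using made-up labels for a three-class problem (the y_true and y_pred values are invented for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 5 of the 8 predictions are correct
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
# Micro averaging: 5 true positives, 3 false positives, 3 false negatives
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(acc)       # 0.625
print(micro_f1)  # 0.625
```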

Good luck.
