How is scikit-learn cross_val_predict accuracy score calculated?
Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/41458834/
Asked by thunder
Does cross_val_predict (see doc, v0.18) with the k-fold method, as shown in the code below, calculate the accuracy for each fold and then average them, or not?
cv = KFold(len(labels), n_folds=20)
clf = SVC()
ypred = cross_val_predict(clf, td, labels, cv=cv)
accuracy = accuracy_score(labels, ypred)
print accuracy
Answered by Omid
No, it does not!
According to the cross-validation doc page, cross_val_predict does not return any scores, but only the labels predicted according to a certain strategy, which is described there:
The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).
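As a quick illustration of that last point, here is a small sketch of my own (not part of the original answer): passing a splitter that is not a partition of the data, such as ShuffleSplit, should make cross_val_predict raise an exception:

from sklearn import datasets
from sklearn.model_selection import ShuffleSplit, cross_val_predict
from sklearn.svm import SVC

iris = datasets.load_iris()

# ShuffleSplit does not put every sample in a test set exactly once,
# so cross_val_predict is expected to reject it with a ValueError.
try:
    cross_val_predict(SVC(), iris.data, iris.target,
                      cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))
except ValueError as err:
    print(err)  # e.g. "cross_val_predict only works for partitions"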
And therefore, by calling accuracy_score(labels, ypred), you are just calculating the accuracy of the labels predicted by that particular strategy, compared to the true labels. This, again, is specified on the same documentation page:
These predictions can then be used to evaluate the classifier:

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)

Note that the result of this computation may be slightly different from those obtained using cross_val_score as the elements are grouped in different ways.
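To make that difference concrete, here is a small sketch of my own (not from the original answer): the accuracy computed from the pooled cross_val_predict output is a single score over all samples, whereas cross_val_score averages one accuracy per fold, so the two numbers can differ slightly when the folds have unequal sizes:

from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
cv = KFold(n_splits=7, shuffle=True, random_state=0)  # 150 samples -> folds of size 22 and 21
clf = SVC()

pooled = accuracy_score(y, cross_val_predict(clf, X, y, cv=cv))  # one score over all pooled predictions
per_fold = cross_val_score(clf, X, y, cv=cv)                     # one score per fold

print(pooled)            # accuracy of the concatenated out-of-fold predictions
print(per_fold.mean())   # mean of the per-fold accuracies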
If you need the accuracy scores of the individual folds, you should try:
>>> scores = cross_val_score(clf, X, y, cv=cv)
>>> scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
and then, for the mean accuracy of all folds, use scores.mean():
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
How to calculate Cohen kappa coefficient and confusion matrix for each fold?
For calculating the Cohen kappa coefficient and confusion matrix, I assumed you mean the kappa coefficient and confusion matrix between the true labels and each fold's predicted labels:
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score, confusion_matrix

cv = KFold(n_splits=20)
clf = SVC()
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    # per-fold scores; both values are overwritten on every iteration
    kappa_score = cohen_kappa_score(labels[test_index], ypred)
    conf_mat = confusion_matrix(labels[test_index], ypred)
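If you want to keep the per-fold results instead of overwriting them on each iteration, a small variation of the loop above (just a sketch, reusing the same X, labels, cv and clf) is to collect them in lists:

kappa_scores, conf_mats = [], []
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    kappa_scores.append(cohen_kappa_score(labels[test_index], ypred))
    conf_mats.append(confusion_matrix(labels[test_index], ypred))

print(sum(kappa_scores) / len(kappa_scores))  # average kappa over the folds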
What does cross_val_predict return?
It uses KFold to split the data into k parts and then, for i = 1..k iterations:
- takes the i'th part as the test data and all other parts as the training data
- trains the model with the training data (all parts except the i'th)
- then, using this trained model, predicts labels for the i'th part (the test data)
In each iteration, the labels of the i'th part of the data get predicted. In the end, cross_val_predict merges all the partially predicted labels and returns them as the final result.
This code shows the process step by step:
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC

X = np.array([[0], [1], [2], [3], [4], [5]])
labels = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
cv = KFold(n_splits=3)
clf = SVC()
ypred_all = np.chararray(labels.shape)
i = 1
for train_index, test_index in cv.split(X):
    print("iteration", i, ":")
    print("train indices:", train_index)
    print("train data:", X[train_index])
    print("test indices:", test_index)
    print("test data:", X[test_index])
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    print("predicted labels for data of indices", test_index, "are:", ypred)
    ypred_all[test_index] = ypred
    print("merged predicted labels:", ypred_all)
    i = i + 1
    print("=====================================")
y_cross_val_predict = cross_val_predict(clf, X, labels, cv=cv)
print("predicted labels by cross_val_predict:", y_cross_val_predict)
The result is:
iteration 1 :
train indices: [2 3 4 5]
train data: [[2] [3] [4] [5]]
test indices: [0 1]
test data: [[0] [1]]
predicted labels for data of indices [0 1] are: ['b' 'b']
merged predicted labels: ['b' 'b' '' '' '' '']
=====================================
iteration 2 :
train indices: [0 1 4 5]
train data: [[0] [1] [4] [5]]
test indices: [2 3]
test data: [[2] [3]]
predicted labels for data of indices [2 3] are: ['a' 'b']
merged predicted labels: ['b' 'b' 'a' 'b' '' '']
=====================================
iteration 3 :
train indices: [0 1 2 3]
train data: [[0] [1] [2] [3]]
test indices: [4 5]
test data: [[4] [5]]
predicted labels for data of indices [4 5] are: ['a' 'a']
merged predicted labels: ['b' 'b' 'a' 'b' 'a' 'a']
=====================================
predicted labels by cross_val_predict: ['b' 'b' 'a' 'b' 'a' 'a']
Answered by BloodyD
As you can see from the code of cross_val_predict on GitHub, the function computes the predictions for each fold and concatenates them. The predictions are made using a model learned from the other folds.
Here is a combination of your code and the example provided in that source code:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import accuracy_score
diabetes = datasets.load_diabetes()
X = diabetes.data[:400]
y = diabetes.target[:400]
cv = KFold(n_splits=20)
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=cv)
accuracy = accuracy_score(y_pred.astype(int), y.astype(int))
print(accuracy)
# >>> 0.0075
Finally, to answer your question: "No, the accuracy is not averaged for each fold"
Answered by Enrico Damini
As it is written in the documentation of sklearn.model_selection.cross_val_predict:
It is not appropriate to pass these predictions into an evaluation metric. Use cross_validate to measure generalization error.
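For reference, a minimal sketch of what that could look like (my own example, not taken from the linked documentation):

from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = datasets.load_iris()
results = cross_validate(SVC(), iris.data, iris.target, cv=5, scoring='accuracy')

print(results['test_score'])         # one accuracy score per fold
print(results['test_score'].mean())  # averaged estimate of generalization error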
Answered by Shlomo Koppel
I would like to add an option for a quick and easy answer, on top of what the previous developers contributed.
If you take the micro average of F1, you will essentially be getting the accuracy rate. So, for example, that would be:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support as score
y_pred = cross_val_predict(lm, df, y, cv=5)
precision, recall, fscore, support = score(y, y_pred, average='micro')
print(fscore)
This works mathematically, since the micro average pools the counts of the confusion matrix over all classes, which for single-label classification makes the micro-averaged F1 equal to the accuracy.
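As a quick sanity check of that equivalence (my own sketch, with made-up labels): for single-label classification, the micro-averaged F1 and plain accuracy come out the same:

from sklearn.metrics import accuracy_score, f1_score

y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

print(accuracy_score(y_true, y_pred))             # 0.666...
print(f1_score(y_true, y_pred, average='micro'))  # same value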
Good luck.