Evaluate multiple scores on sklearn cross_val_score

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/35876508/

Time: 2020-08-19 17:06:19  Source: igfitidea

Tags: python, machine-learning, scikit-learn

Asked by Cristiano Araujo

I'm trying to evaluate multiple machine learning algorithms with sklearn for a couple of metrics (accuracy, recall, precision and maybe more).

From what I understood from the documentation and from the source code (I'm using sklearn 0.17), the cross_val_score function only accepts one scorer per execution. So to calculate multiple scores, I have to either:

  1. Execute multiple times
  2. Implement my (time consuming and error prone) scorer

    I've executed it multiple times with this code:

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import  cross_val_score
    import time
    from sklearn.datasets import  load_iris
    
    iris = load_iris()
    
    models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
    names = ["Naive Bayes", "Decision Tree", "SVM"]
    for model, name in zip(models, names):
        print name
        start = time.time()
        for score in ["accuracy", "precision", "recall"]:
            print score,
            print " : ",
            print cross_val_score(model, iris.data, iris.target,scoring=score, cv=10).mean()
        print time.time() - start
    
And I get this output:

Naive Bayes
accuracy  :  0.953333333333
precision  :  0.962698412698
recall  :  0.953333333333
0.0383198261261
Decision Tree
accuracy  :  0.953333333333
precision  :  0.958888888889
recall  :  0.953333333333
0.0494720935822
SVM
accuracy  :  0.98
precision  :  0.983333333333
recall  :  0.98
0.063080072403

This works, but it is slow on my own data. How can I measure all the scores?

Answered by piman314

Since this post was written, scikit-learn has been updated and my original answer is now obsolete; see the much cleaner solution below.

You can write your own scoring function to capture all three pieces of information. However, in scikit-learn a scoring function for cross validation must return a single number (this is likely for compatibility reasons). Below is an example where the scores for each cross validation slice are printed to the console, and the returned value is just the sum of the three metrics. If you want to return all these values, you will have to make some changes to cross_val_score (line 1351 of cross_validation.py) and _score (line 1601 of the same file).

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import  cross_val_score
import time
from sklearn.datasets import  load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score

iris = load_iris()

models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
names = ["Naive Bayes", "Decision Tree", "SVM"]

def getScores(estimator, x, y):
    yPred = estimator.predict(x)
    return (accuracy_score(y, yPred), 
            precision_score(y, yPred, pos_label=3, average='macro'), 
            recall_score(y, yPred, pos_label=3, average='macro'))

def my_scorer(estimator, x, y):
    a, p, r = getScores(estimator, x, y)
    print a, p, r
    return a+p+r

for model, name in zip(models, names):
    print name
    start = time.time()
    m = cross_val_score(model, iris.data, iris.target,scoring=my_scorer, cv=10).mean()
    print '\nSum:',m, '\n\n'
    print 'time', time.time() - start, '\n\n'

Which gives:

Naive Bayes
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.904761904762 0.866666666667
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86936507937 


time 0.0249638557434 


Decision Tree
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.866666666667 0.866666666667
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86555555556 


time 0.0237860679626 


SVM
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.94333333333 


time 0.043044090271 


As of scikit-learn 0.19.0 the solution becomes much easier:

from sklearn.model_selection import cross_validate
from sklearn.datasets import  load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           # the key says 'rec_micro', but it maps to the macro-averaged recall scorer
           'rec_micro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                         cv=5, return_train_score=True)
print(scores.keys())
print(scores['test_acc'])  

Which gives:

['test_acc', 'score_time', 'train_acc', 'fit_time', 'test_rec_micro', 'train_rec_micro', 'train_prec_macro', 'test_prec_macro']
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
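
As a complement to the string-based `scoring` dictionary, here is a minimal sketch of a single callable that predicts once per fold and returns all metrics as a dictionary. It assumes scikit-learn 0.24 or later, where `cross_validate` accepts a callable returning a dict; the function name `multi_metric_scorer` and the metric keys are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC


def multi_metric_scorer(estimator, X, y):
    """Predict once per fold and derive every metric from that prediction."""
    y_pred = estimator.predict(X)
    return {
        "acc": accuracy_score(y, y_pred),
        "prec_macro": precision_score(y, y_pred, average="macro"),
        "rec_macro": recall_score(y, y_pred, average="macro"),
    }


iris = load_iris()
scores = cross_validate(SVC(), iris.data, iris.target,
                        scoring=multi_metric_scorer, cv=5)

# Each metric gets one array of per-fold values under a "test_"-prefixed key.
mean_scores = {key: values.mean() for key, values in scores.items()
               if key.startswith("test_")}
for key, value in sorted(mean_scores.items()):
    print("%s: %.3f" % (key, value))
```

Because the prediction is computed once and reused for every metric, this form avoids repeating `predict` per metric, which may matter when prediction is expensive.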

Answered by kyriakosSt

I ran into the same problem, so I created a module that supports multiple metrics in cross_val_score.
To accomplish what you want with this module, you can write:

import time

import numpy as np
from multiscorer import MultiScorer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score

# `models`, `names` and `iris` are defined as in the question's code.

# Each entry maps a metric name to (metric function, keyword arguments).
scorer = MultiScorer({
    'Accuracy'  : (accuracy_score , {}),
    'Precision' : (precision_score, {'pos_label': 3, 'average':'macro'}),
    'Recall'    : (recall_score   , {'pos_label': 3, 'average':'macro'})
})

for model, name in zip(models, names):
    print(name)
    start = time.time()

    # The return value is assigned to `_` to show that it is not used;
    # the results are read back from the scorer instead.
    _ = cross_val_score(model, iris.data, iris.target, scoring=scorer, cv=10)
    results = scorer.get_results()

    for metric_name in results.keys():
        average_score = np.average(results[metric_name])
        print('%s : %f' % (metric_name, average_score))

    print('time', time.time() - start, '\n\n')

You can check and download this module from GitHub. Hope it helps.
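
To illustrate the general idea behind a scorer of this kind, here is a rough sketch of a callable object that records several metrics per fold as a side effect while still returning a single number to cross_val_score. The class `RecordingScorer` is hypothetical and written only for this illustration; it is not the code of the module mentioned above, and it assumes a recent scikit-learn running on Python 3.

```python
from collections import defaultdict

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


class RecordingScorer:
    """Hypothetical helper: stores several metrics for every scored fold."""

    def __init__(self, metrics):
        # metrics maps a name to (metric function, keyword arguments).
        self.metrics = metrics
        self.results = defaultdict(list)

    def __call__(self, estimator, X, y):
        y_pred = estimator.predict(X)  # predict once, reuse for all metrics
        for name, (metric, kwargs) in self.metrics.items():
            self.results[name].append(metric(y, y_pred, **kwargs))
        # cross_val_score expects one number, so return the first metric.
        first_name = next(iter(self.metrics))
        return self.results[first_name][-1]


iris = load_iris()
scorer = RecordingScorer({
    "Accuracy": (accuracy_score, {}),
    "Precision": (precision_score, {"average": "macro"}),
    "Recall": (recall_score, {"average": "macro"}),
})

fold_scores = cross_val_score(GaussianNB(), iris.data, iris.target,
                              scoring=scorer, cv=10)
for name, values in scorer.results.items():
    print("%s : %f" % (name, np.mean(values)))
```

This sketch relies on the folds being scored in the same process (the default, with no `n_jobs` set); with parallel workers the recorded side effects would likely not reach the original object.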
