Python: how to compute precision, recall, accuracy and F1-score for the multiclass case with scikit-learn?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/31421413/
How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit-learn?
Asked by new_with_python
I'm working on a sentiment analysis problem; the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit-learn's SVC. The problem is that I do not know how to balance my data in the right way in order to compute the precision, recall, accuracy and f1-score for the multiclass case accurately. So I tried the following approaches:
First:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix

wclf = SVC(kernel='linear', C=1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction, average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, weighted_prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C=1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction, average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, auto_weighted_prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n classification report:\n', classification_report(y_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
0.930416613529
However, I'm getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal with my unbalanced data correctly in order to compute the classifier's metrics in the right way?
Accepted answer by ldirer
I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.
Basically, in every scikit-learn classifier the class weights are used to tell your model how important a class is. That means that during training, the classifier will make extra efforts to properly classify the classes with high weights.
How it does that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
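For instance, here is a minimal sketch of both styles mentioned in the question (the weight values below are made up for illustration, not tuned for this dataset):

from sklearn.svm import SVC

# Hypothetical manual weights: boost the rare classes (1, 2, 3), leave the frequent ones (4, 5) alone.
manual_wclf = SVC(kernel='linear', C=1, class_weight={1: 6, 2: 9, 3: 5, 4: 1, 5: 1})

# Or let scikit-learn derive weights inversely proportional to the class frequencies
# ('auto' in the version used in the question, 'balanced' in later releases).
auto_wclf = SVC(kernel='linear', C=1, class_weight='auto')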
The metrics
Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...
Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
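As a quick illustration, here is a sketch on made-up labels with the same kind of imbalance as in the question, using scikit-learn's DummyClassifier as the "always predict the most frequent class" baseline:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels reproducing the imbalance from the question: class 5 dominates.
y_true = [5] * 1190 + [4] * 838 + [3] * 239 + [1] * 204 + [2] * 127
X_fake = [[0]] * len(y_true)  # the features are irrelevant for this baseline

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_fake, y_true)
y_pred = dummy.predict(X_fake)

print(accuracy_score(y_true, y_pred))             # ~0.46 just by always answering "5"
print(f1_score(y_true, y_pred, average='macro'))  # much lower, since 4 of the 5 classes are never found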
I will not detail all these metrics, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report, they are defined for each class. They rely on concepts such as true positives or false negatives that require defining which class is the positive one.
             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17

avg / total       0.52      0.60      0.51        50
The warning
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The
default `weighted` averaging is deprecated, and from version 0.18,
use of precision, recall or F-score with multiclass or multilabel data
or pos_label=None will result in an exception. Please set an explicit
value for `average`, one of (None, 'micro', 'macro', 'weighted',
'samples'). In cross validation use, for instance,
scoring="f1_weighted" instead of scoring="f1".
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:
- Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
- Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
- Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class is in the computation.
These are 3 of the options in scikit-learn, and the warning is there to say you have to pick one. So you have to specify an average argument for the score method.
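For example, the three choices give three different single numbers from the same predictions (the labels below are hypothetical):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of the per-class f1-scores
print(f1_score(y_true, y_pred, average='micro'))     # computed from the global true/false positive counts
print(f1_score(y_true, y_pred, average='weighted'))  # per-class f1-scores weighted by class support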
Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account, and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, you'll give more importance to class 5.
The whole argument specification in these metrics is not super-clear in scikit-learn right now; according to the docs it will get better in version 0.18. They are removing some non-obvious standard behavior and issuing warnings so that developers notice it.
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)

# Any classifier will do; svc was not defined in the original snippet.
svc = SVC(kernel='linear', C=1)

for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))
Hope this helps.
Answer by Vlad Mironov
First of all, it's a little bit harder to tell whether your data is unbalanced just from a counting analysis. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and choose its status wisely.
Okay, what if it's really unbalanced?
Once again: look at your data. Sometimes you can find one or two observations multiplied a hundred times. Sometimes it's useful to create these fake one-class observations.
If all the data is clean, the next step is to use class weights in the prediction model.
So what about multiclass metrics?
In my experience none of your metrics is commonly used. There are two main reasons.
First: it's always better to work with probabilities than with hard predictions (because how else could you separate models with 0.9 and 0.6 predictions if they both give you the same class?)
And second: it's much easier to compare your prediction models and build new ones when you rely on only one good metric.
From my experience I could recommend logloss or MSE (or just mean squared error).
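A sketch of how logloss could be computed with scikit-learn (the old cross_validation import matches the rest of this page; SVC only exposes predict_proba when probability=True, and any probabilistic classifier would do):

from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=300, n_informative=10, n_classes=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# probability=True makes SVC fit an extra calibration step so that predict_proba is available.
clf = SVC(kernel='linear', C=1, probability=True)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # one column of class probabilities per class
print(log_loss(y_test, proba))     # lower is better; rewards confident correct predictions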
How to fix sklearn warnings?
Simply (as yangjie noticed) override the average parameter with one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label) or 'weighted' (same as macro but with automatic weights).
f1_score(y_test, prediction, average='weighted')
All your warnings came from calling the metric functions with the default average value 'binary', which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!
Edit:
I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember there is no such thing as multiclass regression. Yes, there is multilabel regression, which is far different, and yes, in some cases it's possible to switch between regression and classification (if the classes are somehow sorted), but it's pretty rare.
What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.
After that you can calculate the arithmetic or geometric mean between predictions, and most of the time you'll get an even better result.
final_prediction = (KNNprediction * RFprediction) ** 0.5
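One way to read this (a sketch, not the answerer's exact code) is to blend the class probabilities of two models and then take the most likely class; it reuses the train/test split from the logloss snippet above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Geometric mean of the two models' class-probability estimates, then pick the most likely class.
blended = np.sqrt(rf.predict_proba(X_test) * knn.predict_proba(X_test))
final_prediction = rf.classes_[np.argmax(blended, axis=1)]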
Answer by Nikita Astrakhantsev
Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': the macro-F1-measure. Macro precision and macro recall can also be used, but they are not as easily interpretable as for binary classification; they are already incorporated into the F-measure, and excess metrics complicate method comparison, parameter tuning, and so on.
Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, micro-averaged metrics still show good results.
Weighted averaging isn't well suited for imbalanced data, because it weights by the counts of labels. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey, which I strongly recommend looking through:
Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.
Application-specific question
However, returning to your task, I'd research 2 topics:
- metrics commonly used for your specific task - this lets you (a) compare your method with others and understand whether you are doing something wrong, and (b) avoid exploring this by yourself and reuse someone else's findings;
- the cost of different errors of your method - for example, the use-case of your application may rely on 4- and 5-star reviews only - in this case, a good metric should count only these 2 labels.
Commonly used metrics. As I can infer after looking through the literature, there are 2 main evaluation metrics:
- Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using Yelp Business."
(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.
(link)

Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS N 224 (2010).
(link) - they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.
(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors; the work is pretty new and seems to be written in Python.
Cost of different errors. If you care more about avoiding gross blunders, e.g. assigning 1 star to a 5-star review or something like that, look at MSE; if the difference matters, but not so much, try MAE, since it doesn't square the diff; otherwise stay with Accuracy.
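Assuming the star ratings are kept as the integers 1-5, here is a small sketch of that comparison (the predictions are hypothetical):

from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

y_true = [5, 5, 4, 3, 1, 2]  # hypothetical true star ratings
y_pred = [4, 5, 5, 1, 1, 2]  # hypothetical predicted ratings

print(mean_squared_error(y_true, y_pred))   # MSE: punishes the 3 -> 1 blunder hardest
print(mean_absolute_error(y_true, y_pred))  # MAE: counts the size of each error, without squaring it
print(accuracy_score(y_true, y_pred))       # only cares whether the label is exactly right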
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
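A minimal sketch of what that could look like for 1-5 star labels; the rounding/clipping step that maps the continuous SVR output back to a class is my assumption, not something specified in the answer:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVR

# Toy stand-in for the review features; y holds integer ratings, shifted to the 1-5 range.
X, y = make_classification(n_samples=500, n_informative=10, n_classes=5)
y = y + 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

svr = SVR(kernel='linear', C=1)
svr.fit(X_train, y_train)

raw = svr.predict(X_test)                        # continuous predictions, e.g. 3.7
stars = np.clip(np.rint(raw), 1, 5).astype(int)  # round and clamp back to the 1-5 scale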
Answer by wonderkid2
Lots of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:
- How do I score a multiclass problem?
- How do I deal with unbalanced data?
1.
You can use most of the scoring functions in scikit-learn with multiclass problems just as with single-class problems. Ex.:
from sklearn.metrics import precision_recall_fscore_support as score
predicted = [1,2,3,4,5,1,2,1,1,4,5]
y_test = [1,2,3,4,5,1,2,1,1,4,1]
precision, recall, fscore, support = score(y_test, predicted)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
This way you end up with tangible and interpretable numbers for each of the classes.
| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1 | 94% | 83% | 0.88 | 204 |
| 2 | 71% | 50% | 0.54 | 127 |
| ... | ... | ... | ... | ... |
| 4 | 80% | 98% | 0.89 | 838 |
| 5 | 93% | 81% | 0.91 | 1190 |
Then...
2.
... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (classes 1 and 2) is lower than for the classes with more training samples (classes 4 and 5), then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the imbalance is a good thing.