Python: how to normalize a confusion matrix?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/20927368/
Python: how to normalize a confusion matrix?
Asked by Kaly
I calculated a confusion matrix for my classifier using the method confusion_matrix() from the sklearn package. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1. I would like to read the percentage of correctly classified samples from the matrix.
I found several methods for normalizing a matrix (row and column normalization), but I don't know much about maths and am not sure if this is the correct approach. Can someone help, please?
Accepted answer by hugomg
I'm assuming that M[i,j] stands for "element of real class i that was classified as class j". If it's the other way around, you are going to need to transpose everything I say. I'm also going to use the following matrix for concrete examples:
1 2 3
4 5 6
7 8 9
There are essentially two things you can do:
Finding how each class has been classified
The first thing you can ask is what percentage of elements of real class i were classified as each class. To do so, we take the row for class i and divide each element by the sum of the elements in that row. In our example, objects from class 2 are classified as class 1 4 times, are classified correctly as class 2 5 times, and are classified as class 3 6 times. To find the percentages we just divide everything by the sum 4 + 5 + 6 = 15:
4/15 of the class 2 objects are classified as class 1
5/15 of the class 2 objects are classified as class 2
6/15 of the class 2 objects are classified as class 3
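A minimal NumPy sketch of this row normalization, assuming the example matrix above is stored as a NumPy array (names here are just for illustration):

import numpy as np

# The example matrix from above: M[i, j] = objects of real class i classified as class j
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

# Divide each row by its sum; keepdims keeps the row sums as a column so they broadcast per row
row_normalized = M / M.sum(axis=1, keepdims=True)
print(row_normalized[1])  # -> 4/15, 5/15, 6/15 for real class 2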
Finding what classes are responsible for each classification
The second thing you can do is to look at each result from your classifier and ask how many of those results originate from each real class. It's going to be similar to the other case, but with columns instead of rows. In our example, our classifier returns "1" 1 time when the original class is 1, 4 times when the original class is 2, and 7 times when the original class is 3. To find the percentages we divide by the sum 1 + 4 + 7 = 12:
1/12 of the objects classified as class 1 were from class 1
4/12 of the objects classified as class 1 were from class 2
7/12 of the objects classified as class 1 were from class 3
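The same kind of sketch works for the column-wise version, again assuming the example matrix is a NumPy array:

import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

# axis=0 sums down each column, so each column is divided by its own total
col_normalized = M / M.sum(axis=0, keepdims=True)
print(col_normalized[:, 0])  # -> 1/12, 4/12, 7/12 for the objects classified as class 1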
--
Of course, both of the methods I gave only apply to a single row or column at a time, and I'm not sure if it would be a good idea to actually modify your confusion matrix in this form. However, this should give the percentages you are looking for.
Answered by damienfrancois
The matrix output by sklearn's confusion_matrix() is such that
C_{i, j} is equal to the number of observations known to be in group i but predicted to be in group j
so to get the percentages for each class (often called specificity and sensitivity in binary classification) you need to normalize by row: replace each element in a row by itself divided by the sum of the elements of that row.
Note that sklearn has a summary function available that computes metrics from the confusion matrix: classification_report. It outputs precision and recall rather than specificity and sensitivity, but those are often regarded as more informative in general (especially for imbalanced multi-class classification).
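A short sketch of what that looks like, using small hypothetical label lists just for illustration:

from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 2, 0, 1]  # hypothetical ground-truth labels
y_pred = [0, 1, 0, 1, 2, 2, 1]  # hypothetical predictions

# Prints per-class precision, recall, f1-score and support as a text table
print(classification_report(y_true, y_pred))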
Answered by Fred Foo
Suppose that
>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
       [1, 2, 0],
       [0, 0, 1]])
Then, to find out how many samples per class have received their correct label, you need
>>> C / C.astype(np.float).sum(axis=1)
array([[ 0.33333333, 0.33333333, 1. ],
       [ 0.33333333, 0.66666667, 0. ],
       [ 0. , 0. , 1. ]])
The diagonal contains the required values. Another way to compute these is to realize that what you're computing is the recall per class:
>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([ 0.33333333, 0.66666667, 1. ])
Similarly, if you divide by the sum over axis=0, you get the precision (the fraction of class-k predictions that have ground truth label k):
>>> C / C.astype(np.float).sum(axis=0)
array([[ 0.5 , 0.33333333, 0.5 ],
       [ 0.5 , 0.66666667, 0. ],
       [ 0. , 0. , 0.5 ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([ 0.5 , 0.66666667, 0.5 ])
Answered by Antoni
From the sklearn documentation (plot example)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
where cm is the confusion matrix as provided by sklearn.
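As a side note, recent scikit-learn releases (0.22+, if that is what you have installed) let confusion_matrix do this normalization itself via its normalize argument; a minimal sketch with hypothetical labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 0, 1]  # hypothetical labels
y_pred = [0, 1, 0, 1, 2, 2, 1]

# normalize='true' divides by row (true-label) sums; 'pred' and 'all' are the other options
cm_normalized = confusion_matrix(y_true, y_pred, normalize='true')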
Answered by Pranzell
There's a library, scikit-plot, built around scikit-learn for plotting such graphs. It is based on matplotlib, which should already be installed before proceeding further.
pip install scikit-plot
Now, just set the normalize parameter to True:
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(Y_TRUE, Y_PRED, normalize=True)
Answered by BringBackCommodore64
Using Seaborn you can easily print a normalised AND pretty confusion matrix with a heatmap:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
# Normalise each row by its sum
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)
Answered by Ignacio Peletier
I think the easiest way to do this is by doing:
c = sklearn.metrics.confusion_matrix(y, y_pred)
normed_c = (c.T / c.astype(np.float).sum(axis=1)).T
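As a quick sanity check on the snippet above (assuming numpy is imported as np and every true class occurs at least once), each row of normed_c should now sum to 1:

# Rows of the row-normalized matrix sum to 1, up to floating-point error
assert np.allclose(normed_c.sum(axis=1), 1.0)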


