Different results with roc_auc_score() and auc() in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31159157/

Date: 2020-08-19 09:34:25  Source: igfitidea

Different result with roc_auc_score() and auc()

python machine-learning scikit-learn

Asked by gowithefloww

I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn.

I'm trying to predict a binary output with imbalanced classes (around 1.5% for Y=1).

Classifier


model_logit = LogisticRegression(class_weight='balanced')  # 'auto' was renamed 'balanced' in later sklearn versions
model_logit.fit(X_train_ridge, Y_train)

ROC curve

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, model_logit.predict_proba(xtest)[:,1])

AUCs

auc(false_positive_rate, true_positive_rate)
Out[490]: 0.82338034042531527

and

roc_auc_score(Y_test, model_logit.predict(xtest))
Out[493]: 0.75944737191205602

Can somebody explain this difference? I thought both were just calculating the area under the ROC curve. It might be because of the imbalanced dataset, but I could not figure out why.

Thanks!


Accepted answer by oopcode

AUC is not always the area under a ROC curve. Area Under the Curve is the (abstract) area under some curve, so it is a more general notion than AUROC. With imbalanced classes, it may be better to find the AUC of a precision-recall curve.
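A minimal sketch of that suggestion, using synthetic imbalanced data (the dataset and every name below are illustrative, not from the question): the precision-recall AUC can be computed with precision_recall_curve() plus the same auc() helper.

```python
# Sketch: AUC of a precision-recall curve on an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc

# ~1.5% positives, roughly matching the question's class balance.
X, y = make_classification(n_samples=2000, weights=[0.985, 0.015],
                           random_state=0)
clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]

# precision_recall_curve gives one (precision, recall) point per threshold;
# auc() then integrates that curve with recall on the x-axis.
precision, recall, _ = precision_recall_curve(y, probs)
pr_auc = auc(recall, precision)
print(pr_auc)
```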

See the sklearn source for roc_auc_score:

def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
    # <...> docstring <...>
    def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
            # <...> bla-bla <...>

            fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                            sample_weight=sample_weight)
            return auc(fpr, tpr, reorder=True)

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight) 

As you can see, this first computes a ROC curve and then calls auc() to get the area.

I guess your problem is the predict_proba() call. For a plain predict() the outputs are always the same:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score

est = LogisticRegression(class_weight='balanced')  # 'auto' was renamed 'balanced'
X = np.random.rand(10, 2)
y = np.random.randint(2, size=10)
est.fit(X, y)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict(X))
print(auc(false_positive_rate, true_positive_rate))
# 0.857142857143
print(roc_auc_score(y, est.predict(X)))
# 0.857142857143

If you change the above to this, you'll sometimes get different outputs:

false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict_proba(X)[:,1])
# may differ
print(auc(false_positive_rate, true_positive_rate))
print(roc_auc_score(y, est.predict(X)))

Answered by Andreus

predict returns only one class or the other. When you compute a ROC with the results of predict on a classifier, there are only three thresholds (trivially all one class, trivially all the other class, and one in between). Your ROC curve looks like this:

      ..............................
      |
      |
      |
......|
|
|
|
|
|
|
|
|
|
|
|

Meanwhile, predict_proba() returns an entire range of probabilities, so now you can put more than three thresholds on your data.

             .......................
             |
             |
             |
          ...|
          |
          |
     .....|
     |
     |
 ....|
.|
|
|
|
|

Hence the different areas.
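The threshold counts behind those two pictures are easy to check on synthetic data (a sketch; the dataset and variable names are illustrative):

```python
# Sketch: count the thresholds roc_curve finds for hard labels vs. scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

# Hard 0/1 labels: at most three thresholds, hence the blocky curve.
_, _, thr_labels = roc_curve(y, clf.predict(X))
# Probabilities: one threshold per distinct score (drop_intermediate=False
# keeps them all), hence a much finer curve.
_, _, thr_probs = roc_curve(y, clf.predict_proba(X)[:, 1],
                            drop_intermediate=False)
print(len(thr_labels), len(thr_probs))
```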

Answered by Dayvid Oliveira

When you use y_pred (class labels), you have already decided on the threshold. When you use y_prob (positive-class probabilities), you leave the threshold open, and the ROC curve should help you decide it.

For the first case, you are using the probabilities:

y_probs = clf.predict_proba(xtest)[:,1]
fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)
auc(fp_rate, tp_rate)

When you do that, you're considering the AUC 'before' taking a decision on the threshold you'll be using.


In the second case, you are using the prediction (not the probabilities); in that case, use 'predict' instead of 'predict_proba' for both and you should get the same result.

y_pred = clf.predict(xtest)
fp_rate, tp_rate, thresholds = roc_curve(y_true, y_pred)
print(auc(fp_rate, tp_rate))
# 0.857142857143

print(roc_auc_score(y_true, y_pred))
# 0.857142857143
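Conversely, feeding the probabilities to both metrics also makes them agree, since roc_auc_score() computes the same ROC-then-area pipeline internally (a sketch on synthetic data; all names below are illustrative):

```python
# Sketch: with probability scores, auc() over the ROC points matches
# roc_auc_score() on the same scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score

X, y = make_classification(n_samples=500, random_state=1)
clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(y, probs)
area = auc(fpr, tpr)        # area under the ROC points
score = roc_auc_score(y, probs)  # same quantity, computed directly
print(area, score)
```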