ROC curve and cut-off point. Python

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/28719067/

Time: 2020-08-19 03:39:09  Source: igfitidea

python logistic-regression roc

Asked by Shiva Prakash

I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:

 from sklearn import metrics
 fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

I know metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me what command will find the optimal cut-off point (threshold value)?
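For readers unfamiliar with the two calls, here is a minimal sketch of what they return. The labels and scores are hypothetical toy values, not the asker's Y_test and p, and the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn import metrics

# Toy labels and scores (hypothetical, not the asker's data).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns three arrays of equal length: one (fpr, tpr) point
# per candidate threshold, ordered from the highest threshold down.
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)

# roc_auc_score summarises the whole curve as a single number.
auc_value = metrics.roc_auc_score(y_true, scores)

for f, t, thr in zip(fpr, tpr, thresholds):
    print("threshold=%s  fpr=%.2f  tpr=%.2f" % (thr, f, t))
print("AUC = %.2f" % auc_value)
```

Any cut-off rule then amounts to picking one index into these three arrays.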

Accepted answer by Manohar Swamynathan

You can do this using the epi package in R; however, I could not find a similar package or example in Python.

The optimal cut-off point would be where the “true positive rate” is high and the “false positive rate” is low. Based on this logic, I have pulled an example below to find the optimal threshold.

Python code:

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc

# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
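# dummify rank (dtype=float keeps the dummy columns numeric for statsmodels)
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige', dtype=float)
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# label-based slicing needs .loc; prestige_1 is left out as the baseline category
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])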

# manually add the intercept
data['intercept'] = 1.0

train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())

# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])

fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

####################################
# The optimal cut off would be where tpr is high and fpr is low
# tpr - (1-fpr) is zero or near to zero is the optimal cut off point
####################################
i = np.arange(len(tpr))  # index for df
roc = pd.DataFrame({
    'fpr': pd.Series(fpr, index=i),
    'tpr': pd.Series(tpr, index=i),
    '1-fpr': pd.Series(1 - fpr, index=i),
    'tf': pd.Series(tpr - (1 - fpr), index=i),
    'thresholds': pd.Series(thresholds, index=i),
})
# row where |tpr - (1-fpr)| is smallest
print(roc.iloc[(roc.tf - 0).abs().argsort()[:1]])

# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])

The optimal cut-off point is 0.317628, so anything above this can be labeled 1 and anything else 0. You can see from the output/chart that where TPR crosses 1-FPR, the TPR is 63%, the FPR is 36%, and TPR-(1-FPR) is nearest to zero in the current example.

Output:

        1-fpr       fpr        tf  thresholds       tpr
171  0.637363  0.362637  0.000433    0.317628  0.637795

[Figure: Receiver operating characteristic; TPR and 1-FPR curves]

Hope this is helpful.

Edit

To simplify and bring in re-usability, I have made a function to find the optimal probability cutoff point.

Python Code:

def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate
    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations

    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------     
    list type, with optimal cutoff value

    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr)) 
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]

    return list(roc_t['threshold']) 


# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])

# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]

# Apply the threshold to get class predictions
# (threshold is a one-element list, so compare against threshold[0])
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)

# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175,  98],
#        [ 46,  81]])
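As a quick check, the rates implied by this confusion matrix can be recomputed by hand. This is only a sketch of the arithmetic, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix reported above (rows = true class, columns = predicted class).
tn, fp = 175, 98
fn, tp = 46, 81

tpr = tp / (tp + fn)      # sensitivity: 81 of 127 positives caught
fpr = fp / (fp + tn)      # 1 - specificity: 98 of 273 negatives flagged
specificity = 1 - fpr

print("TPR = %.4f" % tpr)
print("FPR = %.4f" % fpr)
print("TPR - (1 - FPR) = %.4f" % (tpr - specificity))
```

The TPR matches the 0.6378 in the output table. The FPR comes out slightly lower than the table's 0.3626, which may be because roc_curve counts scores equal to the threshold as positive while the code above uses a strict `>`.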

Answer by lee

Vanilla Python Implementation of Youden's J-Score

def cutoff_youdens_j(fpr,tpr,thresholds):
    j_scores = tpr-fpr
    j_ordered = sorted(zip(j_scores,thresholds))
    return j_ordered[-1][1]
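Note that `tpr-fpr` in the function above relies on element-wise subtraction, so it expects NumPy arrays such as those returned by roc_curve. Below is a minimal sketch of the same idea for plain Python lists; the name `cutoff_youdens_j_lists` and the ROC points are hypothetical, used only for illustration.

```python
def cutoff_youdens_j_lists(fpr, tpr, thresholds):
    """Return the threshold with the largest J = tpr - fpr (plain sequences)."""
    # Pair each J score with its threshold; max() compares J first and
    # falls back to the threshold on ties, like sorted(...)[-1] above.
    j_score, best_threshold = max(
        (t - f, thr) for f, t, thr in zip(fpr, tpr, thresholds)
    )
    return best_threshold


# Hypothetical ROC points (not real data).
fpr = [0.0, 0.0, 0.2, 0.4, 1.0]
tpr = [0.0, 0.4, 0.8, 0.9, 1.0]
thresholds = [float("inf"), 0.9, 0.6, 0.3, 0.1]

best = cutoff_youdens_j_lists(fpr, tpr, thresholds)
print(best)
```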

Answer by cgnorthcutt

Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

Answer by j35t3r

The post of cgnorthcutt

Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

is almost correct. The abs value must be taken.

optimal_idx = np.argmin(np.abs(tpr - fpr))  # Edit: Change to argmin!
optimal_threshold = thresholds[optimal_idx]

According to the reference mentioned (http://www.medicalbiostatistics.com/roccurve.pdf, p. 6), I've found another possibility:

opt_idx = np.argmin(np.sqrt(np.square(1-tpr) + np.square(fpr)))
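To see how these criteria behave, here is a small sketch on hypothetical ROC points (not real data; all names are illustrative). It compares the closest-to-corner rule, Youden's J from the earlier answers, and the argmin of |tpr - fpr|.

```python
import numpy as np

# Hypothetical ROC points, shaped like roc_curve output (not real data).
fpr = np.array([0.0, 0.0, 0.2, 0.4, 1.0])
tpr = np.array([0.0, 0.4, 0.8, 0.9, 1.0])
thresholds = np.array([np.inf, 0.9, 0.6, 0.3, 0.1])

# Distance from each ROC point to the perfect-classifier corner (fpr=0, tpr=1).
corner_dist = np.sqrt(np.square(1 - tpr) + np.square(fpr))
corner_idx = np.argmin(corner_dist)

# Youden's J: vertical distance above the chance diagonal.
youden_idx = np.argmax(tpr - fpr)

# |tpr - fpr| is zero on the diagonal, which includes the first ROC point.
abs_idx = np.argmin(np.abs(tpr - fpr))

print("closest to corner:", thresholds[corner_idx])
print("Youden's J:", thresholds[youden_idx])
print("argmin |tpr - fpr|:", thresholds[abs_idx])
```

On these points the corner rule and Youden's J agree, while the argmin of |tpr - fpr| lands on the first ROC point, where tpr equals fpr. It may be worth checking that variant against the others before relying on it.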

Answer by Ramesh Kumar

Although I am late to the party, you can also use the geometric mean to determine the optimal threshold, as stated here: threshold tuning for imbalance classification

It can be computed as:

from numpy import sqrt, argmax

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))