ROC curve and cut-off point in Python
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/28719067/
Asked by Shiva Prakash
I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)
I know metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me what command will find the optimal cut-off point (threshold value)?
Accepted answer by Manohar Swamynathan
You can do this using the epi package in R; however, I could not find a similar package or example in Python.
The optimal cut-off point would be where the "true positive rate" is high and the "false positive rate" is low. Based on this logic, I have pulled together an example below to find the optimal threshold.
Python code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc
# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
# dummify rank
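# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige', dtype=float)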
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# label-based slicing needs .loc (.iloc only accepts integer positions)
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
# manually add the intercept
data['intercept'] = 1.0
train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())
# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])
fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
####################################
# The optimal cut off would be where tpr is high and fpr is low
# tpr - (1-fpr) is zero or near to zero is the optimal cut off point
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})
# show the row where |tpr - (1-fpr)| is smallest
print(roc.iloc[(roc.tf-0).abs().argsort()[:1]])
# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])
The optimal cut-off point is 0.317628, so anything above this can be labeled 1 and anything else 0. You can see from the output/chart that where TPR crosses 1-FPR, the TPR is 63%, the FPR is 36%, and TPR-(1-FPR) is nearest to zero in the current example.
Output:
        1-fpr       fpr        tf  thresholds       tpr
171  0.637363  0.362637  0.000433    0.317628  0.637795
Hope this is helpful.
Edit
To simplify and bring in re-usability, I have made a function to find the optimal probability cutoff point.
Python Code:
def Find_Optimal_Cutoff(target, predicted):
    """Find the optimal probability cutoff point for a classification model related to event rate.

    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations
    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------
    list type, with optimal cutoff value
    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]
    return list(roc_t['threshold'])
# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])
# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]
# Find prediction to the dataframe applying threshold
# threshold is a one-element list, so compare against its first element
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)
# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175, 98],
# [ 46, 81]])
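As a quick sanity check, the rates at the chosen cut-off can be recomputed by hand from the confusion matrix printed above. This is only a sketch: it assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes), and the variable names are illustrative.

```python
# Confusion matrix from the output above, in scikit-learn's layout:
# rows = actual class (0, 1), columns = predicted class (0, 1)
tn, fp = 175, 98
fn, tp = 46, 81

tpr = tp / (tp + fn)  # true positive rate (sensitivity)
fpr = fp / (fp + tn)  # false positive rate (1 - specificity)

print(round(tpr, 6), round(fpr, 6))
```

The TPR agrees with the 0.637795 in the ROC output above. The FPR comes out slightly lower than the 0.362637 reported there. One plausible explanation is that the labelling step uses a strict `>`, while roc_curve treats scores greater than or equal to the threshold as positive, so an observation scoring exactly at the threshold is counted differently.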
Answer by lee
Vanilla Python Implementation of Youden's J-Score
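def cutoff_youdens_j(fpr, tpr, thresholds):
    # Youden's J statistic for every ROC point (fpr and tpr are NumPy arrays)
    j_scores = tpr - fpr
    # sort (score, threshold) pairs; the last pair has the largest J
    j_ordered = sorted(zip(j_scores, thresholds))
    return j_ordered[-1][1]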
Answer by cgnorthcutt
Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
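For context, here is a minimal end-to-end sketch of those two lines, including how the chosen threshold might then be applied to the scores. The labels and scores are a tiny made-up example, not the data from the question, and the names `y_true`, `y_score` and `y_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Tiny made-up example: true labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic: pick the threshold that maximizes tpr - fpr
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

# roc_curve counts a score as positive when score >= threshold,
# so the same comparison is used when labelling
y_pred = (y_score >= optimal_threshold).astype(int)

print(optimal_threshold, y_pred)
```

In this toy case two ROC points share the same maximum of tpr - fpr. np.argmax returns the first of them, which here is the higher threshold.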
Answer by j35t3r
The post of cgnorthcutt
Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
is almost correct. The abs value must be taken.
optimal_idx = np.argmin(np.abs(tpr - fpr))  # Edit: Change to argmin!
optimal_threshold = thresholds[optimal_idx]
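One thing that may be worth checking before using the argmin form: a ROC curve starts at (0, 0) and ends at (1, 1), so |tpr - fpr| is exactly zero at both endpoints. The sketch below compares the two expressions on made-up ROC points (the arrays are hypothetical, not output from the question's model).

```python
import numpy as np

# Made-up ROC points for illustration only
fpr = np.array([0.0, 0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
tpr = np.array([0.0, 0.3, 0.6, 0.8, 0.9, 1.0, 1.0])
thresholds = np.array([np.inf, 0.9, 0.7, 0.5, 0.3, 0.2, 0.1])

# largest vertical distance above the chance diagonal
idx_argmax = np.argmax(tpr - fpr)
# smallest distance from the chance diagonal
idx_argmin = np.argmin(np.abs(tpr - fpr))

print(idx_argmax, thresholds[idx_argmax])
print(idx_argmin, thresholds[idx_argmin])
```

On these points the argmax form picks the interior point at threshold 0.5, while the argmin form picks index 0, the first point of the curve, where nothing is predicted positive.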
According to the reference mentioned (http://www.medicalbiostatistics.com/roccurve.pdf, p. 6), I've found another possibility:
opt_idx = np.argmin(np.sqrt(np.square(1-tpr) + np.square(fpr)))
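As a minimal sketch, this closest-to-top-left criterion can be wrapped in a small function. The name `cutoff_closest_to_corner` and the toy ROC points are hypothetical. `np.hypot(a, b)` computes sqrt(a**2 + b**2) element-wise, which is the same distance as in the expression above.

```python
import numpy as np

def cutoff_closest_to_corner(fpr, tpr, thresholds):
    """Return the threshold whose ROC point is nearest to (fpr=0, tpr=1).

    Hypothetical helper, written only to illustrate the expression above.
    """
    fpr, tpr, thresholds = map(np.asarray, (fpr, tpr, thresholds))
    distances = np.hypot(1 - tpr, fpr)  # Euclidean distance to the top-left corner
    return thresholds[np.argmin(distances)]

# Made-up ROC points for illustration only
toy_fpr = [0.0, 0.0, 0.1, 0.2, 0.4, 0.7, 1.0]
toy_tpr = [0.0, 0.3, 0.6, 0.8, 0.9, 1.0, 1.0]
toy_thresholds = [np.inf, 0.9, 0.7, 0.5, 0.3, 0.2, 0.1]

best = cutoff_closest_to_corner(toy_fpr, toy_tpr, toy_thresholds)
print(best)
```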
Answer by Ramesh Kumar
Although I am late to the party, you can also use the geometric mean (G-mean) to determine the optimal threshold, as stated here: threshold tuning for imbalance classification
It can be computed as:
from numpy import sqrt, argmax
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))