Python scikit-learn .predict() default threshold

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19984957/


scikit-learn .predict() default threshold

python, machine-learning, classification, scikit-learn

Asked by ADJ

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability.

In a binary classification problem, is scikit's classifier.predict() using 0.5 by default? If it doesn't, what's the default method? If it does, how do I change it?

In scikit some classifiers have the class_weight='auto' option, but not all do. With class_weight='auto', would .predict() use the actual population proportion as a threshold?

What would be the way to do this in a classifier like MultinomialNB that doesn't support class_weight? Other than using predict_proba() and then calculating the classes myself.

Accepted answer by Fred Foo

is scikit's classifier.predict() using 0.5 by default?

In probabilistic classifiers, yes. It's the only sensible threshold from a mathematical viewpoint, as others have explained: predicting the class with the higher posterior probability minimizes the expected misclassification rate, and in the binary case that is the class whose probability exceeds 0.5.

What would be the way to do this in a classifier like MultinomialNB that doesn't support class_weight?

You can set the class_prior, which is the prior probability P(y) per class y. That effectively shifts the decision boundary. E.g.

>>> from sklearn.naive_bayes import MultinomialNB
# minimal dataset
>>> X = [[1, 0], [1, 0], [0, 1]]
>>> y = [0, 0, 1]
# use the empirical prior, learned from y
>>> MultinomialNB().fit(X, y).predict([[1, 1]])
array([0])
# use a custom prior to make class 1 more likely
>>> MultinomialNB(class_prior=[.1, .9]).fit(X, y).predict([[1, 1]])
array([1])

Answered by lejlot

You seem to be confusing concepts here. A threshold is not a concept that applies to a "generic classifier": the most basic approaches are based on some tunable threshold, but most existing methods create complex rules for classification which cannot (or at least shouldn't) be seen as thresholding.

So first, one cannot answer your question about scikit's default classifier threshold, because there is no such thing.

Second, class weighting is not about a threshold; it is about the classifier's ability to deal with imbalanced classes, and it depends on the particular classifier. In the SVM case, for example, it is the way of weighting the slack variables in the optimization problem, or, if you prefer, the upper bounds for the Lagrange multiplier values connected with particular classes. Setting this to 'auto' means using some default heuristic, but once again, it cannot simply be translated into a threshold.

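For illustration, a minimal sketch of setting class weights on an SVM in scikit-learn (note that in current versions the 'auto' heuristic has been renamed 'balanced'; the 10x weight below is an arbitrary example value):

from sklearn.svm import SVC

# Weight the slack-variable penalty per class: errors on the rare
# class 1 cost ten times as much as errors on the majority class 0.
clf = SVC(class_weight={0: 1, 1: 10})

# Or let scikit-learn derive weights inversely proportional to class
# frequencies (the heuristic behind the old 'auto' option).
clf_balanced = SVC(class_weight='balanced')
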
Naive Bayes, on the other hand, directly estimates the class probabilities from the training set. This is called the "class prior", and you can set it in the constructor with the class_prior parameter.

From the documentation:

Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

Answered by denson

The threshold in scikit-learn is 0.5 for binary classification; for multiclass classification, whichever class has the greatest probability is predicted. In many problems a much better result may be obtained by adjusting the threshold. However, this must be done with care, and NOT on the holdout test data, but by cross-validation on the training data. If you do any adjustment of the threshold on your test data, you are just overfitting the test data.

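For illustration, a minimal sketch of tuning the threshold this way, assuming training arrays X_train and y_train; the logistic regression classifier and the F1 selection metric here are arbitrary placeholder choices:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities on the TRAINING data only, so the
# threshold is never tuned against the held-out test set.
clf = LogisticRegression()
oof_prob = cross_val_predict(clf, X_train, y_train, cv=5,
                             method="predict_proba")[:, 1]

# Brute-force scan of candidate thresholds, scored with F1.
candidates = np.linspace(0.05, 0.95, 19)
best_threshold = max(candidates,
                     key=lambda t: f1_score(y_train, oof_prob >= t))
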
Most methods of adjusting the threshold are based on the receiver operating characteristic (ROC) curve and Youden's J statistic, but it can also be done by other methods, such as a search with a genetic algorithm.

Here is a peer-reviewed journal article describing this approach in medicine:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2515362/

As far as I know there is no package for doing it in Python, but it is relatively simple (though inefficient) to find it with a brute-force search in Python.

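The search is short enough to write inline; here is a minimal sketch of the Youden approach in Python, leaning on sklearn.metrics.roc_curve for the ROC coordinates (youden_threshold is a hypothetical helper; y_true are the true labels and y_scores the positive-class probabilities):

import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_scores):
    # Youden's J statistic: J = sensitivity + specificity - 1 = tpr - fpr.
    # The best cutoff is the candidate threshold that maximizes J.
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    return thresholds[np.argmax(tpr - fpr)]
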
This is some R code that does it.

## load data
DD73OP <- read.table("/my_probabilites.txt", header=T, quote="\"")

library("pROC")
# No smoothing
roc_OP <- roc(DD73OP$tc, DD73OP$prob)
auc_OP <- auc(roc_OP)
auc_OP
# Area under the curve: 0.8909
plot(roc_OP)

# Best threshold
# Method: Youden
# Youden's J statistic (Youden, 1950) is employed. The optimal cut-off is the
# threshold that maximizes the distance to the identity (diagonal) line.
# Can be shortened to "y".
# The optimality criterion is: max(sensitivities + specificities)
coords(roc_OP, "best", ret=c("threshold", "specificity", "sensitivity"), best.method="youden")
# threshold specificity sensitivity
# 0.7276835   0.9092466   0.7559022

Answered by michalw

In case someone visits this thread hoping for a ready-to-use function (Python 2.7): in this example the cutoff is designed to reflect the ratio of events to non-events in the original dataset df, while y_prob could be the result of the .predict_proba() method (assuming a stratified train/test split).

import numpy as np

def predict_with_cutoff(colname, y_prob, df):
    # observed event rate (in percent) in the original dataset
    n_events = df[colname].values
    event_rate = sum(n_events) / float(df.shape[0]) * 100
    # pick the cutoff so the predicted event rate matches the observed one
    threshold = np.percentile(y_prob[:, 1], 100 - event_rate)
    print("Cutoff/threshold at: " + str(threshold))
    y_pred = [1 if x >= threshold else 0 for x in y_prob[:, 1]]
    return y_pred
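
A hypothetical usage, assuming a fitted classifier clf, held-out features X_test, and the original frame df with a binary target column named "event":

y_prob = clf.predict_proba(X_test)  # shape (n_samples, 2)
y_pred = predict_with_cutoff("event", y_prob, df)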

Feel free to criticize/modify. Hopefully it helps in the rare cases when class balancing is out of the question and the dataset itself is highly imbalanced.

Answered by Yuchao Jiang

The threshold can be set using clf.predict_proba()

for example:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=2)
clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)  # default threshold is 0.5
y_pred = (clf.predict_proba(X_test)[:, 1] >= 0.3).astype(bool)  # set threshold to 0.3