Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/30814231/
Using the predict_proba() function of RandomForestClassifier in the safe and right way
Asked by Clinical
I'm using Scikit-learn to apply machine learning algorithms to my data sets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as the labels of emails, I would like to have, for example, a 0.78 probability that a given email is Spam.
For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:
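from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y are the training features and labels; Xtest holds the emails to score.
# min_samples_split must be at least 2 in current scikit-learn (the post used 1).
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)
print(predictions)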
And I got these results:
[ 0.4 0.6]
[ 0.1 0.9]
[ 0.2 0.8]
[ 0.7 0.3]
[ 0.3 0.7]
[ 0.3 0.7]
[ 0.7 0.3]
[ 0.4 0.6]
Here the second column is for the class Spam. However, there are two things about these results that I am not confident about. First, do the results represent the probabilities of the labels regardless of the size of my data? Second, the results show only one decimal digit, which is not specific enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get, for example, five decimal digits?
Accepted answer by Sebastien
I get more than one digit in my results. Are you sure it is not due to your dataset? (For example, a very small dataset would yield simple decision trees and therefore 'simple' probabilities.) Otherwise it may only be the display that shows one digit; try printing predictions[0,0].
I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many spams, what is usually done is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the overall share of emails predicted as spam. And if you want to analyse your model globally, we usually compute the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve (see the Wikipedia article on the ROC curve). Basically, the ROC curve describes your predictions as a function of the threshold t.
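The threshold and AUC ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification stands in for the asker's emails), class 1 plays the role of "spam", and the threshold value 0.7 is a hypothetical choice, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the asker's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
proba_spam = clf.predict_proba(X_test)[:, 1]  # column 1 = class 1 ("spam")

# Print a single raw value instead of relying on the rounded array display.
print(repr(proba_spam[0]))

# Predict "spam" only when the probability exceeds a threshold t.
t = 0.7  # hypothetical value; it would normally be tuned on validation data
pred_default = (proba_spam > 0.5).astype(int)
pred_strict = (proba_spam > t).astype(int)
print("flagged at t=0.5:", pred_default.sum())
print("flagged at t=0.7:", pred_strict.sum())

# Threshold-independent summary: area under the ROC curve.
auc = roc_auc_score(y_test, proba_spam)
print("ROC AUC:", auc)
```

Raising t can only shrink the set of emails flagged as spam, which is the balancing effect the answer describes; the AUC summarizes behaviour across all thresholds at once.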
Hope it helps!
Answer by Andreus
A RandomForestClassifier is a collection of DecisionTreeClassifiers. No matter how big your training set, a fully grown decision tree (as with max_depth=None here) simply returns a decision: one class has probability 1, the other classes have probability 0.
The RandomForest simply votes among the results. predict_proba() returns the number of votes for each class (each tree in the forest makes its own decision and chooses exactly one class), divided by the number of trees in the forest. Hence, your precision is exactly 1/n_estimators. Want more "precision"? Add more estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators, and often not that many.
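A small experiment can make the 1/n_estimators granularity visible. This is a sketch on synthetic data (an assumption, not the asker's dataset). Strictly speaking, scikit-learn averages each tree's class probabilities rather than counting hard votes, but with fully grown trees every leaf is pure, so each tree contributes exactly 0 or 1 and the two views coincide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for the asker's dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_new = X[:200], y[:200], X[200:]

def spam_probabilities(n_estimators):
    # Default settings grow each tree fully, so every leaf is pure here.
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_new)[:, 1]

proba_10 = spam_probabilities(10)
proba_100 = spam_probabilities(100)

# Each value should be a whole number of votes divided by the tree count.
votes_10 = proba_10 * 10
votes_100 = proba_100 * 100

distinct_10 = np.unique(np.round(proba_10, 6))
distinct_100 = np.unique(np.round(proba_100, 6))
print(distinct_10)        # multiples of 0.1 only
print(len(distinct_100))  # at most 101 distinct values
```

If the trees are restricted (for example with max_depth or min_samples_leaf), leaves are no longer pure and the probabilities stop being exact multiples of 1/n_estimators.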