Python: Using the predict_proba() function of RandomForestClassifier in the safe and right way

Question

Asked by Clinical

I'm using scikit-learn to apply machine learning algorithms to my data sets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of labeling emails as Spam/Not Spam, I would like to get, for example, a 0.78 probability that a given email is Spam.

For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

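from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y: training features and labels; Xtest: features of the emails to score
# min_samples_split must be at least 2 in current scikit-learn releases
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

classifier = clf.fit(X, y)
# One row per sample, one column per class, ordered as in classifier.classes_
predictions = classifier.predict_proba(Xtest)
print(predictions)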

And I got these results:

 [ 0.4  0.6]
 [ 0.1  0.9]
 [ 0.2  0.8]
 [ 0.7  0.3]
 [ 0.3  0.7]
 [ 0.3  0.7]
 [ 0.7  0.3]
 [ 0.4  0.6]

Here the second column is for the class Spam. However, there are two things about these results I am not confident about. First, do the results really represent the probabilities of the labels, without being affected by the size of my data? Second, the results show only one decimal digit, which is not specific enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get, for example, the next 5 digits?

Answer 1

Accepted answer by Sebastien

I get more than one digit in my results. Are you sure it is not due to your dataset? (For example, a very small dataset would yield simple decision trees and therefore 'simple' probabilities.) Otherwise it may just be the display that shows only one digit; try printing predictions[0,0].
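
One way to check whether the single digit is only a display effect is sketched below. The `predictions` array here is a made-up stand-in for the output of predict_proba(), not the asker's data, and forcing five decimals with `np.array2string` is just one possible formatting choice.

```python
import numpy as np

# Hypothetical stand-in for the array returned by predict_proba()
predictions = np.array([[0.4, 0.6], [0.701, 0.299]])

# A single element is printed with as many digits as the float needs
print(predictions[0, 0])

# Format the whole array with exactly five decimals
formatted = np.array2string(predictions, precision=5, floatmode="fixed")
print(formatted)
```

If the values still end in zeros after this, the coarse probabilities come from the model itself and not from the printing.
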
I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many spams, the usual approach is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the global probability of spam. And if you want to analyse your model globally, we usually compute the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve (see the Wikipedia article on ROC curves). Basically, the ROC curve describes your predictions as a function of the threshold t.
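
The threshold idea and the AUC can be sketched roughly as follows. This is an illustration on synthetic data from `make_classification`, not the asker's emails; the helper `predict_with_threshold`, the thresholds 0.3 and 0.7, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for Spam (1) / Not Spam (0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Column 1 holds the probability of class 1 (see clf.classes_)
spam_proba = clf.predict_proba(X_test)[:, 1]

def predict_with_threshold(proba, t):
    """Hypothetical helper: predict 1 when proba(label==1) > t."""
    return (proba > t).astype(int)

low = predict_with_threshold(spam_proba, 0.3)
high = predict_with_threshold(spam_proba, 0.7)
# A higher threshold can only reduce the share of emails flagged as spam
print("flagged at t=0.3:", low.mean(), "at t=0.7:", high.mean())

# AUC summarises the ROC curve over all thresholds at once
auc = roc_auc_score(y_test, spam_proba)
print("ROC AUC:", auc)
```

Note that the AUC is computed from the probabilities, not from the thresholded labels, since it is meant to summarise every possible choice of t.
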

Hope it helps!

Answer 2

Answer by Andreus

A RandomForestClassifier is a collection of DecisionTreeClassifiers. When the trees are grown to full depth, as with your max_depth=None, then no matter how big your training set is, a decision tree simply returns a decision: one class has probability 1, and the other classes have probability 0.

The RandomForest simply votes among the results. predict_proba() returns the number of votes for each class (each tree in the forest makes its own decision and chooses exactly one class), divided by the number of trees in the forest. Hence, your precision is exactly 1/n_estimators. Want more "precision"? Add more estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators, and often not that many.
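
The 1/n_estimators granularity can be checked with a small experiment such as the one below. It is a sketch on synthetic data, and the helper `spam_probabilities` is a name invented for the example. One caveat worth stating: scikit-learn actually averages the per-tree class probabilities, which reduces to vote counting only when the leaves are pure, as happens with fully grown trees on data without conflicting duplicates; limiting max_depth or raising min_samples_leaf would give finer-grained values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with continuous features, so fully grown trees have pure leaves
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

def spam_probabilities(n_estimators):
    """Hypothetical helper: class-1 probabilities from a fully grown forest."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)[:, 1]

results = {}
for n in (10, 100):
    proba = spam_probabilities(n)
    results[n] = proba
    # Each value should be k / n for some integer number of votes k
    votes = proba * n
    print(n, "estimators, all multiples of 1/n:",
          np.allclose(votes, np.round(votes)))
```

With 10 estimators the probabilities can only take values such as 0.0, 0.1, ..., 1.0, which matches the one-digit output in the question.
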
