Python Scikit-learn predict_proba gives wrong answers

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17017882/ (StackOverflow)

Scikit-learn predict_proba gives wrong answers

python, scikit-learn

Asked by Alex

This is a follow-up question from "How to know what classes are represented in return array from predict_proba in Scikit-learn".

In that question, I quoted the following code:

>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])

I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:

>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]

So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:

>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]

Again, obviously incorrect, but in the other direction.

Finally, I tried it with points that were much further away.

>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]

Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!

>>> model.predict([1,1,1])[0]
'apple'

Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?

-------- UPDATE --------

Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!

>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
... 
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
... 
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]

How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?

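For now, the only safeguard I can think of is to check the agreement explicitly, i.e. compare predict against the argmax of predict_proba over model.classes_. A minimal sketch of that check on the same toy data (written with 2-D inputs, which newer scikit-learn versions expect):

import numpy as np
from sklearn import svm

def check_agreement(n):
    # same toy data as above, repeated n times
    X = [[1, 2, 3], [2, 3, 4]] * n
    Y = ['apple', 'orange'] * n
    model = svm.SVC(probability=True)
    model.fit(X, Y)
    proba = model.predict_proba([[1, 2, 3]])[0]
    by_proba = model.classes_[np.argmax(proba)]   # class with the highest estimated probability
    by_predict = model.predict([[1, 2, 3]])[0]    # class chosen by predict
    return n, by_predict, by_proba, by_predict == by_proba

for n in range(1, 10):
    print(check_agreement(n))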

Accepted answer by Bilal Dadanlar

If you use svm.LinearSVC() as the estimator, and .decision_function() (which is like svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one, the ranking agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().

The only drawback for you might be that .decision_function() gives a signed value, somewhere between -1 and 3 for example, instead of a probability value, but it agrees with the prediction.

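A minimal sketch of that idea (the three-class toy data is made up here just to show the ranking; with more than two classes decision_function returns one signed score per class):

import numpy as np
from sklearn.svm import LinearSVC

# made-up three-class toy data, just to illustrate the ranking
X = [[1, 2, 3], [2, 3, 4], [10, 10, 10]]
Y = ['apple', 'orange', 'melon']

clf = LinearSVC()
clf.fit(X, Y)

scores = clf.decision_function([[1, 2, 3]])           # shape (1, n_classes): one signed score per class
ranking = clf.classes_[np.argsort(scores[0])[::-1]]   # classes ordered from most to least likely
print(ranking)                   # the first entry agrees with clf.predict([[1, 2, 3]])
print(clf.predict([[1, 2, 3]]))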

Answered by ogrisel

predict_proba is using the Platt scaling feature of libsvm to calibrate probabilities, see:

So indeed the hyperplane predictions and the proba calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.

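You can see the two signals side by side on the question's toy data (a quick sketch; the exact numbers vary by version, but predict follows the decision function rather than the Platt-scaled probabilities):

from sklearn import svm

X = [[1, 2, 3], [2, 3, 4]]
Y = ['apple', 'orange']

model = svm.SVC(probability=True)
model.fit(X, Y)

print(model.classes_)                        # ['apple' 'orange']
print(model.decision_function([[1, 2, 3]]))  # raw signed distance to the separating hyperplane
print(model.predict_proba([[1, 2, 3]]))      # Platt-scaled estimate, which can point the other way
print(model.predict([[1, 2, 3]]))            # follows the decision function, not predict_proba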

Answered by Statmonger

Food for thought here. I think I actually got predict_proba to work as-is. Please see the code below...

import numpy as np
import pandas as pd
from sklearn import metrics, naive_bayes

# Test data
TX = [[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13,14,15], [16,17,18], [19,20,21], [22,23,24]]
TY = ['apple', 'orange', 'grape', 'kiwi', 'mango','peach','banana','pear']

VX2 = [[16,17,18], [19,20,21], [22,23,24], [13,14,15], [10,11,12], [7,8,9], [4,5,6], [1,2,3]]
VY2 = ['peach','banana','pear','mango', 'kiwi', 'grape', 'orange','apple']

VX2_df = pd.DataFrame(data=VX2) # convert to dataframe
VX2_df = VX2_df.rename(index=float, columns={0: "N0", 1: "N1", 2: "N2"})
VY2_df = pd.DataFrame(data=VY2) # convert to dataframe
VY2_df = VY2_df.rename(index=float, columns={0: "label"})

# NEW - in testing
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    #Identify the indexes of the top predictions
    #top_n_predictions = np.argsort(probas)[:,:-n-1:-1]
    top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

    #then find the associated SOC code for each prediction
    top_socs = classifier.classes_[top_n_predictions]

    #cast to a new dataframe
    top_n_df = pd.DataFrame(data=top_socs)

    #merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_n_df, left_index=True, right_index=True)

    conditions = [
        (results['label'] == results[0]),
        (results['label'] == results[1]),
        (results['label'] == results[2]),
        (results['label'] == results[3]),
        (results['label'] == results[4])]
    choices = [1, 1, 1, 1, 1]
    results['Successes'] = np.select(conditions, choices, default=0)

    print("Top 5 Accuracy Rate = ", sum(results['Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = ", metrics.accuracy_score(predictions, valid_y))

train_model(naive_bayes.MultinomialNB(), TX, TY, VX2, VY2_df, VX2_df)

Output: Top 5 Accuracy Rate = 1.0 Top 1 Accuracy Rate = 1.0

Couldn't get it to work for my own data though :(
