Accuracy Score ValueError: Can't Handle mix of binary and continuous target
This page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page) and keep the same license.
Original question: http://stackoverflow.com/questions/38015181/
Asked by Arij SEDIRI
I'm using linear_model.LinearRegression from scikit-learn as a predictive model. The model itself works fine, but I have a problem evaluating the predicted results with the accuracy_score metric.
This is my true data:
array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
My predicted data:
array([ 0.07094605, 0.1994941 , 0.19270157, 0.13379635, 0.04654469,
0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
0.25390082, 0.17248324])
My code:
accuracy_score(y_true, y_pred, normalize=False)
Error message:
ValueError: Can't handle mix of binary and continuous target
Any help? Thank you.
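The error can be reproduced directly from the data above. This is a minimal sketch, assuming NumPy and scikit-learn are installed; the variable name error_message is only for illustration and does not come from the original post.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Data copied from the question.
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([
    0.07094605, 0.1994941, 0.19270157, 0.13379635, 0.04654469,
    0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
    0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
    0.25390082, 0.17248324,
])

# accuracy_score expects labels on both sides; continuous predictions
# make it raise ValueError, which is captured here instead of crashing.
error_message = None
try:
    accuracy_score(y_true, y_pred, normalize=False)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```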
Accepted answer by natbusa
EDIT (after comment): the code below will resolve the coding issue, but this approach is strongly discouraged, because a linear regression model is a very poor classifier and will very likely not separate the classes correctly.
Read the well-written answer below by @desertnaut, which explains why this error is a hint that something is wrong in the machine learning approach rather than something you have to 'fix'.
accuracy_score(y_true, y_pred.round(), normalize=False)
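To see what rounding does to the predictions from the question, the count that accuracy_score(..., normalize=False) would report can be worked out by hand with NumPy alone. This is only an illustrative sketch; the names y_rounded, n_correct and distinct_predictions are not from the original post.

```python
import numpy as np

# Data copied from the question.
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([
    0.07094605, 0.1994941, 0.19270157, 0.13379635, 0.04654469,
    0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
    0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
    0.25390082, 0.17248324,
])

y_rounded = y_pred.round()

# Number of positions where the rounded prediction equals the true label.
n_correct = int((y_true == y_rounded).sum())

# Distinct values that survive rounding (-0.0 and 0.0 compare equal).
distinct_predictions = sorted({abs(float(v)) for v in y_rounded})

print(n_correct, "of", len(y_true), "match; distinct predictions:", distinct_predictions)
```

Every prediction is below 0.5, so all of them round to 0: the 11 matches are simply the 11 true zeros, and none of the six positive samples is recovered.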
Answer by desertnaut
Despite the plethora of wrong answers here that attempt to circumvent the error by numerically manipulating the predictions, the root cause of your error is a theoretical and not a computational issue: you are trying to use a classification metric (accuracy) with a regression (i.e. numeric prediction) model (LinearRegression), which is meaningless.
Just like the majority of performance metrics, accuracy compares apples to apples (i.e. true labels of 0/1 with predictions that are again 0/1); so, when you ask the function to compare binary true labels (apples) with continuous predictions (oranges), you get an expected error, where the message tells you exactly what the problem is from a computational point of view:
Classification metrics can't handle a mix of binary and continuous target
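The wording of the message matches the target types that scikit-learn infers from the two arrays. As a hedged illustration (the arrays below are made-up examples, and how accuracy_score uses this helper internally is an assumption on my part), the public helper sklearn.utils.multiclass.type_of_target reports those types:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical small arrays in the spirit of the question's data.
labels = np.array([1, 1, 0, 0, 1])
scores = np.array([0.07, 0.19, -0.01, 0.32, 0.25])

label_type = type_of_target(labels)  # integer 0/1 labels
score_type = type_of_target(scores)  # non-integer floats

print(label_type, score_type)
```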
Although the message does not tell you directly that you are trying to compute a metric that is invalid for your problem (and we shouldn't actually expect it to go that far), it is certainly a good thing that scikit-learn at least gives you a direct and explicit warning that you are attempting something wrong; this is not necessarily the case with other frameworks. See for example the behavior of Keras in a very similar situation, where you get no warning at all, and one just ends up complaining about low "accuracy" in a regression setting.
I am very surprised by all the other answers here (including the accepted and highly upvoted one), which effectively suggest manipulating the predictions simply to get rid of the error. It's true that, once we end up with a set of numbers, we can certainly start meddling with them in various ways (rounding, thresholding, etc.) in order to make our code behave, but this of course does not mean that our numeric manipulations are meaningful in the specific context of the ML problem we are trying to solve.
So, to wrap up: the problem is that you are applying a metric (accuracy) that is inappropriate for your model (LinearRegression): if you are in a classification setting, you should change your model (e.g. use LogisticRegression instead); if you are in a regression (i.e. numeric prediction) setting, you should change the metric. Check the list of metrics available in scikit-learn, where you can confirm that accuracy is used only in classification.
Compare also the situation with a recent SO question, where the OP is trying to get the accuracy of a list of models:
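from sklearn import svm, linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
models = []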
models.append(('SVM', svm.SVC()))
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SGDRegressor', linear_model.SGDRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('BayesianRidge', linear_model.BayesianRidge())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('LassoLars', linear_model.LassoLars())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('ARDRegression', linear_model.ARDRegression())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('PassiveAggressiveRegressor', linear_model.PassiveAggressiveRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('TheilSenRegressor', linear_model.TheilSenRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('LinearRegression', linear_model.LinearRegression())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
where the first 6 models work OK, while all the rest (commented-out) ones give the same error. By now, you should be able to convince yourself that all the commented-out models are regression (and not classification) ones, hence the justified error.
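One way to check which kind of estimator you are holding, rather than discovering it through the error, is to use scikit-learn's is_classifier and is_regressor helpers. The shortened candidates list below is a hypothetical stand-in for the models list above.

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical shortened list mixing classifiers and a regressor.
candidates = [
    ("SVM", SVC()),
    ("LR", LogisticRegression()),
    ("CART", DecisionTreeClassifier()),
    ("LinearRegression", LinearRegression()),
]

# Only classifiers should be scored with accuracy.
classifiers = [name for name, model in candidates if is_classifier(model)]
regressors = [name for name, model in candidates if is_regressor(model)]

print("classifiers:", classifiers)
print("regressors:", regressors)
```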
A last important note: it may sound legitimate for someone to claim:
OK, but I want to use linear regression and then just round/threshold the outputs, effectively treating the predictions as "probabilities" and thus converting the model into a classifier
Actually, this has already been suggested in several other answers here, implicitly or not; again, this is an invalid approach (and the fact that you have negative predictions should have already alerted you that they cannot be interpreted as probabilities). Andrew Ng, in his popular Machine Learning course at Coursera, explains why this is a bad idea: see his Lecture 6.1 - Logistic Regression | Classification on YouTube (the explanation starts at ~3:00), as well as section 4.2, Why Not Linear Regression [for classification]?, of the (highly recommended and freely available) textbook An Introduction to Statistical Learning by Hastie, Tibshirani and coworkers.
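The point about negative predictions can be checked against the predicted values posted in the question. A small NumPy-only sketch (the variable names are illustrative):

```python
import numpy as np

# Predicted values copied from the question.
y_pred = np.array([
    0.07094605, 0.1994941, 0.19270157, 0.13379635, 0.04654469,
    0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
    0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
    0.25390082, 0.17248324,
])

# A probability must lie in [0, 1]; count the values that violate this.
n_negative = int((y_pred < 0).sum())
lowest = float(y_pred.min())
all_in_unit_interval = bool(((y_pred >= 0) & (y_pred <= 1)).all())

print(n_negative, lowest, all_in_unit_interval)
```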
Answer by Amey Yadav
accuracy_score is a classification metric; you cannot use it for a regression problem.
Answer by MLKing
The documentation of sklearn.metrics.accuracy_score(y_true, y_pred) defines y_pred as:
y_pred: 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.
This means y_pred has to be an array of predicted labels (here 1's and 0's). They should not be probabilities, and they cannot be the continuous outputs of a regressor.
Predicted labels (1's and 0's) and predicted probabilities can be generated with the predict() and predict_proba() methods of a classifier such as LogisticRegression(). Note that LinearRegression() has no predict_proba() method, and its predict() returns continuous values rather than labels.
1. Generate predicted labels:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(X_train, y_train)  # X_train, y_train: your own training data (assumed to exist)
y_preds = LR.predict(X_test)
print(y_preds)
output:
[1 1 0 1]
y_preds can now be used with accuracy_score(): accuracy_score(y_true, y_preds)
2. Generate probabilities for labels:
Some metrics, such as precision_recall_curve(y_true, probas_pred), require probabilities, which can be generated as follows:
LR = LogisticRegression()
LR.fit(X_train, y_train)  # X_train, y_train: your own training data (assumed to exist)
y_probs = LR.predict_proba(X_test)[:, 1]  # probability of class 1
print(y_probs)
output:
[0.87812372 0.77490434 0.30319547 0.84999743]
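For a version that runs end to end, here is a hedged sketch on made-up toy data (the data, the 80/20 split and all variable names are assumptions for illustration). It uses a classifier, LogisticRegression, since that is the kind of estimator that provides both predict() and predict_proba().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_curve

# Hypothetical toy data: two features, binary labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

clf = LogisticRegression().fit(X_train, y_train)

# Labels (0/1) for label-based metrics such as accuracy.
y_labels = clf.predict(X_test)
acc = accuracy_score(y_test, y_labels)

# Probabilities of class 1 for score-based metrics.
y_probs = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)

print("accuracy:", acc)
print("first probabilities:", y_probs[:4])
```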
Answer by JohnnyQ
The problem is that the true y is binary (zeros and ones), while your predictions are not. You probably generated probabilities and not predictions, hence the error :) Try generating class memberships (labels) instead, and it should work!
Answer by Het Thummar
Just use
y_pred = (y_pred > 0.5)
accuracy_score(y_true, y_pred, normalize=False)
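Applied to the data from the question, thresholding at 0.5 makes the error go away, but it is worth looking at the per-class breakdown of the result. The sketch below computes the confusion counts by hand with NumPy; the names tp, fp, fn and tn are illustrative.

```python
import numpy as np

# Data copied from the question.
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([
    0.07094605, 0.1994941, 0.19270157, 0.13379635, 0.04654469,
    0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
    0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
    0.25390082, 0.17248324,
])

predicted_positive = y_pred > 0.5
actual_positive = y_true == 1

# Confusion counts for the thresholded predictions.
tp = int((predicted_positive & actual_positive).sum())
fp = int((predicted_positive & ~actual_positive).sum())
fn = int((~predicted_positive & actual_positive).sum())
tn = int((~predicted_positive & ~actual_positive).sum())

print("TP", tp, "FP", fp, "FN", fn, "TN", tn)
```

All six positive samples are missed, which is consistent with the warnings in the accepted answer and in desertnaut's answer about using a linear regression as a classifier.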
Answer by Sreenath Nukala
The error occurs because of a difference in the data types of y_pred and y_true. y_true might be a DataFrame while y_pred is an array or list. If you convert both to arrays, the issue will be resolved.