Scikit-learn is returning coefficient of determination (R^2) values less than -1

Disclaimer: this page is a translated copy of a popular StackOverflow question and answer thread, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/23036866/


Tags: python, statistics, scikit-learn

Asked by rhombidodecahedron

I'm doing a simple linear model. I have

from sklearn import linear_model, cross_validation  # sklearn.model_selection in newer releases

fire = load_data()  # load_data() is the asker's own helper that returns the dataset
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)

which yields

[  0.00000000e+00   0.00000000e+00  -8.27299054e+02  -5.80431382e+00
  -1.04444147e-01  -1.19367785e+00  -1.24843536e+00  -3.39950443e-01
   1.95018287e-02  -9.73940970e-02]

How is this possible? When I do the same thing with the built-in diabetes data, it works perfectly fine, but for my data it returns these seemingly absurd results. Have I done something wrong?

Accepted answer by eickenberg

There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the doc. You can see r^2 as a comparison of your model fit (in the context of linear regression, e.g. a model of order 1, i.e. affine) to a model of order 0 (just fitting a constant), both obtained by minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can induce a much higher squared error in your prediction than you would get by just predicting the mean of the test data, which results in a negative r^2 score.

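To make this concrete, here is a minimal sketch (an illustration added here, not part of the original answer) using sklearn.metrics.r2_score: a constant model that predicts the training-fold mean already scores a strongly negative r^2 on a test fold whose mean is shifted.

import numpy as np
from sklearn.metrics import r2_score

# toy folds with made-up numbers: the training fold is centered at 0,
# the test fold is centered at 5
y_train = np.array([-1.0, 0.0, 1.0])
y_test = np.array([4.0, 5.0, 6.0])

# a "model" that always predicts the training mean
y_pred = np.full_like(y_test, y_train.mean())

print(r2_score(y_test, y_pred))  # -37.5, far worse than predicting the test mean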

In the worst case, if your data do not explain your target at all, these scores can become strongly negative. Try

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)  # y has nothing to do with X whatsoever
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

This should result in negative r^2 values.

In [23]: scores
Out[23]: 
array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
    -64.14367035])

The important question now is whether this is due to the fact that linear models just do not find anything in your data, or to something else that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by concatenating a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline. Next you may want to try Ridge regression.

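A minimal sketch of that suggestion (written against the modern sklearn.model_selection / sklearn.pipeline module paths, with a synthetic dataset standing in for the asker's data):

from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression data standing in for fire.data / fire.target
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# scale each column to mean 0 / variance 1, then fit a regularized linear model
model = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0)),
])

scores = cross_val_score(model, X, y, cv=10, scoring='r2')
print(scores.mean())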

Answered by Fred Foo

R^2 = 1 - RSS/TSS, where RSS is the residual sum of squares ∑(y - f(x))^2 and TSS is the total sum of squares ∑(y - mean(y))^2. Now for R^2 ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:

>>> import numpy as np
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581

Answered by mgoldwasser

Just because R^2 can be negative does not mean it should be.

Possibility 1: a bug in your code.

A common bug that you should double-check is passing the parameters in the wrong order:

from sklearn.metrics import r2_score

r2_score(y_true, y_pred)  # Correct!
r2_score(y_pred, y_true)  # Incorrect!!!!

Possibility 2: small datasets

If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_validation.cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example) then you might build models on each fold that are not predictive for the other folds.

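As a sketch of that fix (written against the modern sklearn.model_selection module path; the toy data is only illustrative), you can pass a KFold splitter with shuffle=True instead of relying on the default contiguous folds:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)  # toy data standing in for a real dataset

# shuffle rows across folds so that (e.g.) date-sorted samples
# are not concentrated in a single fold
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print(scores)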

Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross_validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should be of a shape where

m > n^2

and when you are using cross validation with f as the number of folds, you should aim for

m/f > n^2
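
For example (an illustrative calculation, not part of the original answer): with n = 10 features, m > n^2 calls for more than 100 samples, and with f = 5 folds, m/f > n^2 raises that to more than 500 samples.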

Answered by Alexus Wong

If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. A simple check, but it'll save you some headaches.

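A minimal sketch of that check (the DataFrame and column names here are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# hypothetical frame: 'id' is just a row identifier, 'target' is the regression target
df = pd.DataFrame({
    'id':      [1, 2, 3, 4, 5, 6],
    'feature': [0.1, 0.4, 0.35, 0.8, 0.65, 0.9],
    'target':  [1.0, 2.1, 1.9, 3.2, 2.8, 3.5],
})

# drop the identifier before fitting; it carries no real signal
X = df.drop(columns=['id', 'target'])
y = df['target']
scores = cross_val_score(LinearRegression(), X, y, cv=2, scoring='r2')
print(scores)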