Scikit-learn is returning coefficient of determination (R^2) values less than -1

Disclaimer: this page is a translated copy of a popular StackOverflow question and answer thread, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/23036866/


Tags: python, statistics, scikit-learn

Asked by rhombidodecahedron

I'm doing a simple linear model. I have

from sklearn import linear_model, cross_validation  # sklearn.model_selection in newer releases

fire = load_data()  # load_data() is the asker's own helper that returns the dataset
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)

which yields

[  0.00000000e+00   0.00000000e+00  -8.27299054e+02  -5.80431382e+00
  -1.04444147e-01  -1.19367785e+00  -1.24843536e+00  -3.39950443e-01
   1.95018287e-02  -9.73940970e-02]

How is this possible? When I do the same thing with the built-in diabetes data, it works perfectly fine, but for my data it returns these seemingly absurd results. Have I done something wrong?

Accepted answer by eickenberg

There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the doc. You can see r^2 as a comparison of your model fit (in the context of linear regression, e.g. a model of order 1, i.e. affine) to a model of order 0 (just fitting a constant), both obtained by minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can induce a much higher squared error in your prediction than you would get by just predicting the mean of the test data, which results in a negative r^2 score.

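To make this concrete, here is a minimal sketch (an illustration added here, not part of the original answer) using sklearn.metrics.r2_score: a constant model that predicts the training-fold mean already scores a strongly negative r^2 on a test fold whose mean is shifted.

import numpy as np
from sklearn.metrics import r2_score

# toy folds with made-up numbers: the training fold is centered at 0,
# the test fold is centered at 5
y_train = np.array([-1.0, 0.0, 1.0])
y_test = np.array([4.0, 5.0, 6.0])

# a "model" that always predicts the training mean
y_pred = np.full_like(y_test, y_train.mean())

print(r2_score(y_test, y_pred))  # -37.5, far worse than predicting the test mean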

In the worst case, if your data do not explain your target at all, these scores can become strongly negative. Try

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)  # y has nothing to do with X whatsoever
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

This should result in negative r^2 values.

In [23]: scores
Out[23]: 
array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
    -64.14367035])

The important question now is whether this is due to the fact that linear models just do not find anything in your data, or to something else that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by concatenating a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline. Next you may want to try Ridge regression.

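A minimal sketch of that suggestion (written against the modern sklearn.model_selection / sklearn.pipeline module paths, with a synthetic dataset standing in for the asker's data):

from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression data standing in for fire.data / fire.target
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# scale each column to mean 0 / variance 1, then fit a regularized linear model
model = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0)),
])

scores = cross_val_score(model, X, y, cv=10, scoring='r2')
print(scores.mean())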

Answered by Fred Foo

R^2 = 1 - RSS/TSS, where RSS is the residual sum of squares ∑(y - f(x))^2 and TSS is the total sum of squares ∑(y - mean(y))^2. Now for R^2 ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:

>>> import numpy as np
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581

Answered by mgoldwasser

Just because R^2 can be negative does not mean it should be.

Possibility 1: a bug in your code.

A common bug that you should double-check is passing the parameters in the wrong order:

from sklearn.metrics import r2_score

r2_score(y_true, y_pred)  # Correct!
r2_score(y_pred, y_true)  # Incorrect!!!!

Possibility 2: small datasets

If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_validation.cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example) then you might build models on each fold that are not predictive for the other folds.

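As a sketch of that fix (written against the modern sklearn.model_selection module path; the toy data is only illustrative), you can pass a KFold splitter with shuffle=True instead of relying on the default contiguous folds:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)  # toy data standing in for a real dataset

# shuffle rows across folds so that (e.g.) date-sorted samples
# are not concentrated in a single fold
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print(scores)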

Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross_validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should be of a shape where

m > n^2

and when you are using cross validation with f as the number of folds, you should aim for

m/f > n^2
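
For example (an illustrative calculation, not part of the original answer): with n = 10 features, m > n^2 calls for more than 100 samples, and with f = 5 folds, m/f > n^2 raises that to more than 500 samples.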

Answered by Alexus Wong

If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. A simple check, but it'll save you some headaches.

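A minimal sketch of that check (the DataFrame and column names here are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# hypothetical frame: 'id' is just a row identifier, 'target' is the regression target
df = pd.DataFrame({
    'id':      [1, 2, 3, 4, 5, 6],
    'feature': [0.1, 0.4, 0.35, 0.8, 0.65, 0.9],
    'target':  [1.0, 2.1, 1.9, 3.2, 2.8, 3.5],
})

# drop the identifier before fitting; it carries no real signal
X = df.drop(columns=['id', 'target'])
y = df['target']
scores = cross_val_score(LinearRegression(), X, y, cv=2, scoring='r2')
print(scores)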