Original question: http://stackoverflow.com/questions/43613443/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Difference between cross_val_score and cross_val_predict
Asked by Bobipuegi
I want to evaluate a regression model built with scikit-learn using cross-validation, and I am getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use.
One option would be:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

cvs = DecisionTreeRegressor(max_depth=depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Another option is to use the cv predictions with the standard r2_score:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

cvp = DecisionTreeRegressor(max_depth=depth)
predictions = cross_val_predict(cvp, predictors, target, cv=cvfolds)
print("CV R^2-Score: {}".format(r2_score(target, predictions)))
I would assume that both methods are valid and give similar results. But that is only the case with small numbers of folds. While the R^2 is roughly the same for 10-fold CV, it gets increasingly lower for higher k values in the case of the first version using cross_val_score. The second version is mostly unaffected by changing the number of folds.
Is this behavior to be expected, and do I lack some understanding regarding CV in scikit-learn?
Answered by Vivek Kumar
cross_val_score returns the score of each test fold, whereas cross_val_predict returns the predicted y values for the test fold.
For cross_val_score(), you are using the average of the output, which will be affected by the number of folds, because some folds may have a high error (not fit correctly).
cross_val_predict(), on the other hand, returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. (Note that only cross-validation strategies that assign all elements to a test set exactly once can be used.) So increasing the number of folds only increases the training data for each test element, and hence its result may not be affected much.
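To make this concrete, here is a minimal sketch (synthetic data; all variable names are hypothetical, not from the question) of what the two functions effectively compute per fold:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic regression data, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.rand(100) * 0.1

fold_scores = []
pooled_preds = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = DecisionTreeRegressor(max_depth=3).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_scores.append(r2_score(y[test_idx], preds))  # per-fold score, like cross_val_score
    pooled_preds[test_idx] = preds                    # one prediction per sample, like cross_val_predict

print(np.mean(fold_scores))       # average of per-fold R2 scores
print(r2_score(y, pooled_preds))  # R2 of the pooled predictions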
Hope this helps. Feel free to ask if you have any doubts.
Edit: Answering the question in the comments
Please have a look at the following answer on how cross_val_predict works:
I think that cross_val_predict will overfit, because as the number of folds increases, more data is used for training and less for testing, so the resulting labels depend more on the training data. Also, as mentioned above, the prediction for each sample is made only once, so it may be more susceptible to how the data is split. That's why most places and tutorials recommend using cross_val_score for analysis.
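As a hedged sketch of that susceptibility (synthetic data, hypothetical names): since each sample is predicted exactly once, by one fold's model, the output of cross_val_predict changes with the split:

import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.rand(200)

model = DecisionTreeRegressor(max_depth=3)
p1 = cross_val_predict(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
p2 = cross_val_predict(model, X, y, cv=KFold(5, shuffle=True, random_state=1))
print(np.abs(p1 - p2).mean())  # generally nonzero: the predictions depend on the split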
Answered by The Data Scientician
I think the difference can be made clear by inspecting their outputs. Consider this snippet:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

# Last column is the label
print(X.shape)  # (7040, 133)
clf = MLPClassifier()
scores = cross_val_score(clf, X[:,:-1], X[:,-1], cv=5)
print(scores.shape)  # (5,)
y_pred = cross_val_predict(clf, X[:,:-1], X[:,-1], cv=5)
print(y_pred.shape)  # (7040,)
Notice the shapes: why are they like that?
scores has length 5 because it is a score computed with cross-validation over 5 folds (see the argument cv=5). Therefore, a single real value is computed for each fold. That value is the score of the classifier:
given the true labels and the predicted labels, how many answers did the predictor get right in a particular fold?
In this case, the y labels given as input are used twice: to learn from the data and to evaluate the performance of the classifier.
On the other hand, y_pred has length 7040, which is the number of samples in the dataset, i.e. the length of the input dataset. This means that each value is not a score computed from multiple values, but a single value: the prediction of the classifier:
given the input data and their labels, what is the prediction of the classifier on a specific example that was in the test set of a particular fold?
Note that you do not know which fold was used: each output was computed on the test data of a certain fold, but you can't tell which one (from this output, at least).
In this case, the labels are used just once: to train the classifier. It's your job to compare these outputs to the true outputs to compute the score. If you just average them, as you did, the output is not a score, it's just the average prediction.
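For instance, continuing the snippet above (a sketch; accuracy_score is just one possible metric here, not something the answer prescribes):

from sklearn.metrics import accuracy_score

# One pooled score over all 7040 predictions, not an average of per-fold scores
print(accuracy_score(X[:,-1], y_pred))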
Answered by Kirgsn
So this question also bugged me, and while the others made good points, they didn't answer all aspects of the OP's question.
The true answer is: the divergence in scores for increasing k is due to the chosen metric, R2 (coefficient of determination). For MSE, MSLE, or MAE, for example, there won't be any difference between using cross_val_score and cross_val_predict.
See the definition of R2:
R^2 = 1 - (MSE(ground truth, prediction) / MSE(ground truth, mean(ground truth)))
The second term, MSE(ground truth, mean(ground truth)), explains why the score starts to differ for increasing k: the more splits we have, the fewer samples there are in each test fold, and the higher the variance in the mean of the test fold. Conversely, for small k, the mean of the test fold won't differ much from the mean of the full ground truth, as the sample size is still large enough to keep the variance small.
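A short sketch of that variance effect (synthetic numbers, hypothetical names), separate from the proof below:

import numpy as np

rng = np.random.RandomState(42)
y = rng.rand(1000) * 20  # stand-in for the ground truth
for k in (5, 50, 250):
    fold_means = [fold.mean() for fold in np.split(y, k)]
    print(k, np.std(fold_means))  # the spread of fold means grows as folds shrink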
Proof:
import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle, r2_score

predictions = np.random.rand(1000) * 100
groundtruth = np.random.rand(1000) * 20

def scores_for_increasing_k(score_func):
    skewed_score = score_func(groundtruth, predictions)
    print(f'skewed score (from cross_val_predict): {skewed_score}')
    for k in (2, 4, 5, 10, 20, 50, 100, 200, 250):
        fold_preds = np.split(predictions, k)
        fold_gtruth = np.split(groundtruth, k)
        correct_score = np.mean([score_func(g, p)
                                 for g, p in zip(fold_gtruth, fold_preds)])
        print(f'correct CV for k={k}: {correct_score}')

for name, score in [('MAE', mae), ('MSLE', msle), ('R2', r2_score)]:
    print(name)
    scores_for_increasing_k(score)
    print()
Output will be:
MAE
skewed score (from cross_val_predict): 42.25333901481263
correct CV for k=2: 42.25333901481264
correct CV for k=4: 42.25333901481264
correct CV for k=5: 42.25333901481264
correct CV for k=10: 42.25333901481264
correct CV for k=20: 42.25333901481264
correct CV for k=50: 42.25333901481264
correct CV for k=100: 42.25333901481264
correct CV for k=200: 42.25333901481264
correct CV for k=250: 42.25333901481264
MSLE
skewed score (from cross_val_predict): 3.5252449697327175
correct CV for k=2: 3.525244969732718
correct CV for k=4: 3.525244969732718
correct CV for k=5: 3.525244969732718
correct CV for k=10: 3.525244969732718
correct CV for k=20: 3.525244969732718
correct CV for k=50: 3.5252449697327175
correct CV for k=100: 3.5252449697327175
correct CV for k=200: 3.5252449697327175
correct CV for k=250: 3.5252449697327175
R2
skewed score (from cross_val_predict): -74.5910282783694
correct CV for k=2: -74.63582817089443
correct CV for k=4: -74.73848598638291
correct CV for k=5: -75.06145142821893
correct CV for k=10: -75.38967601572112
correct CV for k=20: -77.20560102267272
correct CV for k=50: -81.28604960074824
correct CV for k=100: -95.1061197684949
correct CV for k=200: -144.90258384605787
correct CV for k=250: -210.13375041871123
Of course, there is another effect, not shown here, which was mentioned by others: with increasing k, more models are trained on more samples and validated on fewer samples, which will affect the final scores, but this is not induced by the choice between cross_val_score and cross_val_predict.