Python: Difference between using train_test_split and cross_val_score in sklearn.cross_validation

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30364255/


Tags: python, scikit-learn, cross-validation

Asked by evianpring

I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run a random forest on the dataset, using cross-validation. I do this in two ways:

  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I get different results even though I am doing what I think is pretty much the exact same thing. To illustrate, I run a two-fold cross-validation using the two methods above, as in the code below.

import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:19]   # columns 0-18 are the 19 feature columns
y = data.iloc[:, 19]     # column 19 (the 20th) holds the 0/1 labels

depth = 5
maxFeat = 3 

result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

result
# result is now something like array([ 0.66773295,  0.58824739])

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                          max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)      # per-class probabilities
auc = roc_auc_score(ytest, prediction[:, 1])   # score the positive class
print(auc)    # something like 0.83

RFModel.fit(xtest, ytest)                      # swap: train on the other half
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1])
print(auc)    # also something like 0.83

My question is:

why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say, 10), there appears to be some kind of pattern in my results, with the first fold always giving me the highest AUC.

In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.

Thanks for your help!

Accepted answer by KCzar

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
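
For example, a minimal sketch against the old sklearn.cross_validation API used in this question (it assumes a scikit-learn version where StratifiedKFold supports shuffle; in 0.18+ these classes live in sklearn.model_selection and take the data in split() rather than the constructor):

from sklearn import ensemble
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# Stratified folds keep the 0/1 label ratio roughly equal in each fold;
# shuffle=True randomizes the row order before the folds are cut.
clf = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5, max_features=3)
cv = StratifiedKFold(y, n_folds=2, shuffle=True, random_state=42)
result = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)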

By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has a random_state parameter:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
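
Something like this (a sketch; the old-style KFold takes the number of samples rather than the labels, and shuffling is off unless you turn it on):

from sklearn.cross_validation import KFold

# shuffle=True plus a fixed random_state gives random but reproducible folds.
cv = KFold(len(y), n_folds=2, shuffle=True, random_state=42)
result = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)  # clf as in the sketch above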

So does train_test_split, which does randomize by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
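
For instance, pinning random_state makes the (already shuffled) split reproducible:

from sklearn.cross_validation import train_test_split

# The rows are shuffled before splitting; random_state fixes the shuffle.
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50, random_state=42)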

Patterns like what you described are usually the result of a lack of randomness in the train/test sets.
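
One way to rule this out is to shuffle the rows once, up front (a sketch using a common pandas/numpy idiom, assuming the same data/X/y names as in the question):

import numpy as np

# Reorder all rows randomly, then rebuild X and y, so every
# downstream split sees the data in random order.
data = data.reindex(np.random.permutation(data.index)).reset_index(drop=True)
X = data.iloc[:, 0:19]
y = data.iloc[:, 19]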
