Python: Difference between using train_test_split and cross_val_score in sklearn.cross_validation

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30364255/


Tags: python, scikit-learn, cross-validation

Asked by evianpring

I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run a random forest on the dataset, using cross-validation. I do this in two ways:

  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I get different results even though I am doing what I think is pretty much the exact same thing. To illustrate, I run a two-fold cross-validation using the two methods above, as in the code below.

import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:19]   # columns 0-18 are the 19 feature columns
y = data.iloc[:, 19]     # column 19 (the 20th) holds the 0/1 labels

depth = 5
maxFeat = 3 

result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

result
# result is now something like array([ 0.66773295,  0.58824739])

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                          max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)      # per-class probabilities
auc = roc_auc_score(ytest, prediction[:, 1])   # score the positive class
print(auc)    # something like 0.83

RFModel.fit(xtest, ytest)                      # swap: train on the other half
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1])
print(auc)    # also something like 0.83

My question is:

why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say, 10), there appears to be some kind of pattern in my results, with the first fold always giving me the highest AUC.

In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.

Thanks for your help!

Accepted answer by KCzar

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
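
For example, a minimal sketch against the old sklearn.cross_validation API used in this question (it assumes a scikit-learn version where StratifiedKFold supports shuffle; in 0.18+ these classes live in sklearn.model_selection and take the data in split() rather than the constructor):

from sklearn import ensemble
from sklearn.cross_validation import StratifiedKFold, cross_val_score

# Stratified folds keep the 0/1 label ratio roughly equal in each fold;
# shuffle=True randomizes the row order before the folds are cut.
clf = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5, max_features=3)
cv = StratifiedKFold(y, n_folds=2, shuffle=True, random_state=42)
result = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)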

By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has a random_state parameter:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
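
Something like this (a sketch; the old-style KFold takes the number of samples rather than the labels, and shuffling is off unless you turn it on):

from sklearn.cross_validation import KFold

# shuffle=True plus a fixed random_state gives random but reproducible folds.
cv = KFold(len(y), n_folds=2, shuffle=True, random_state=42)
result = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)  # clf as in the sketch above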

So does train_test_split, which does randomize by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
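
For instance, pinning random_state makes the (already shuffled) split reproducible:

from sklearn.cross_validation import train_test_split

# The rows are shuffled before splitting; random_state fixes the shuffle.
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50, random_state=42)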

Patterns like what you described are usually the result of a lack of randomness in the train/test sets.
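
One way to rule this out is to shuffle the rows once, up front (a sketch using a common pandas/numpy idiom, assuming the same data/X/y names as in the question):

import numpy as np

# Reorder all rows randomly, then rebuild X and y, so every
# downstream split sees the data in random order.
data = data.reindex(np.random.permutation(data.index)).reset_index(drop=True)
X = data.iloc[:, 0:19]
y = data.iloc[:, 19]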
