Original source: http://stackoverflow.com/questions/39748660/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to perform k-fold cross validation with tensorflow?
Asked by mommomonthewind
I am following the IRIS example of tensorflow.
My situation is that all my data is in a single CSV file, not separated into training and test sets, and I want to apply k-fold cross validation to that data.
I have
data_set = tf.contrib.learn.datasets.base.load_csv(filename="mydata.csv",
target_dtype=np.int)
How can I perform k-fold cross validation on this dataset with a multi-layer neural network, as in the IRIS example?
Answered by Dan Reia
I know this question is old, but in case someone is looking to do something similar, here is an expansion of ahmedhosny's answer:
The new tensorflow datasets API can create dataset objects from Python generators, so together with scikit-learn's KFold, one option is to create a dataset from the KFold.split() generator:
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold
import tensorflow as tf
import tensorflow.contrib.eager as tfe

# Eager execution via tf.contrib.eager is a TensorFlow 1.x API.
tf.enable_eager_execution()

from sklearn.datasets import load_iris
data = load_iris()
X = data['data']
y = data['target']

def make_dataset(X_data, y_data, n_splits):
    # Each dataset element is one fold: (X_train, y_train, X_test, y_test).
    def gen():
        for train_index, test_index in KFold(n_splits).split(X_data):
            X_train, X_test = X_data[train_index], X_data[test_index]
            y_train, y_test = y_data[train_index], y_data[test_index]
            yield X_train, y_train, X_test, y_test

    return tf.data.Dataset.from_generator(gen, (tf.float64, tf.float64, tf.float64, tf.float64))

dataset = make_dataset(X, y, 10)
Then one can iterate through the dataset either in graph-based tensorflow or using eager execution. Using eager execution:
for X_train, y_train, X_test, y_test in tfe.Iterator(dataset):
    ...  # train and evaluate the model on this fold
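The loop body is left open above. As a minimal, framework-free sketch of what usually goes there, the snippet below trains and scores once per fold and then averages the results. Everything in it is an assumption made for illustration: `kfold_splits` is a hypothetical pure-Python stand-in for the `gen()` generator, `majority_class_accuracy` is only a placeholder for training and evaluating the actual network, and the toy data is made up.

```python
from collections import Counter
from statistics import mean


def kfold_splits(X, y, n_splits):
    """Yield (X_train, y_train, X_test, y_test) for contiguous, unshuffled folds."""
    n = len(X)
    # The first n % n_splits folds get one extra sample.
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in sizes:
        test_idx = range(start, start + size)
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield ([X[i] for i in train_idx], [y[i] for i in train_idx],
               [X[i] for i in test_idx], [y[i] for i in test_idx])
        start += size


def majority_class_accuracy(X_train, y_train, X_test, y_test):
    """Placeholder model: always predict the most common training label."""
    predicted = Counter(y_train).most_common(1)[0][0]
    return sum(label == predicted for label in y_test) / len(y_test)


# Toy data (hypothetical): 10 samples, 2 classes.
X = [[float(i)] for i in range(10)]
y = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# One score per fold; the cross-validated estimate is their mean.
fold_scores = [majority_class_accuracy(*fold) for fold in kfold_splits(X, y, n_splits=5)]
cv_score = mean(fold_scores)
```

With a real network, the model would normally be re-created inside the loop so that no fold starts from weights already trained on another fold's test data.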
Answered by ahmedhosny
Neural networks are usually used with large datasets, where CV is typically not used because it is very expensive. In the case of IRIS (50 samples for each species), you probably do need it. Why not use scikit-learn with different random seeds to split your data into training and testing sets?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
for k in kfold:
- split the data differently by passing a different value to "random_state"
- train the net using _train
- test using _test
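The steps above can be sketched as a loop over seeds. This is a minimal pure-Python sketch, not scikit-learn's implementation: `random_split` is a hypothetical stand-in for `train_test_split`, the seed values are arbitrary, and the training and testing steps are left as comments.

```python
import math
import random


def random_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) for one seeded random split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same split
    n_test = math.ceil(test_size * n_samples)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


n_samples = 150          # IRIS has 50 samples for each of 3 species
seeds = [0, 1, 2, 3, 4]  # arbitrary; one split per seed

splits = []
for seed in seeds:
    train_idx, test_idx = random_split(n_samples, test_size=0.33, seed=seed)
    # learn the net using the rows in train_idx here
    # test using the rows in test_idx here
    splits.append((train_idx, test_idx))
```

Unlike a k-fold split, test sets drawn with different seeds can overlap, so some samples may be tested several times and others never.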
If you don't like the random seed and want a more structured k-fold split, you can use the following, taken from here.
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
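As the output shows, an unshuffled k-fold split takes the test folds as contiguous blocks. That matters for data stored in class order, such as IRIS. The sketch below illustrates the effect in pure Python under stated assumptions: `contiguous_folds` is a hypothetical helper rather than a scikit-learn function, and the seed is arbitrary. In scikit-learn itself, `KFold` accepts `shuffle=True` and `random_state` for the same purpose.

```python
import random
from collections import Counter


def contiguous_folds(indices, n_splits):
    """Split a list of indices into n_splits contiguous test folds."""
    n = len(indices)
    # The first n % n_splits folds get one extra sample.
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


# Labels ordered by class, as in IRIS: 50 samples for each of 3 species.
labels = [0] * 50 + [1] * 50 + [2] * 50
indices = list(range(len(labels)))

# Without shuffling, each test fold holds exactly one class.
plain = [Counter(labels[i] for i in fold) for fold in contiguous_folds(indices, 3)]

# Shuffling the indices first (seed chosen arbitrarily) mixes the classes.
shuffled_indices = indices[:]
random.Random(0).shuffle(shuffled_indices)
mixed = [Counter(labels[i] for i in fold) for fold in contiguous_folds(shuffled_indices, 3)]
```

In the unshuffled 3-fold case, the network would be tested on a species it never saw during training, which is why shuffling is usually worth considering for ordered data.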