What does KFold in python exactly do?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/36063014/

Date: 2020-08-19 17:21:11  Source: igfitidea

Tags: python, cross-validation, kaggle

Asked by user

I am looking at this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle

I got to part 9, making predictions. There, some data in a dataframe called titanic is divided into folds using:

# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

I am not sure what exactly it is doing and what kind of object kf is. I tried reading the documentation, but it did not help much. Also, there are three folds (n_folds=3), so why does the code later access only train and test (and how do I know they are called train and test) in this line?

for train, test in kf:

Answered by qmaruf

KFold will provide train/test indices to split data into train and test sets. It will split the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once, while the k - 1 remaining folds form the training set (source).

Let's say you have data indices from 1 to 10. If you use n_folds=k, then in the i-th iteration (i <= k) you get the i-th fold as test indices and the remaining (k-1) folds (everything except that i-th fold) together as train indices.

An example

import numpy as np
from sklearn.cross_validation import KFold

x = [1,2,3,4,5,6,7,8,9,10,11,12]
kf = KFold(12, n_folds=3)

for train_index, test_index in kf:
    print (train_index, test_index)

Output

Fold 1: [ 4 5 6 7 8 9 10 11] [0 1 2 3]

Fold 2: [ 0 1 2 3 8 9 10 11] [4 5 6 7]

Fold 3: [0 1 2 3 4 5 6 7] [ 8 9 10 11]

Import Update for sklearn 0.20:

The KFold class was moved to the sklearn.model_selection module in version 0.20. To import KFold in sklearn 0.20+, use from sklearn.model_selection import KFold. KFold current documentation source

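With the post-0.20 import, the same example looks like this; note that the number of samples is no longer passed to the constructor, and the index pairs now come from the split() method:

```python
import numpy as np
from sklearn.model_selection import KFold  # sklearn >= 0.20

x = np.arange(1, 13)  # 12 data points, as in the example above

# n_folds is now called n_splits, and the data is passed to split()
kf = KFold(n_splits=3)

for fold, (train_index, test_index) in enumerate(kf.split(x), start=1):
    print("Fold %d:" % fold, train_index, test_index)
```

Without shuffle=True this produces the same consecutive folds shown in the output above.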
Answered by vipin bansal

Sharing theoretical information about KF that I have learnt so far.

KFold is a model validation technique that does not use your pre-trained model. Rather, it uses the same hyper-parameters to train a new model on k-1 folds of the data and tests that model on the k-th fold.

K different models are just used for validation.

It returns K different scores (e.g. accuracy percentages), each based on one held-out test fold. We generally take their average to analyse the model.

We repeat this process with all the different models that we want to analyse. Brief Algo:

  1. Split the data into training and test parts.
  2. Train different models, say SVM, RF, and LR, on this training data.
   2.a Take the whole data set and divide it into K folds.
   2.b Create a new model with the hyper-parameters obtained from the training in step 2.
   2.c Fit the newly created model on K-1 folds.
   2.d Test it on the K-th fold.
   2.e Take the average score.
  3. Analyse the different average scores and select the best model out of SVM, RF, and LR.
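The steps above can be sketched with scikit-learn's cross_val_score, which refits a fresh clone of each model on every K-1 training split and scores it on the held-out fold. The synthetic dataset and model settings here are illustrative, not from the answer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Illustrative stand-in for real data
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=1),
    "LR": LogisticRegression(max_iter=1000),
}

# One average score per model (steps 2.a-2.e, repeated for step 3)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%s: mean accuracy = %.3f" % (name, scores.mean()))
```

The model whose average score is highest would then be selected in step 3.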

The reason for doing this is simple: we generally have limited data, and if we divide the whole data set into:

  1. Training
  2. Validation
  3. Testing

we may be left with a relatively small chunk of data, which may cause our model to overfit. It is also possible that some of the data is never used in training, so we never analyse the model's behaviour against it.

KFold overcomes both of these issues.

Answered by zz x

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in references to the method, such as k=10 becoming 10-fold cross-validation.

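For intuition, the fold-index generation can be sketched by hand in a few lines of Python. This is illustrative only; in practice use sklearn's KFold, which the sketch below is meant to mirror (including giving the first n_samples % k folds one extra sample when the sizes do not divide evenly):

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k consecutive, unshuffled folds.

    The first n_samples % k folds get one extra sample, mirroring how
    sklearn's KFold handles sizes that do not divide evenly.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size

# 10 samples, k=3 -> test folds of sizes 4, 3, 3
for train, test in kfold_indices(10, 3):
    print(train, test)
```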
You can refer to this post for more information. https://medium.com/@xzz201920/stratifiedkfold-v-s-kfold-v-s-stratifiedshufflesplit-ffcae5bfdf
