Pandas: How to do cross-validation without using scikit?

Question

Asked by jubins

I am trying to implement my own cross-validation function. I read about cross-validation on this link, and was able to split my dataset into training and test sets. However, how can I define the folds? For example, my data frame looks like this:

    Dataframe:
        MMC         MET_lep     MASS_Vis    Pt_H        Y
    0   138.70      51.65       97.82       0.91        0
    1   160.93      68.78       103.23      -999.00     0
    2   -999.00     162.17      125.95      -999.00     0
    3   143.90      81.41       80.94       -999.00     1
    4   175.86      16.91       134.80      -999.00     0
    5   -999.00     162.17      125.95      -999.00     0
    6   143.90      81.41       80.94       -999.00     1
    7   175.86      16.91       134.80      -999.00     0
    8   -999.00     162.17      125.95      -999.00     0
    9   143.90      81.41       80.94       -999.00     1

And I want output like this:

For K=3 (folds):

When K=1
Training:
            MMC         MET_lep     MASS_Vis    Pt_H        Y
        0   138.70      51.65       97.82       0.91        0
        1   160.93      68.78       103.23      -999.00     0
        2   -999.00     162.17      125.95      -999.00     0
        3   143.90      81.41       80.94       -999.00     1
        4   175.86      16.91       134.80      -999.00     0
        5   -999.00     162.17      125.95      -999.00     0
        6   143.90      81.41       80.94       -999.00     1
Test:
        7   175.86      16.91       134.80      -999.00     0
        8   -999.00     162.17      125.95      -999.00     0
        9   143.90      81.41       80.94       -999.00     1

When K=2
Training:
            MMC         MET_lep     MASS_Vis    Pt_H        Y
        0   138.70      51.65       97.82       0.91        0
        1   160.93      68.78       103.23      -999.00     0
        2   -999.00     162.17      125.95      -999.00     0
        6   143.90      81.41       80.94       -999.00     1
        7   175.86      16.91       134.80      -999.00     0
        8   -999.00     162.17      125.95      -999.00     0
        9   143.90      81.41       80.94       -999.00     1

Test:
        3   143.90      81.41       80.94       -999.00     1
        4   175.86      16.91       134.80      -999.00     0
        5   -999.00     162.17      125.95      -999.00     0

When K=3
Training:
            MMC         MET_lep     MASS_Vis    Pt_H        Y
        0   138.70      51.65       97.82       0.91        0
        1   160.93      68.78       103.23      -999.00     0
        2   -999.00     162.17      125.95      -999.00     0
        3   143.90      81.41       80.94       -999.00     1
        7   175.86      16.91       134.80      -999.00     0
        8   -999.00     162.17      125.95      -999.00     0
        9   143.90      81.41       80.94       -999.00     1
Test:
        4   175.86      16.91       134.80      -999.00     0
        5   -999.00     162.17      125.95      -999.00     0
        6   143.90      81.41       80.94       -999.00     1

Below is my code; it splits the data but does not create the folds:

import math

# Use the first 80% of the rows for training and the rest for testing.
split = math.floor(dataset.shape[0] * 0.8)
data_train = dataset[:split]
data_test = dataset[split:]

Thank you in advance for helping on this.

Answer 1

Answered by Michael Nelson

Do you intend the K=2 test fold (3,4,5) to overlap with the K=3 test fold (4,5,6)? Also, K seems to be overloaded in your example to mean both the number of folds and the index of the current fold. In my answer, I'll use i for the i-th fold out of k total folds.

Assuming the goal is to create non-overlapping folds, it should be sufficient to have a function that splits the range 0 to len(dataset) - 1 into roughly even sub-ranges. You can get a roughly even split, even when the length n is not perfectly divisible by k, by placing the fold boundaries at floor((n*i)/k). In Python you could use a function like this:

def fold_i_of_k(dataset, i, k):
    # Return the i-th of k folds (i is 1-based), cutting at floor(n*i/k).
    n = len(dataset)
    return dataset[n*(i-1)//k:n*i//k]

Here is an example on a one-dimensional dataset (it should work just as well for a DataFrame):

>>> fold_i_of_k(list(range(0,11)),1,3)
[0, 1, 2]
>>> fold_i_of_k(list(range(0,11)),2,3)
[3, 4, 5, 6]
>>> fold_i_of_k(list(range(0,11)),3,3)
[7, 8, 9, 10]
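
To get both the training and the test portion for each fold, the same boundaries can be applied to row positions. The following is a minimal sketch rather than part of the answer above: `train_test_indices` is a hypothetical helper name, and the positions it returns are meant to be passed to something like `dataset.iloc[...]`.

```python
def train_test_indices(n, i, k):
    """Return (train, test) row positions for the i-th of k folds (i is 1-based)."""
    start = n * (i - 1) // k
    stop = n * i // k
    test = list(range(start, stop))
    # Everything outside the test range becomes training data.
    train = list(range(0, start)) + list(range(stop, n))
    return train, test


n_rows = 10  # same number of rows as the example DataFrame in the question
splits = [train_test_indices(n_rows, i, 3) for i in (1, 2, 3)]
for i, (train, test) in enumerate(splits, start=1):
    print(f"fold {i}: train={train} test={test}")
```

Every row lands in exactly one test fold, so the folds do not overlap.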

Answer 2

Answered by Fabian D.

This solution is based on the pandas and numpy libraries:

import pandas as pd
import numpy as np

First, split your dataset into k parts:

k = 10
folds = np.array_split(data, k)
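
As a variation on the split above, the row positions can be split instead of the DataFrame itself and then used with `iloc`. This is only a sketch under stated assumptions: the toy `data` frame and the name `index_folds` are illustrative, not from the answer. Depending on the pandas version, passing a DataFrame straight to `np.array_split` may emit a deprecation warning, which splitting positions avoids.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the question's DataFrame (values are illustrative only).
data = pd.DataFrame({
    "MMC": np.arange(10, dtype=float),
    "Y": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

k = 3
# Split the row positions 0..n-1 into k nearly equal chunks.
index_folds = np.array_split(np.arange(len(data)), k)
# Materialize each fold as a DataFrame by positional lookup.
folds = [data.iloc[idx] for idx in index_folds]

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```

When the number of rows is not divisible by k, `np.array_split` makes the first folds one row larger than the rest.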

Then iterate over the folds, using one as the test set and the other k-1 as the training set, so that in the end you perform the fitting k times:

for i in range(k):
    train = folds.copy()  # work on a copy of the list of folds
    test = folds[i]
    del train[i]  # drop the test fold from the training folds
    train = pd.concat(train, sort=False)
    perform(clf, train.copy(), test.copy())  # do the fitting; copy because perform() pops the label column

The perform function, which must be defined before the loop runs, removes the label column from both sets, fits the scikit classifier (clf), and returns its score on the test set:

def perform(clf, train_set, test_set):
    # remove labels from data
    train_labels = train_set.pop('Y').values
    test_labels = test_set.pop('Y').values
    clf.fit(train_set, train_labels)
    return clf.score(test_set, test_labels)
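
Since the question asks for cross-validation without scikit, here is a hedged end-to-end sketch that needs only pandas and numpy. `MajorityClassifier` and `cross_validate` are hypothetical names introduced for illustration; the classifier is a trivial stand-in that mimics the `fit`/`score` interface used above, and the data is a two-column subset of the question's example.

```python
import numpy as np
import pandas as pd


class MajorityClassifier:
    """Hypothetical stand-in for clf: always predicts the most common training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def score(self, X, y):
        # Fraction of test labels equal to the majority label (accuracy).
        return float(np.mean(np.asarray(y) == self.label_))


def cross_validate(clf, data, k, label="Y"):
    """Fit and score clf once per fold; return the list of k scores."""
    scores = []
    for test_idx in np.array_split(np.arange(len(data)), k):
        test = data.iloc[test_idx]
        # Assumes a unique index, so dropping the test labels leaves the training rows.
        train = data.drop(data.index[test_idx])
        clf.fit(train.drop(columns=label), train[label].values)
        scores.append(clf.score(test.drop(columns=label), test[label].values))
    return scores


data = pd.DataFrame({
    "MMC": [138.70, 160.93, -999.00, 143.90, 175.86,
            -999.00, 143.90, 175.86, -999.00, 143.90],
    "Y": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})
scores = cross_validate(MajorityClassifier(), data, k=3)
print(scores, sum(scores) / len(scores))
```

Averaging the per-fold scores gives the usual cross-validation estimate. The rows here are taken in their original order; shuffling them first is a common additional step that this sketch leaves out.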
