Pandas: How can I do cross validation without using scikit?
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not this site): Stack Overflow
Original source: http://stackoverflow.com/questions/43442072/
Asked by jubins
I am trying to implement my own cross-validation function. I read about cross-validation on this link, and was able to split my dataset into training and test sets. However, how can I define the folds? For example, my data frame looks like this.
Dataframe:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
I want output like this:
For K=3 (Folds)
When K=1
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
Test:
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
When K=2
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
Test:
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
When K=3
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
Test:
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
Below is my code; it splits the data into training and test sets but does not create the folds:
import math

# Single 80/20 split by row position: first 80% for training, the rest for testing.
split = math.floor(dataset.shape[0]*0.8)
data_train = dataset[:split]
data_test = dataset[split:]
Thank you in advance for your help.
Answered by Michael Nelson
Is it your intention for the K=2 test fold (3,4,5) to overlap with the K=3 test fold (4,5,6)? Also, K seems to be overloaded in your example to mean both the number of folds and the index of the current fold. In my answer, I'll use i for the i-th fold out of k total folds.
Assuming the goal is to create non-overlapping folds, it should be sufficient to have a function that cuts the index range 0 to len(dataset) - 1 into k roughly even ranges. You can get a roughly even split even when the length n is not perfectly divisible by k by placing the fold boundaries at floor((n*i)/k). In Python you could use a function like this:
def fold_i_of_k(dataset, i, k):
    # i is 1-based: fold 1 starts at index 0 and fold k ends at index n.
    n = len(dataset)
    return dataset[n*(i-1)//k:n*i//k]
Here is an example on a one-dimensional data set (it should work just as well for a DataFrame):
>>> fold_i_of_k(list(range(0,11)),1,3)
[0, 1, 2]
>>> fold_i_of_k(list(range(0,11)),2,3)
[3, 4, 5, 6]
>>> fold_i_of_k(list(range(0,11)),3,3)
[7, 8, 9, 10]
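To get from a single fold to the training/test pairs the question asks for, one option is to take the fold as the test set and concatenate everything around it as the training set. The following is a minimal sketch, not part of the original answer: the helper name `train_test_for_fold` and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd


def train_test_for_fold(dataset, i, k):
    """Return (train, test) for the i-th of k folds, with i counted from 1."""
    n = len(dataset)
    start, stop = n * (i - 1) // k, n * i // k
    test = dataset.iloc[start:stop]
    # Rows before and after the test slice together form the training set.
    train = pd.concat([dataset.iloc[:start], dataset.iloc[stop:]])
    return train, test


# Hypothetical toy data with 10 rows, the same length as the question's example.
df = pd.DataFrame({"x": range(10), "Y": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]})

for i in range(1, 4):
    train, test = train_test_for_fold(df, i, 3)
    print(i, list(test.index), list(train.index))
```

Because the slices are positional and taken in order, the rows are not shuffled; if the data is sorted by label or time, shuffling it beforehand may be appropriate.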
Answered by Fabian D.
This solution is based on the pandas and NumPy libraries:
import pandas as pd
import numpy as np
First, you split your dataset into k parts:
k = 10
folds = np.array_split(data, k)
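Unlike np.split, np.array_split accepts a length that is not divisible by k; the leading folds then receive one extra row. The sketch below illustrates this on a hypothetical 10-row DataFrame (the toy data and the names `position_folds` and `fold_sizes` are assumptions, not from the original answer). It splits the row positions rather than the DataFrame itself, which keeps the example independent of how NumPy treats a DataFrame argument.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data`.
data = pd.DataFrame({"x": range(10), "Y": [0, 1] * 5})

k = 3
# Split the row positions 0..n-1 into k nearly equal chunks.
position_folds = np.array_split(np.arange(len(data)), k)
# Select the rows of each chunk by position.
folds = [data.iloc[pos] for pos in position_folds]

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```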
Then you iterate over the folds, using one as the test set and the other k-1 as the training set, so in the end you perform the fitting k times:
for i in range(k):
    train = folds.copy()  # work on a copy of the list of folds
    test = folds[i]
    del train[i]  # drop the test fold; the remaining k-1 folds are the training data
    train = pd.concat(train, sort=False)
    perform(clf, train.copy(), test.copy())  # do the fitting; copy again because perform() pops the label column
The perform function removes the label column from both sets, fits the scikit classifier (clf) on the training set, and then returns its score on the test set:
def perform(clf, train_set, test_set):
    # remove labels from data
    train_labels = train_set.pop('Y').values
    test_labels = test_set.pop('Y').values
    clf.fit(train_set, train_labels)
    return clf.score(test_set, test_labels)
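Since the question asks for cross-validation without scikit, the same loop can be run with any object that exposes fit and score methods. The following is a minimal sketch under stated assumptions: `MajorityClassifier` is a hypothetical stand-in model that always predicts the most common training label, `cross_validate` is an illustrative wrapper name, and the toy data is invented. It collects one score per fold so they can be averaged.

```python
import numpy as np
import pandas as pd


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def score(self, X, y):
        # Accuracy: fraction of test labels equal to the majority label.
        return float(np.mean(np.asarray(y) == self.label_))


def perform(clf, train_set, test_set):
    # Remove the label column 'Y' from both sets, fit, and score.
    train_labels = train_set.pop("Y").values
    test_labels = test_set.pop("Y").values
    clf.fit(train_set, train_labels)
    return clf.score(test_set, test_labels)


def cross_validate(clf, data, k):
    """Return one score per fold; assumes k >= 2."""
    positions = np.array_split(np.arange(len(data)), k)
    scores = []
    for i in range(k):
        test = data.iloc[positions[i]]
        # All positions except those of fold i form the training set.
        train_pos = np.concatenate(positions[:i] + positions[i + 1:])
        train = data.iloc[train_pos]
        scores.append(perform(clf, train.copy(), test.copy()))
    return scores


# Hypothetical toy data with the question's label column 'Y'.
data = pd.DataFrame({"x": range(10), "Y": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]})
scores = cross_validate(MajorityClassifier(), data, k=5)
print(scores, sum(scores) / len(scores))
```

Averaging the per-fold scores gives a single estimate of how the model performs on unseen data; the spread between folds hints at how stable that estimate is.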