Preparing CSV file data for Scikit-Learn using Pandas?

Question

Asked by KingPolygon

I have a csv file without headers which I'm importing into python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I go ahead and split this dataset into a training set and a testing set using pandas (80/20)?

Also, once that is done how would I also split each of those sets so that I can define x (all columns except the last one), and y (the last column)?

I've imported my file using:

import pandas as pd
dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

Answer 1

Answered by ayhan

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# for older versions import from sklearn.cross_validation
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
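
For a quick check of the call above, here is a minimal sketch on made-up data. The in-memory CSV text stands in for example.csv and is purely hypothetical, and stratify=y is an optional extra (not part of the answer) that keeps the class proportions the same in both sets.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for example.csv: 10 rows, 3 pixel columns + 1 class column
csv_text = "\n".join(f"{i},{i + 1},{i + 2},{i % 2}" for i in range(10))
dataset = pd.read_csv(io.StringIO(csv_text), header=None)

# Features are every column except the last; the target is the last column
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]

# stratify=y is an assumption added here: it preserves the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing, and the row indices of the two sets do not overlap.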

Answer 2

Answered by Kartik

You can simply do:

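import numpy as np
# Randomly pick 80% of the row labels (without replacement) for training
train_idx = np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, train_idx)  # np.isin replaces the older np.in1d
training = dataset[choices]
testing = dataset[~choices]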

Then, to pass it as x and y to Scikit-Learn:

scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])

Let me know if this doesn't work.
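
Since scikit_func above is only a placeholder, here is a rough sketch of what the call might look like with a real estimator. The random data and the choice of KNeighborsClassifier are assumptions for illustration, not something the answer specifies. Note that scikit-learn estimators take the features first and the target second in fit and score.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 50 rows, 4 "pixel" columns and a binary class in the last column
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(50, 4)))
dataset[4] = np.arange(50) % 2

# Same idea as the answer: mark a random 80% of the rows as training rows
train_idx = rng.choice(dataset.index.to_numpy(), int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, train_idx)
training, testing = dataset[choices], dataset[~choices]

# Fit on the training rows, then score on the held-out rows
model = KNeighborsClassifier(n_neighbors=3)
model.fit(training.iloc[:, :-1], training.iloc[:, -1])
accuracy = model.score(testing.iloc[:, :-1], testing.iloc[:, -1])
print(len(training), len(testing), accuracy)
```

Because the pixel values here are random noise, the accuracy itself is meaningless; the point is only the shape of the fit and score calls.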

Answer 3

Answered by Randhawa

You can try this.

Separating the target class from the rest:

# All columns except the last hold pixel values; the last column is the class
pixel_values = dataset[dataset.columns[:-1]]
target_class = dataset[dataset.columns[-1:]]

Now to create test and training samples:

现在创建测试和训练样本：

I would just use numpy's rand:

mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]

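Now you have training and test samples in train and test at roughly an 80:20 ratio. Each row is assigned at random, so the split is only approximately 80/20. Apply the same mask to target_class to get the matching labels.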
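
If an exact 80/20 split using only pandas is preferred, one alternative is DataFrame.sample followed by drop. This is a different technique from the random mask in this answer, shown here as a sketch on hypothetical data. Splitting the rows first and separating x and y afterwards keeps features and labels aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 pixel columns plus a class column (last)
dataset = pd.DataFrame(np.arange(400).reshape(100, 4))

# Exact 80/20 split: sample 80% of the rows, the remainder is the test set
train = dataset.sample(frac=0.8, random_state=1)
test = dataset.drop(train.index)

# Separate features (all but last column) and target (last column) per set
x_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
x_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
print(x_train.shape, x_test.shape)
```

The random_state value is arbitrary; fixing it just makes the split reproducible.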
