Original question: http://stackoverflow.com/questions/26107927/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to put my dataset in a .pkl file in the exact format and data structure used in "mnist.pkl.gz"?
Asked by John Krit
I'm trying to use the Theano library in Python to run some experiments with Deep Belief Networks. I use the code at this address: DBN full code. This code uses the MNIST handwritten digit database. The file is already in pickle format. It is unpickled into:
- train_set
- valid_set
- test_set
Each of these is further unpacked into:
- train_set_x, train_set_y = train_set
- valid_set_x, valid_set_y = valid_set
- test_set_x, test_set_y = test_set
Could someone please show me the code that builds this dataset, so that I can create my own? The DBN example I use needs the data in this format and I don't know how to produce it. If anyone has any ideas how to fix this, please tell me.
Here is my code:
from datetime import datetime
import time
import os
from pprint import pprint
import numpy as np
import gzip, cPickle
import theano.tensor as T
from theano import function
os.system("cls")
filename = "completeData.txt"
f = open(filename,"r")
X = []
Y = []
for line in f:
    line = line.strip('\n')
    b = line.split(';')
    b[0] = float(b[0])
    b[1] = float(b[1])
    b[2] = float(b[2])
    b[3] = float(b[3])
    b[4] = float(b[4])
    b[5] = float(b[5])
    b[6] = float(b[6])
    b[7] = float(b[7])
    b[8] = float(b[8])
    b[9] = float(b[9])
    b[10] = float(b[10])
    b[11] = float(b[11])
    b[12] = float(b[12])
    b[13] = float(b[13])
    b[14] = float(b[14])
    b[15] = float(b[15])
    b[17] = int(b[17])
    X.append(b[:16])
    Y.append(b[17])
Len = len(X);
X = np.asmatrix(X)
Y = np.asarray(Y)
sizes = [0.8, 0.1, 0.1]
arr_index = int(sizes[0]*Len)
arr_index2_start = arr_index + 1
arr_index2_end = arr_index + int(sizes[1]*Len)
arr_index3_start = arr_index2_start + 1
"""
train_set_x = np.array(X[:arr_index])
train_set_y = np.array(Y[:arr_index])
val_set_x = np.array(X[arr_index2_start:arr_index2_end])
val_set_y = np.array(Y[arr_index2_start:arr_index2_end])
test_set_x = np.array(X[arr_index3_start:])
test_set_y = np.array(X[arr_index3_start:])
train_set = train_set_x, train_set_y
val_set = val_set_x, val_set_y
test_set = test_set_x, test_set_y
"""
x = T.dmatrix('x')
z = x
t_mat = function([x],z)
y = T.dvector('y')
k = y
t_vec = function([y],k)
train_set_x = t_mat(X[:arr_index].T)
train_set_y = t_vec(Y[:arr_index])
val_set_x = t_mat(X[arr_index2_start:arr_index2_end].T)
val_set_y = t_vec(Y[arr_index2_start:arr_index2_end])
test_set_x = t_mat(X[arr_index3_start:].T)
test_set_y = t_vec(Y[arr_index3_start:])
train_set = train_set_x, train_set_y
val_set = val_set_x, val_set_y
test_set = test_set_x, test_set_y
dataset = [train_set, val_set, test_set]
f = gzip.open('..\..\..\data\dex.pkl.gz','wb')
cPickle.dump(dataset, f, protocol=-1)
f.close()
pprint(train_set_x.shape)
print('Finished\n')
Accepted answer by xagg
A .pkl file is not necessary to adapt code from the Theano tutorial to your own data. You only need to mimic their data structure.
Quick fix
Look for the following lines, at line 303 of DBN.py:
datasets = load_data(dataset)
train_set_x, train_set_y = datasets[0]
Replace them with your own train_set_x and train_set_y:
my_x = []
my_y = []
with open('path_to_file', 'r') as f:
    for line in f:
        my_list = line.split(' ') # replace with your own separator instead
        my_x.append(my_list[1:-1]) # omitting identifier in [0] and target in [-1]
        my_y.append(my_list[-1])
train_set_x = theano.shared(numpy.array(my_x, dtype='float64'))
train_set_y = theano.shared(numpy.array(my_y, dtype='float64'))
Adapt this to your input data and the code you're using.
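For a semicolon-separated file like the one in the question, the parsing loop could also be written with numpy.loadtxt. This is only a sketch: the helper name load_semicolon_rows, the column layout (16 feature columns, one unused column, label in column 17) and the in-memory sample data are assumptions made for illustration, not part of the original answer.

```python
import io

import numpy as np


def load_semicolon_rows(handle, n_features=16, label_col=17):
    """Read ';'-separated rows into a float64 feature matrix and an int32 label vector."""
    # Only the feature columns and the label column are read; others are skipped.
    cols = list(range(n_features)) + [label_col]
    table = np.loadtxt(handle, delimiter=";", usecols=cols, ndmin=2)
    features = table[:, :n_features].astype("float64")
    labels = table[:, n_features].astype("int32")
    return features, labels


# Hypothetical sample: three rows of 16 features, one unused column, then the label.
lines = [";".join(["0.25"] * 16 + ["0", str(label)]) for label in (1, 0, 2)]
sample = io.StringIO("\n".join(lines))

x, y = load_semicolon_rows(sample)
print(x.shape, y.tolist())
```

With a real file, a path such as the question's completeData.txt can be passed in place of the StringIO object.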
The same thing works for cA.py, dA.py and SdA.py, but they only use train_set_x.
Look for places such as n_ins=28 * 28 where the MNIST image size is hardcoded. Replace 28 * 28 with your own number of columns.
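Rather than typing the new size by hand, it could be read from the data. The sketch below uses placeholder arrays. The n_outs line is an extra assumption: it only applies if the script also hardcodes the number of output classes, and it presumes labels numbered 0 to K-1.

```python
import numpy as np

# Placeholder data standing in for your own: 100 rows, 16 feature columns.
train_x = np.zeros((100, 16), dtype="float64")
# Placeholder labels cycling through 0, 1, 2.
train_y = np.arange(100) % 3

# Number of input units = number of feature columns.
n_ins = train_x.shape[1]
# Number of classes, assuming labels run from 0 to K-1 without gaps.
n_outs = int(train_y.max()) + 1

print(n_ins, n_outs)
```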
Explanation
This is where you put your data in a format that Theano can work with.
train_set_x = theano.shared(numpy.array(my_x, dtype='float64'))
train_set_y = theano.shared(numpy.array(my_y, dtype='float64'))
shared() turns a numpy array into a Theano shared variable, the format designed for efficiency on GPUs.
dtype='float64' is expected in Theano arrays.
More details on basic tensor functionality.
.pkl file
The .pkl file is a way to save your data structure. You can create your own:
import cPickle
f = file('my_data.pkl', 'wb')
cPickle.dump((train_set_x, train_set_y), f, protocol=cPickle.HIGHEST_PROTOCOL)
f.close()
More details on loading and saving.
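To get closer to the layout the question asks about, the three (inputs, labels) pairs can be written as one gzipped pickle and read back with the same unpacking the question shows. This is a sketch under stated assumptions: it targets Python 3, where the pickle module replaces cPickle; the helper names and the zero-filled placeholder arrays are hypothetical; and protocol 2 is chosen only because it is the highest protocol Python 2 can read.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def save_mnist_style(path, train_set, valid_set, test_set):
    """Write three (inputs, labels) pairs as one gzipped pickle."""
    with gzip.open(path, "wb") as handle:
        pickle.dump((train_set, valid_set, test_set), handle, protocol=2)


def load_mnist_style(path):
    """Read the three (inputs, labels) pairs back."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def placeholder_pair(n_rows, n_cols=16):
    """Hypothetical stand-in data: zeros for inputs, label 0 for every row."""
    inputs = np.zeros((n_rows, n_cols), dtype="float32")
    labels = np.zeros(n_rows, dtype="int64")
    return inputs, labels


path = os.path.join(tempfile.mkdtemp(), "my_data.pkl.gz")
save_mnist_style(path, placeholder_pair(8), placeholder_pair(1), placeholder_pair(1))

train_set, valid_set, test_set = load_mnist_style(path)
train_set_x, train_set_y = train_set
print(train_set_x.shape, train_set_y.shape)
```

Plain numpy arrays are pickled here rather than Theano shared variables, so the file can be loaded without Theano and wrapped in shared() afterwards.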
Answered by anh_ng8
The pickled file represents a tuple of 3 lists: the training set, the validation set and the testing set, i.e. (train, val, test).
- Each of the three lists is a pair formed from a list of images and a list of class labels for each of the images.
- An image is represented as numpy 1-dimensional array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white).
- The labels are numbers between 0 and 9 indicating which digit the image represents.
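The structure described in this list can be reproduced for any feature matrix by slicing it into three contiguous parts. The following is a minimal sketch, not the tutorial's own code: the function name split_three_ways, the 80/10/10 fractions (taken from the question) and the random placeholder data are assumptions.

```python
import numpy as np


def split_three_ways(features, labels, fractions=(0.8, 0.1, 0.1)):
    """Return ((x, y), (x, y), (x, y)) for train, validation and test.

    The first two fractions set the cut points; the test set takes the rest.
    """
    n_rows = len(features)
    first_cut = int(fractions[0] * n_rows)
    second_cut = first_cut + int(fractions[1] * n_rows)
    # Slices are end-exclusive, so each part starts where the previous one ended.
    bounds = [(0, first_cut), (first_cut, second_cut), (second_cut, n_rows)]
    return tuple((features[a:b], labels[a:b]) for a, b in bounds)


# Placeholder data shaped like MNIST: 50 "images" of 784 values in [0, 1).
features = np.random.RandomState(0).rand(50, 784).astype("float32")
labels = np.arange(50) % 10

train_set, valid_set, test_set = split_three_ways(features, labels)
train_set_x, train_set_y = train_set
print(train_set_x.shape, valid_set[0].shape, test_set[0].shape)
```

Because each slice begins at the previous slice's end index, no row is dropped or duplicated between the three sets.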
Answered by sinhayash
This can help:
from PIL import Image
from numpy import genfromtxt
import gzip, cPickle
from glob import glob
import numpy as np
import pandas as pd
Data, y = dir_to_dataset("trainMNISTForm\*.BMP","trainLabels.csv")
# Data and labels are read
train_set_x = Data[:2093]
val_set_x = Data[2093:4187]
test_set_x = Data[4187:6281]
train_set_y = y[:2093]
val_set_y = y[2093:4187]
test_set_y = y[4187:6281]
# Divided dataset into 3 parts. I had 6281 images.
train_set = train_set_x, train_set_y
val_set = val_set_x, val_set_y
test_set = test_set_x, test_set_y
dataset = [train_set, val_set, test_set]
f = gzip.open('file.pkl.gz','wb')
cPickle.dump(dataset, f, protocol=2)
f.close()
This is the function I used. You may need to change it to match your own files.
def dir_to_dataset(glob_files, loc_train_labels=""):
    print("Gonna process:\n\t %s"%glob_files)
    dataset = []
    for file_count, file_name in enumerate( sorted(glob(glob_files),key=len) ):
        img = Image.open(file_name).convert('LA') #tograyscale
        pixels = [f[0] for f in list(img.getdata())]
        dataset.append(pixels)
        if file_count % 1000 == 0:
            print("\t %s files processed"%file_count)
    # outfile = glob_files+"out"
    # np.save(outfile, dataset)
    if len(loc_train_labels) > 0:
        df = pd.read_csv(loc_train_labels)
        return np.array(dataset), np.array(df["Class"])
    else:
        return np.array(dataset)
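The function above returns raw grey levels, while the earlier answer describes MNIST pixels as floats between 0 and 1. If the code that consumes the file expects that range, a scaling step may be needed before pickling. The sketch below assumes 8-bit images (values 0 to 255). The helper name scale_pixels and the sample array are hypothetical.

```python
import numpy as np


def scale_pixels(pixels):
    """Map 8-bit grey levels (0..255) to float32 values in [0, 1]."""
    return np.asarray(pixels, dtype="float32") / 255.0


# Hypothetical sample: two tiny "images" of three pixels each.
raw = np.array([[0, 128, 255], [64, 32, 16]], dtype="uint8")
scaled = scale_pixels(raw)
print(scaled.dtype, scaled.min(), scaled.max())
```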

