How to import a pre-downloaded MNIST dataset from a specific directory or folder in Python?
Original question: http://stackoverflow.com/questions/48257255/
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors on Stack Overflow (not to this page).
Asked by Joshua
I have downloaded the MNIST dataset from the LeCun site. I want to write Python code that extracts the gzip files and reads the dataset directly from a local directory, so that I no longer have to download anything from, or access, the MNIST site.
Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoding)
How can I do this? Almost all tutorials access either the LeCun or the TensorFlow site to download and read the dataset. Thanks in advance!
Accepted answer by Maxim
This TensorFlow call:
from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')
... won't download anything if you already have the files there.
But if for some reason you wish to unzip the files yourself, here's how to do it:
# Note: tensorflow.contrib exists only in TensorFlow 1.x; it was removed in 2.0.
from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

# Each helper takes an open binary file object and decompresses the gzip stream itself.
with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
    train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
    train_labels = extract_labels(f)
with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
    test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    test_labels = extract_labels(f)
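The imports above rely on TensorFlow 1.x modules. As a minimal sketch of the same "extract gzip, then read" step without TensorFlow, the code below uses only the standard-library `gzip` and `struct` modules plus NumPy. It assumes the four `.gz` files are the unmodified IDX files from the MNIST site; the function names `read_idx_gz` and `load_mnist_from_dir` are illustrative and not part of any library.

```python
import gzip
import os
import struct

import numpy as np


def read_idx_gz(path):
    """Read one gzip-compressed IDX file (MNIST images or labels) into a uint8 array."""
    with gzip.open(path, 'rb') as f:
        # Header: two zero bytes, a type code (0x08 = unsigned byte), number of dimensions.
        zero, type_code, n_dims = struct.unpack('>HBB', f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError('not an unsigned-byte IDX file: %s' % path)
        # Each dimension size is a big-endian 32-bit unsigned integer.
        shape = struct.unpack('>' + 'I' * n_dims, f.read(4 * n_dims))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


def load_mnist_from_dir(folder):
    """Hypothetical helper: load the four standard MNIST files found in `folder`."""
    def p(name):
        return os.path.join(folder, name)
    return (read_idx_gz(p('train-images-idx3-ubyte.gz')),
            read_idx_gz(p('train-labels-idx1-ubyte.gz')),
            read_idx_gz(p('t10k-images-idx3-ubyte.gz')),
            read_idx_gz(p('t10k-labels-idx1-ubyte.gz')))
```

With the files in place, `load_mnist_from_dir('my/directory')` should return the training images, training labels, test images, and test labels as `uint8` arrays, with the images shaped `(N, 28, 28)`.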
Answer by mxmlnkn
If you have already extracted the MNIST data, you can load it at a low level directly with NumPy:
import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    # image header: magic number, image count, rows, columns
    magicBytes, nImages, height, width = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, height, width ] )

    # label header: magic number and label count (2 integers), then one byte per label
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
And to convert the labels to one-hot encoding:
def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ),
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
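Because `toHotEncoding` derives the number of columns from `np.max( classification ) + 1`, a subset of labels that happens to lack the digit 9 would produce fewer than ten columns. A minimal sketch of an alternative that fixes the number of classes up front follows; the function name `to_one_hot` and the `num_classes` parameter are illustrative, not taken from the answer.

```python
import numpy as np


def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float32 one-hot matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        raise ValueError('label out of range for num_classes=%d' % num_classes)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


print(to_one_hot([5, 0, 4]).shape)  # (3, 10)
```

Fixing `num_classes` keeps the training and test label matrices the same width even when one of them is a small sample.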
Answer by Jayhello
I will show how to load it from scratch (for better understanding), and how to display a digit image from it with matplotlib.pyplot.
import gzip
import pickle

import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    # mnist.pkl.gz is commonly a Python 2 pickle; Python 3 then needs encoding='latin1'
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print(img1_arr.shape, img1_label)
    # (784,) 5

    # reshape the first image (1-D vector) into a 2-D image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()
You can also convert a label into a 10-dimensional unit vector (a one-hot column vector) with this sample function:
def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e
Vectorize the label of the first training image:
print(vectorized_result(5))  # 5 is img1_label, the label of the first training image
# output as below:
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]]
If you want to use the data as CNN input, you can reshape it like this:
import pickle

def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    # mnist.pkl.gz is commonly a Python 2 pickle; Python 3 then needs encoding='latin1'
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # reshape every flat 784-pixel row to 28x28 and one-hot encode every label
    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])
    print(X_train.shape, y_train.shape)
    # (50000, 28, 28) (50000, 10, 1)
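The list comprehensions above loop over 50,000 items in Python. As a rough sketch, the same reshaping can be done in one vectorized step. Many CNN frameworks also expect a trailing channel axis and labels of shape `(N, 10)` rather than `(N, 10, 1)`, but the required layout depends on your framework, so treat the shapes below as assumptions. The function name `prepare_for_cnn` is illustrative, and the input is random stand-in data, not MNIST.

```python
import numpy as np


def prepare_for_cnn(X_flat, y, num_classes=10):
    """Reshape flat 784-pixel rows to (N, 28, 28, 1) and one-hot encode the labels."""
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    # Row i of the identity matrix is the one-hot vector for class i.
    Y = np.eye(num_classes, dtype=np.float32)[np.asarray(y, dtype=np.int64)]
    return X, Y


# Stand-in data with the same layout as the pickled training set (not real MNIST).
rng = np.random.default_rng(0)
X_demo = rng.random((4, 784), dtype=np.float32)
y_demo = np.array([5, 0, 4, 1])

X_cnn, Y_cnn = prepare_for_cnn(X_demo, y_demo)
print(X_cnn.shape, Y_cnn.shape)  # (4, 28, 28, 1) (4, 10)
```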