Python: How to import a pre-downloaded MNIST dataset from a specific directory or folder?

Question

Asked by Joshua

I have downloaded the MNIST dataset from LeCun's site. I want to write Python code that extracts the gzip files and reads the dataset directly from the directory, so that I no longer have to download from or access the MNIST site.

Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoding)

How can I do this? Almost all tutorials access either the LeCun or the TensorFlow site to download and read the dataset. Thanks in advance!

Answer 1

Accepted answer by Maxim

This TensorFlow call

# TensorFlow 1.x only; pass one_hot=True to get one-hot encoded labels.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('my/directory')

... won't download anything if you already have the files there.

But if for some reason you wish to unzip it yourself, here's how you do it:

from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
  train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
  train_labels = extract_labels(f)

with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
  test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
  test_labels = extract_labels(f)
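
The same files can also be read without TensorFlow, which may help because tf.contrib was removed in TensorFlow 2.x. The following is only a sketch, not part of the answer above: it assumes the standard IDX layout of the MNIST files (a big-endian header with magic number 2051 for images and 2049 for labels), and the names read_idx_images and read_idx_labels are made up for illustration.

```python
import gzip
import struct

import numpy as np


def read_idx_images(path):
    """Read a gzipped IDX3 image file into a (count, rows, cols) uint8 array."""
    with gzip.open(path, 'rb') as f:
        # Header: magic number, image count, rows, cols (big-endian uint32 each).
        magic, count, rows, cols = struct.unpack('>IIII', f.read(16))
        if magic != 2051:
            raise ValueError('unexpected magic number %d in %s' % (magic, path))
        pixels = np.frombuffer(f.read(count * rows * cols), dtype=np.uint8)
    return pixels.reshape(count, rows, cols)


def read_idx_labels(path, num_classes=10):
    """Read a gzipped IDX1 label file and return one-hot rows, shape (count, num_classes)."""
    with gzip.open(path, 'rb') as f:
        # Header: magic number, label count (big-endian uint32 each).
        magic, count = struct.unpack('>II', f.read(8))
        if magic != 2049:
            raise ValueError('unexpected magic number %d in %s' % (magic, path))
        labels = np.frombuffer(f.read(count), dtype=np.uint8)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


# Hypothetical usage, reusing the directory from the answer above:
# train_images = read_idx_images('my/directory/train-images-idx3-ubyte.gz')
# train_labels = read_idx_labels('my/directory/train-labels-idx1-ubyte.gz')
```

Because gzip.open decompresses on the fly, no separate extraction step is needed with this approach.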

Answer 2

Answer by mxmlnkn

If you have already extracted the MNIST data, you can load it at a low level with NumPy directly:

import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    # image header: magic number, image count, rows, columns
    magicBytes, nImages, nRows, nCols = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, nRows, nCols ] )

    # label header: magic number and label count, then one byte per label
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
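
loadMNIST expects uncompressed files. If the folder still holds the downloaded .gz archives, they could be unpacked first with the standard library. The sketch below is an assumption about how one might do that; gunzip_folder is a hypothetical helper name, and it writes each extracted file next to its archive.

```python
import glob
import gzip
import os
import shutil


def gunzip_folder(folder):
    """Decompress every *.gz file in `folder` and return the extracted paths."""
    extracted = []
    for gz_path in sorted(glob.glob(os.path.join(folder, '*.gz'))):
        out_path = gz_path[:-len('.gz')]
        if not os.path.exists(out_path):  # skip files that were already extracted
            with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
                shutil.copyfileobj(src, dst)  # stream, so the file is not held in memory
        extracted.append(out_path)
    return extracted


# Hypothetical usage with the folder from the example above:
# gunzip_folder('../datasets/mnist')
```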

And to convert the labels to one-hot encoding:

def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ), 
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
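
One caveat with sizing the matrix from np.max: a label subset that happens to lack the digit 9 would produce fewer than 10 columns. A possible variant, sketched here with a hypothetical function name and an explicit num_classes argument, indexes an identity matrix instead:

```python
import numpy as np


def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]
```

With this variant the column count stays at num_classes regardless of which digits appear in the input.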

Answer 3

Answer by Jayhello

I will show how to load it from scratch (for better understanding), and how to display a digit image from it with matplotlib.pyplot.

import pickle  # Python 3; the original Python 2 code used cPickle
import gzip
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    # encoding='latin1' assumes the file was pickled under Python 2
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print(img1_arr.shape, img1_label)
    # (784,) 5

    # reshape the first image (1D vector) to a 2D image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()

You can also convert a label to a 10-dimensional unit vector with this sample function:

def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e

Vectorize the above label:

print(vectorized_result(img1_label))
# output as below:
[[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 1.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]

If you want to convert it to CNN input, you can reshape it like this:

import pickle  # Python 3; the original Python 2 code used cPickle

def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    # encoding='latin1' assumes the file was pickled under Python 2
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # reshape each flat image to 28x28 and turn each label into a unit vector
    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])

    print(X_train.shape, y_train.shape)
    # (50000, 28, 28) (50000, 10, 1)
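
The per-item list comprehensions above can be replaced by single array operations. The sketch below is an illustration only: it uses random stand-in data rather than the real pickle file, the name to_cnn_input is hypothetical, and the trailing channel axis is an assumption (many CNN frameworks expect (N, 28, 28, 1), but check the one you use). Note that it yields labels of shape (N, 10) rather than (N, 10, 1).

```python
import numpy as np


def to_cnn_input(X_flat, y, num_classes=10):
    """Reshape flat 784-pixel rows to (N, 28, 28, 1); one-hot encode labels to (N, num_classes)."""
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    Y = np.eye(num_classes, dtype=np.float32)[np.asarray(y, dtype=np.int64)]
    return X, Y


# Stand-in data with the same layout as training_data in mnist.pkl.gz.
rng = np.random.default_rng(0)
X_demo = rng.random((5, 784), dtype=np.float32)
y_demo = np.array([5, 0, 4, 1, 9])
X_cnn, Y_cnn = to_cnn_input(X_demo, y_demo)
print(X_cnn.shape, Y_cnn.shape)  # (5, 28, 28, 1) (5, 10)
```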
