How to use datasets.fetch_mldata() in sklearn?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19530383/
Asked by Patthebug
I am trying to run the following code for a brief machine learning algorithm:
import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata
dataDict = datasets.fetch_mldata('MNIST Original')
In this piece of code, I am trying to read the dataset 'MNIST Original' hosted at mldata.org via sklearn. This results in the following error (there are more lines of code, but the error occurs at this particular line):
Traceback (most recent call last):
File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
debugger.run(setup['file'], None, None)
File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
pydev_imports.execfile(file, globals, locals) #execute the script
File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
dataDict = datasets.fetch_mldata('MNIST Original')
File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
matfile_dict = MR.get_variables(variable_names)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
res = self.read_var_array(hdr, process)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
return self._matrix_reader.array_from_header(header, process)
File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes
I have tried researching this on the internet, but there is hardly any help available. Any expert help in resolving this error would be much appreciated.
TIA.
Answered by Lucas Ribeiro
That's 'MNIST original', with a lowercase 'o'.
Answered by Brent
Try it like this:
dataDict = fetch_mldata('MNIST original')
This worked for me. Since you used the from ... import ... syntax, you shouldn't prepend datasets when you call it.
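As a quick illustration of how the from ... import ... binding differs from the module-qualified form, here is a sketch using a stdlib module in place of sklearn (so it runs without any download):

```python
from collections import Counter  # binds the name Counter directly

c = Counter("aab")
print(c["a"])  # 2

# The module-qualified form works only if the module itself was imported:
import collections
assert collections.Counter is Counter
```

The same logic applies to fetch_mldata: after `from sklearn.datasets import fetch_mldata`, the bare name is bound, and `datasets.fetch_mldata(...)` would need a separate `from sklearn import datasets`.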
Answered by Szymon Laszczyński
It looks like the cached data are corrupted. Try removing them and downloading again (it takes a moment). Unless specified otherwise, the data for 'MNIST original' should be in
~/scikit_learn_data/mldata/mnist-original.mat
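A minimal sketch of clearing that cache so the next fetch_mldata call re-downloads the file (the path below is the scikit-learn default; adjust it if you passed a custom data_home):

```python
import os

# Default scikit-learn cache location for the mldata downloader
cache_path = os.path.expanduser("~/scikit_learn_data/mldata/mnist-original.mat")

if os.path.exists(cache_path):
    os.remove(cache_path)  # delete the (possibly corrupted) cached copy
```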
Answered by Martin Thoma
Here is some sample code showing how to get the MNIST data ready to use with sklearn:
def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both have the keys 'X' (features)
        and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1] - this is of major importance!
    x = x / 255.0 * 2 - 1

    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data
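The scaling line above maps raw pixel values in [0, 255] onto [-1, 1]; a small numpy check of the endpoints and midpoint:

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0])
scaled = pixels / 255.0 * 2 - 1  # 0 -> -1, 127.5 -> 0, 255 -> 1
print(scaled)  # [-1.  0.  1.]
```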
Answered by Victoria Stuart
I was also getting a fetch_mldata() "IOError: could not read bytes" error. Here is the solution; the relevant lines of code are:
from sklearn.datasets.mldata import fetch_mldata
mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')
... be sure to change 'data_home' to your preferred location (directory).
Here is a script:
#!/usr/bin/python
# coding: utf-8

# Source:
# https://stackoverflow.com/questions/19530383/how-to-use-datasets-fetch-mldata-in-sklearn
# ... modified, below, by Victoria

"""
pers. comm. (Jan 27, 2016) from MLdata.org MNIST dataset contactee "Cheng Ong:"
The MNIST data is called 'mnist-original'. The string you pass to sklearn
has to match the name of the URL:

from sklearn.datasets.mldata import fetch_mldata
data = fetch_mldata('mnist-original')
"""

def get_data():
    """
    Get MNIST data; returns a dict with keys 'train' and 'test'.
    Both have the keys 'X' (features) and 'y' (labels).
    """
    from sklearn.datasets.mldata import fetch_mldata
    mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')
    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1]
    x = x / 255.0 * 2 - 1

    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train, 'y': y_train},
            'test': {'X': x_test, 'y': y_test}}
    return data

data = get_data()
print '\n', data, '\n'
Answered by mcolak
If you don't give data_home, the program looks for ${yourprojectpath}/mldata/mnist-original.mat. You can download the file yourself and put it in that path.
Answered by YH Hsu
I experienced the same issue and found different file sizes of mnist-original.mat at different times while using my poor WiFi. I switched to a LAN connection and it works fine. It may be a networking issue.
Answered by Thang Tran
I also had this problem in the past. The dataset is quite large (about 55.4 MB), and when I ran fetch_mldata the download took a while over my internet connection. Not knowing this, I interrupted the process.
The dataset was left corrupted, and that is why the error happened.
Answered by ??????? ????
Apart from what @szymon mentioned, you can alternatively load the dataset using:
from six.moves import urllib
from scipy.io import loadmat

mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
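The reshaping in that snippet (transposing 'data', taking row 0 of 'label') can be checked on a tiny stand-in for the loadmat result; the shapes below mirror the MATLAB column-major layout of the real file, but the values are made up for illustration:

```python
import numpy as np

# Stand-in for mnist_raw = loadmat(...): 'data' is (n_features, n_samples)
# and 'label' is (1, n_samples) in the .mat file's MATLAB layout
fake_raw = {
    "data": np.zeros((784, 5)),
    "label": np.array([[0.0, 1.0, 2.0, 3.0, 4.0]]),
}
mnist = {
    "data": fake_raw["data"].T,      # -> one row per sample: (5, 784)
    "target": fake_raw["label"][0],  # -> flat label vector: (5,)
}
print(mnist["data"].shape, mnist["target"].shape)  # (5, 784) (5,)
```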
Answered by Soundous Bahri
I downloaded the dataset from this link:
https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat
then I typed these lines:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')
*** the path is (your working directory)/files/mldata/mnist-original.mat
I hope you get it; it worked well for me.