How to use datasets.fetch_mldata() in sklearn?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/19530383/
Asked by Patthebug
I am trying to run the following code for a brief machine learning algorithm:
import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata
dataDict = datasets.fetch_mldata('MNIST Original')
In this piece of code, I am trying to read the dataset 'MNIST Original' hosted at mldata.org via sklearn. This results in the following error (there are more lines of code, but the error occurs at this particular line):
Traceback (most recent call last):
File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
debugger.run(setup['file'], None, None)
File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
pydev_imports.execfile(file, globals, locals) #execute the script
File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
dataDict = datasets.fetch_mldata('MNIST Original')
File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
matfile_dict = MR.get_variables(variable_names)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
res = self.read_var_array(hdr, process)
File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
return self._matrix_reader.array_from_header(header, process)
File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes
I have tried researching this on the internet, but there is hardly any help available. Any expert help in resolving this error would be much appreciated.
TIA.
Answered by Lucas Ribeiro
That's 'MNIST original', with a lowercase 'o'.
Answered by Brent
Try it like this:
dataDict = fetch_mldata('MNIST original')
This worked for me. Since you used the from ... import ... syntax, you shouldn't prepend datasets when you call it.
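As a quick illustration of how the from ... import ... binding differs from the module-qualified form, here is a sketch using a stdlib module in place of sklearn (so it runs without any download):

```python
from collections import Counter  # binds the name Counter directly

c = Counter("aab")
print(c["a"])  # 2

# The module-qualified form works only if the module itself was imported:
import collections
assert collections.Counter is Counter
```

The same logic applies to fetch_mldata: after `from sklearn.datasets import fetch_mldata`, the bare name is bound, and `datasets.fetch_mldata(...)` would need a separate `from sklearn import datasets`.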
Answered by Szymon Laszczyński
It looks like the cached data are corrupted. Try removing them and downloading again (it takes a moment). Unless specified otherwise, the data for 'MNIST original' should be in
~/scikit_learn_data/mldata/mnist-original.mat
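A minimal sketch of clearing that cache so the next fetch_mldata call re-downloads the file (the path below is the scikit-learn default; adjust it if you passed a custom data_home):

```python
import os

# Default scikit-learn cache location for the mldata downloader
cache_path = os.path.expanduser("~/scikit_learn_data/mldata/mnist-original.mat")

if os.path.exists(cache_path):
    os.remove(cache_path)  # delete the (possibly corrupted) cached copy
```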
Answered by Martin Thoma
Here is some sample code showing how to get the MNIST data ready to use with sklearn:
def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both have the keys 'X' (features)
        and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1] - this is of major importance!
    x = x / 255.0 * 2 - 1

    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data
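The scaling line above maps raw pixel values in [0, 255] onto [-1, 1]; a small numpy check of the endpoints and midpoint:

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0])
scaled = pixels / 255.0 * 2 - 1  # 0 -> -1, 127.5 -> 0, 255 -> 1
print(scaled)  # [-1.  0.  1.]
```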
Answered by Victoria Stuart
I was also getting a fetch_mldata() "IOError: could not read bytes" error. Here is the solution; the relevant lines of code are:
from sklearn.datasets.mldata import fetch_mldata
mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')
... be sure to change 'data_home' to your preferred location (directory).
Here is a script:
#!/usr/bin/python
# coding: utf-8

# Source:
# https://stackoverflow.com/questions/19530383/how-to-use-datasets-fetch-mldata-in-sklearn
# ... modified, below, by Victoria

"""
pers. comm. (Jan 27, 2016) from MLdata.org MNIST dataset contactee "Cheng Ong:"
The MNIST data is called 'mnist-original'. The string you pass to sklearn
has to match the name of the URL:

from sklearn.datasets.mldata import fetch_mldata
data = fetch_mldata('mnist-original')
"""

def get_data():
    """
    Get MNIST data; returns a dict with keys 'train' and 'test'.
    Both have the keys 'X' (features) and 'y' (labels).
    """
    from sklearn.datasets.mldata import fetch_mldata
    mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')
    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1]
    x = x / 255.0 * 2 - 1

    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train, 'y': y_train},
            'test': {'X': x_test, 'y': y_test}}
    return data

data = get_data()
print '\n', data, '\n'
Answered by mcolak
If you don't give data_home, the program looks for ${yourprojectpath}/mldata/mnist-original.mat. You can download the file yourself and put it in that path.
Answered by YH Hsu
I experienced the same issue and found different file sizes of mnist-original.mat at different times while using my poor WiFi. I switched to a LAN connection and it works fine. It may be a networking issue.
Answered by Thang Tran
I also had this problem in the past. The dataset is quite large (about 55.4 MB), and when I ran fetch_mldata the download took a while over my internet connection. Not knowing this, I interrupted the process.
The dataset was left corrupted, and that is why the error happened.
Answered by ??????? ????
Apart from what @szymon mentioned, you can alternatively load the dataset using:
from six.moves import urllib
from scipy.io import loadmat

mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
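The reshaping in that snippet (transposing 'data', taking row 0 of 'label') can be checked on a tiny stand-in for the loadmat result; the shapes below mirror the MATLAB column-major layout of the real file, but the values are made up for illustration:

```python
import numpy as np

# Stand-in for mnist_raw = loadmat(...): 'data' is (n_features, n_samples)
# and 'label' is (1, n_samples) in the .mat file's MATLAB layout
fake_raw = {
    "data": np.zeros((784, 5)),
    "label": np.array([[0.0, 1.0, 2.0, 3.0, 4.0]]),
}
mnist = {
    "data": fake_raw["data"].T,      # -> one row per sample: (5, 784)
    "target": fake_raw["label"][0],  # -> flat label vector: (5,)
}
print(mnist["data"].shape, mnist["target"].shape)  # (5, 784) (5,)
```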
Answered by Soundous Bahri
I downloaded the dataset from this link:
https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat
then I typed these lines:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')
*** the path is (your working directory)/files/mldata/mnist-original.mat
I hope you get it; it worked well for me.