Python: Loading a dataset from file, to use with sklearn/numpy, including labels

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15109165/

Date: 2020-08-18 13:21:00  Source: igfitidea

Tags: python, numpy, scikit-learn, dataset

Asked by shn

I saw that sklearn provides some predefined datasets. For example, with mydataset = datasets.load_digits() we can get an array (a numpy array?) of the data, mydataset.data, and an array of the corresponding labels, mydataset.target. However, I want to load my own dataset so that I can use it with sklearn. How, and in which format, should I load my data? My file has the following format (each line is a data point):

-0.2080,0.3480,0.3280,0.5040,0.9320,1.0000,label1
-0.2864,0.1992,0.2822,0.4398,0.7012,0.7800,label3
...
...
-0.2348,0.3826,0.6142,0.7492,0.0546,-0.4020,label2
-0.1856,0.3592,0.7126,0.7366,0.3414,0.1018,label1

Accepted answer by Ando Saabas

You can use numpy's genfromtxt function to read data from the file (http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html):

import numpy as np
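# filename is the path to your comma-separated file; with the default float dtype,
# non-numeric columns (such as the labels) are read as nan
mydata = np.genfromtxt(filename, delimiter=",")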

However, if you have textual columns, using genfromtxt is trickier, since you need to specify the data types.
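One way to sidestep the dtype specification, sketched here as an assumption rather than as part of the original answer, is to read the file twice with usecols: once for the six numeric columns and once for the label column as strings. The in-memory sample below stands in for the real file and reuses rows from the question; the names text, data and target are illustrative.

```python
import io
import numpy as np

# Stand-in for the file on disk; the rows are copied from the question.
text = (
    "-0.2080,0.3480,0.3280,0.5040,0.9320,1.0000,label1\n"
    "-0.2864,0.1992,0.2822,0.4398,0.7012,0.7800,label3\n"
    "-0.2348,0.3826,0.6142,0.7492,0.0546,-0.4020,label2\n"
)

# Pass 1: read only the six numeric columns as floats.
data = np.genfromtxt(io.StringIO(text), delimiter=",", usecols=(0, 1, 2, 3, 4, 5))

# Pass 2: read only the last column, as strings.
target = np.genfromtxt(io.StringIO(text), delimiter=",", usecols=(6,), dtype=str)

print(data.shape)    # two-dimensional: one row per data point
print(target)        # one label per data point
```

With a real file, the two io.StringIO(text) arguments would be replaced by the file path.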

It will be much easier with the excellent Pandas library (http://pandas.pydata.org/):

import pandas as pd
mydata = pd.read_csv(filename)
# Assumes your CSV has a header row and the label column is named "Label"
target = mydata["Label"]

# Select all but the last column as data (positional indexing)
data = mydata.iloc[:, :-1]
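The sample file in the question has no header row, so the snippet above would need adjusting for it. The following is a minimal sketch under that assumption: it passes header=None to read_csv and splits the columns by position. The in-memory sample and the names df, X and y are illustrative, not from the original answer.

```python
import io
import pandas as pd

# Stand-in for the file on disk; the rows are copied from the question.
sample = io.StringIO(
    "-0.2080,0.3480,0.3280,0.5040,0.9320,1.0000,label1\n"
    "-0.2864,0.1992,0.2822,0.4398,0.7012,0.7800,label3\n"
    "-0.2348,0.3826,0.6142,0.7492,0.0546,-0.4020,label2\n"
)

# header=None: treat the first line as data, not as column names.
df = pd.read_csv(sample, header=None)

# All columns except the last are features; the last column holds the labels.
X = df.iloc[:, :-1].to_numpy()   # 2-D float array, one row per data point
y = df.iloc[:, -1].to_numpy()    # 1-D array of label strings

print(X.shape, y.shape)
```

X and y here play the roles of mydataset.data and mydataset.target from the question. scikit-learn classifiers generally accept string class labels in y, so the pair can typically be passed to an estimator's fit(X, y).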