Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/36256708/

Date: 2020-09-14 00:56:55  Source: igfitidea

Preparing CSV file data for Scikit-Learn Using Pandas?

Tags: python, csv, pandas, scikit-learn

Asked by KingPolygon

I have a CSV file without headers that I'm importing into Python using pandas. The last column is the target class, and the remaining columns are pixel values for images. How can I split this dataset into a training set and a testing set (80/20) using pandas?

Once that is done, how would I split each of those sets so that I can define x (all columns except the last one) and y (the last column)?

I've imported my file using:

import pandas as pd
dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

Answered by ayhan

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# On scikit-learn older than 0.18, import from sklearn.cross_validation instead:
# from sklearn.cross_validation import train_test_split

# Features: every column except the last; target: the last column
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
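
For a self-contained illustration, here is a minimal sketch of the whole flow on made-up data. The in-memory CSV text is invented for the example and only stands in for example.csv; everything else follows the snippet above.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for example.csv: 10 rows, 3 "pixel" columns, last column is the class
csv_text = "\n".join(f"{i},{i + 1},{i + 2},{i % 2}" for i in range(10))
dataset = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None, pandas labels the columns 0, 1, 2, ...
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```

Because X and y are split in a single call, the rows of X_train and y_train stay aligned by index.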

Answered by Kartik

You can simply do:

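import numpy as np
# Randomly pick 80% of the row labels (without replacement) for the training set
train_idx = np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, train_idx)
training = dataset[choices]
testing = dataset[~choices]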

Then, to pass it as x and y to Scikit-Learn:

scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])
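
If you would rather stay in pandas alone, as the question asks, one alternative is DataFrame.sample followed by DataFrame.drop. This is a different technique from the NumPy mask above, shown here as a rough sketch on made-up data; it assumes the index has no duplicate labels, which holds for a freshly read CSV.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the CSV: 10 rows, 3 feature columns plus a class column
dataset = pd.DataFrame(np.arange(40).reshape(10, 4))

# Draw 80% of the rows at random; the remaining rows form the test set
training = dataset.sample(frac=0.8, random_state=1)
testing = dataset.drop(training.index)

# Features are every column but the last; the target is the last column
x_train, y_train = training.iloc[:, :-1], training.iloc[:, -1]
x_test, y_test = testing.iloc[:, :-1], testing.iloc[:, -1]

print(len(training), len(testing))  # 8 2
```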

Let me know if this doesn't work.

Answered by Randhawa

You can try this.

Separating the target class from the rest:

# All columns except the last hold the pixel values; the last column is the target
pixel_values = dataset[dataset.columns[:-1]]
target_class = dataset[dataset.columns[-1]]

Now to create test and training samples:

I would just use NumPy's random.rand:

import numpy as np
# Each row is assigned to the training set with probability 0.8
mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]

Now you have training and test samples in train and test, in a roughly 80:20 ratio (the random mask makes the split approximate rather than exact).
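
To keep the labels aligned with the features, the same mask can be applied to the target column as well. The sketch below uses made-up data and NumPy's Generator API (np.random.default_rng) instead of np.random.rand so that the result is reproducible; the names y_train and y_test are chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: 1000 rows, 3 "pixel" columns and a final class column
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(1000, 4)))
pixel_values, target_class = dataset.iloc[:, :-1], dataset.iloc[:, -1]

# Each row independently lands in the training set with probability 0.8
mask = rng.random(len(dataset)) < 0.8

# Reuse the same mask so features and labels stay aligned
train, test = pixel_values[mask], pixel_values[~mask]
y_train, y_test = target_class[mask], target_class[~mask]

print(len(train), len(test))  # sizes depend on the seed; roughly 800 and 200
```

If you need an exact 80/20 split rather than an approximate one, sampling a fixed number of rows is the safer choice.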
