pandas sklearn.cross_validation.StratifiedShuffleSplit - error: "indices are out-of-bounds"

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/30023927/

Date: 2020-09-13 23:18:28  Source: igfitidea

Tags: python, pandas, scikit-learn

Asked by Jason

I was trying to split a sample dataset using Scikit-learn's Stratified Shuffle Split, following the example shown in the Scikit-learn documentation:

import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)

# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)

for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]

# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()

However, upon running this script, I get the following error:

IndexError: indices are out-of-bounds

Could someone please point out what I am doing wrong here? Thanks!

Answered by Mark Dickinson

You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index into that object, as in data[train_index], Pandas expects train_index to contain column labels rather than row indices. You can either convert the dataframe to a NumPy array, using .values:

data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]

or use the Pandas .iloc accessor:

for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]
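
The difference between the two indexing conventions can be seen on a tiny frame. This is only a sketch: the five-row DataFrame and the rows array are hypothetical stand-ins for the wine data and for train_index. The exception type is caught broadly because older Pandas versions raise IndexError here, while newer ones raise KeyError.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame standing in for the wine data.
data = pd.DataFrame({"alcohol": [9.4, 9.8, 10.0, 11.2, 12.1],
                     "pH": [3.51, 3.20, 3.26, 3.16, 3.30]})
rows = np.array([0, 2, 4])  # positional row indices, like train_index

# data[rows] treats 0, 2 and 4 as column labels, which do not exist here.
try:
    data[rows]
    lookup_failed = False
except (KeyError, IndexError):
    lookup_failed = True

by_position = data.iloc[rows]  # DataFrame: rows picked by position, labels kept
as_array = data.values[rows]   # plain ndarray: same rows, column labels dropped
```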

I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
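
Note that the sklearn.cross_validation module used in the question has since been removed from scikit-learn. In current releases the splitter lives in sklearn.model_selection, takes n_splits instead of n_iter, and receives the data through its split() method. A minimal sketch of the .iloc approach with that API follows. It uses a synthetic dataset (hypothetical column names and class sizes) in place of the wine CSV, and indexes target with .iloc as well, which does not depend on the Series having a default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data (hypothetical values and columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
target = pd.Series(np.repeat([5, 6], [60, 40]), name="quality")

# The splitter is configured first; the data is passed to split().
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

for train_index, test_index in sss.split(data, target):
    # Positional indexing keeps DataFrame/Series types and labels.
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]

# Class proportions in each part should mirror those of the full target.
train_counts = ytrain.value_counts()
test_counts = ytest.value_counts()
```

Here xtrain and xtest are DataFrames with the original column labels, and train_counts and test_counts can be compared against target.value_counts() to check the stratification.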