Python 带有索引的 Scikit-learn train_test_split

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/31521170/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 10:09:08  来源:igfitidea点击:

Scikit-learn train_test_split with indices

pythonscipyscikit-learnclassification

提问by CentAu

How do I get the original indices of the data when using train_test_split()?

使用 train_test_split() 时如何获取数据的原始索引?

What I have is the following

我所拥有的是以下内容

from sklearn.cross_validation import train_test_split
import numpy as np
data = np.reshape(np.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, size=0.2)

But this does not give the indices of the original data. One workaround is to add the indices to data (e.g. data = [(i, d) for i, d in enumerate(data)]) and then pass them inside train_test_splitand then expand again. Are there any cleaner solutions?

但这并没有给出原始数据的索引。一种解决方法是将索引添加到数据(例如data = [(i, d) for i, d in enumerate(data)]),然后将它们传递到内部train_test_split,然后再次展开。有没有更清洁的解决方案?

采纳答案by Julien Marrec

Scikit learn plays really well with Pandas, so I suggest you use it. Here's an example:

Scikit learn 与 Pandas 配合得很好,所以我建议你使用它。下面是一个例子:

In [1]: 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels

In [2]: # Giving columns in X a name
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

In [4]: X_test
Out[4]:

     Column_1    Column_2
2   -1.39       -1.86
8    0.48       -0.81
4   -0.10       -1.83

In [5]: y_test
Out[5]:

2    1
8    1
4    1
dtype: int32

You can directly call any scikit functions on DataFrame/Series and it will work.

您可以直接调用 DataFrame/Series 上的任何 scikit 函数,它会起作用。

Let's say you wanted to do a LogisticRegression, here's how you could retrieve the coefficients in a nice way:

假设你想做一个 LogisticRegression,下面是你如何以一种很好的方式检索系数:

In [6]: 
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
            Coefficient
Column_1    0.076987
Column_2    -0.352463

回答by ogrisel

You can use pandas dataframes or series as Julien said but if you want to restrict your-self to numpy you can pass an additional array of indices:

您可以使用 Julien 所说的 Pandas 数据帧或系列,但如果您想将自己限制为 numpy,您可以传递一个额外的索引数组:

from sklearn.model_selection import train_test_split
import numpy as np
n_samples, n_features, n_classes = 10, 2, 2
data = np.random.randn(n_samples, n_features)  # 10 training examples
labels = np.random.randint(n_classes, size=n_samples)  # 10 labels
indices = np.arange(n_samples)
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2)

回答by Jibwa

The docsmention train_test_split is just a convenience function on top of shuffle split.

文档提到train_test_split只是在洗牌分裂的顶部方便的功能。

I just rearranged some of their code to make my own example. Note the actual solution is the middle block of code. The rest is imports, and setup for a runnable example.

我只是重新排列了他们的一些代码来制作我自己的例子。请注意,实际的解决方案是中间的代码块。其余的是导入和可运行示例的设置。

from sklearn.model_selection import ShuffleSplit
from sklearn.utils import safe_indexing, indexable
from itertools import chain
import numpy as np
X = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
y = np.random.randint(2, size=10) # 10 labels
seed = 1

cv = ShuffleSplit(random_state=seed, test_size=0.25)
arrays = indexable(X, y)
train, test = next(cv.split(X=X))
iterator = list(chain.from_iterable((
    safe_indexing(a, train),
    safe_indexing(a, test),
    train,
    test
    ) for a in arrays)
)
X_train, X_test, train_is, test_is, y_train, y_test, _, _  = iterator

print(X)
print(train_is)
print(X_train)

Now I have the actual indexes: train_is, test_is

现在我有了实际的索引: train_is, test_is

回答by m3h0w

Here's the simplest solution (Jibwa made it seem complicated in another answer), without having to generate indices yourself - just using the ShuffleSplit object to generate 1 split.

这是最简单的解决方案(Jibwa 在另一个答案中使它看起来很复杂),无需自己生成索引 - 只需使用 ShuffleSplit 对象生成 1 个拆分。

import numpy as np 
from sklearn.model_selection import ShuffleSplit # or StratifiedShuffleSplit
sss = ShuffleSplit(n_splits=1, test_size=0.1)

data_size = 100
X = np.reshape(np.random.rand(data_size*2),(data_size,2))
y = np.random.randint(2, size=data_size)

sss.get_n_splits(X, y)
train_index, test_index = next(sss.split(X, y)) 

X_train, X_test = X[train_index], X[test_index] 
y_train, y_test = y[train_index], y[test_index]