Python sklearn stratified sampling based on a column

Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original URL: http://stackoverflow.com/questions/36997619/

Time: 2020-08-19 18:39:16  Source: igfitidea

sklearn stratified sampling based on a column

python pandas scikit-learn sklearn-pandas

Asked by Muhammad Ali Zia

I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that the split preserves the proportions of the values in one column (Categories), i.e. every review category is present in both the train and test data in the same proportion.

The data looks like this:

ReviewerID    ReviewText        Categories    ProductId
1212          good product      Mobile        14444425
1233          will buy again    drugs         324532
5432          not recomended    dvd           789654123

I'm using the following code to do so:

    import pandas as pd
    Meta = pd.read_csv('C:\Users\xyz\Desktop\WM Project\Joined.csv')
    import numpy as np
    from sklearn.cross_validation import train_test_split

    train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

It gives the following error:

NameError: name 'y' is not defined

As I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will stratify based on the categories column. It seems to work fine when I remove both the stratify option and the categories column from the train-test split.

Any help will be appreciated.

Answered by nEO

    >>> import pandas as pd
    >>> Meta = pd.read_csv(r'C:\Users\*****\Downloads\so\Book1.csv')  # raw string so the backslashes are not treated as escapes
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
        ReviewerID      ReviewText  ProductId
        0        1212    good product   14444425
        1        1233  will buy again     324532
        2        5432  not recomended  789654123
    >>> y
        0    Mobile
        1     drugs
        2       dvd
        Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
        ReviewerID    ReviewText  ProductId
        0        1212  good product   14444425
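
If you would rather keep the Categories column inside the frame, the whole DataFrame can be split in one call. The following is only a minimal sketch under stated assumptions: the data is made up (100 rows with a 50/30/20 category mix) so that every category has several rows, and the column names simply mirror the sample in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data for illustration only: 100 reviews with a 50/30/20 category mix.
categories = ['Mobile'] * 50 + ['drugs'] * 30 + ['dvd'] * 20
df = pd.DataFrame({
    'ReviewerID': range(1000, 1100),
    'ReviewText': ['sample review'] * 100,
    'Categories': categories,
    'ProductId': range(5000, 5100),
})

# Split the whole frame; the column passed to `stratify` supplies the
# class labels whose proportions are preserved in both parts.
train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['Categories'])

# Compare the category mix of the two parts.
print(train['Categories'].value_counts(normalize=True))
print(test['Categories'].value_counts(normalize=True))
```

Both printed distributions should match the mix of the full frame, which is a quick way to check that the stratification did what you wanted.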

Answered by su79eu7k

sklearn.model_selection.train_test_split

stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the class labels.
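
One practical consequence of using the array as class labels: as far as I can tell, every label needs at least two rows, otherwise the stratified split raises a ValueError. The sketch below uses a made-up label Series to show this, plus one possible workaround (dropping categories that occur only once). The filtering step is my own assumption about what you might want, not something the docs prescribe.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 'dvd' occurs only once, so it cannot be stratified.
labels = pd.Series(['Mobile', 'Mobile', 'drugs', 'drugs', 'dvd'])

try:
    train_test_split(labels, test_size=0.4, stratify=labels)
    raised = False
except ValueError as err:
    raised = True
    print('Stratified split failed:', err)

# Possible workaround (an assumption, adapt to your data):
# keep only the categories that have at least two rows.
counts = labels.value_counts()
kept = labels[labels.map(counts) >= 2]

train, test = train_test_split(
    kept, test_size=0.5, random_state=0, stratify=kept)
print(sorted(test.tolist()))
```

Whether dropping rare categories is acceptable depends on the data. Merging them into an "other" class is another option.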

According to the API docs, I think you should try something like X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y).

You need to assign Meta_X and Meta_Y properly yourself (based on your code, I think Meta_Y should be Meta.categories).
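
To make that concrete, here is a rough sketch of one way the two variables could be assigned. The small DataFrame is a hypothetical stand-in for the CSV in the question, and I am assuming the label column is named Categories as in the sample table. Use whatever name your file actually has, since column names are case-sensitive.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's CSV (10 rows, two categories).
Meta = pd.DataFrame({
    'ReviewerID': range(10),
    'ReviewText': ['some review text'] * 10,
    'Categories': ['Mobile'] * 5 + ['drugs'] * 5,
    'ProductId': range(100, 110),
})

# Labels to stratify on, and the remaining columns as features.
Meta_Y = Meta['Categories']
Meta_X = Meta.drop(columns=['Categories'])

# `stratify` must be given the actual label array; an undefined name
# such as `y` is what triggers the NameError in the question.
X_train, X_test, y_train, y_test = train_test_split(
    Meta_X, Meta_Y, test_size=0.2, random_state=42, stratify=Meta_Y)

print(y_train.value_counts())
print(y_test.value_counts())
```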
