Python Scikit-learn balanced subsampling

Note: the content below is taken from a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/23455728/

Scikit-learn balanced subsampling

Tags: python, pandas, scikit-learn, subsampling

Asked by mikkom

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and can be overlapping, as I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called spreadsubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting but that's not what I'm looking for.)

Accepted answer by mikkom

Here is my first version, which seems to be working fine; feel free to copy it or make suggestions on how it could be more efficient (I have quite a long experience with programming in general, but not that long with Python or numpy).

This function creates a single random balanced subsample.

edit: The subsample size now also samples down minority classes; this should probably be changed.

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):

    class_xs = []
    min_elems = None

    # group the rows of x by class label and track the smallest class size
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # draw use_elems random rows from every class
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
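
For reference, a minimal usage sketch (the synthetic 90/10 data below is made up for illustration):

    import numpy as np

    # hypothetical unbalanced data: 90 samples of class 0, 10 of class 1
    X = np.random.randn(100, 3)
    y = np.array([0] * 90 + [1] * 10)

    X_bal, y_bal = balanced_subsample(X, y)
    print(np.bincount(y_bal.astype(int)))  # [10 10] -- both classes at the minority size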

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes (a complete adapted sketch follows the list):

对于尝试使用 Pandas DataFrame 进行上述操作的任何人,您需要进行一些更改:

  1. Replace the np.random.shuffle line with

    this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

  2. Replace the np.concatenate lines with

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')

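Putting the two changes together, a DataFrame-compatible variant might look like the sketch below (untested; the balanced_subsample_df name is mine, and x is assumed to be a DataFrame with y a Series aligned to its index):

    import numpy as np
    import pandas as pd

    def balanced_subsample_df(x, y, subsample_size=1.0):
        class_xs = []
        min_elems = None

        for yi in np.unique(y):
            elems = x[(y == yi)]
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]

        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems * subsample_size)

        xs = []
        ys = []

        for ci, this_xs in class_xs:
            if len(this_xs) > use_elems:
                # shuffle by reindexing with a random permutation of the index
                this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

            x_ = this_xs[:use_elems]
            y_ = np.empty(use_elems)
            y_.fill(ci)

            xs.append(x_)
            ys.append(y_)

        xs = pd.concat(xs)
        ys = pd.Series(data=np.concatenate(ys), name='target')

        return xs, ys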

Answer by eickenberg

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same unbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code therein and change the imposed ratio to 50/50 always.

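For orientation, a minimal sketch of how this splitter is invoked (using the modern sklearn.model_selection path rather than the sklearn.cross_validation module named above; the 90/10 data is made up):

    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit

    X = np.random.randn(100, 3)
    y = np.array([0] * 90 + [1] * 10)

    # each subsample keeps the original 90/10 class ratio -- not 50/50
    sss = StratifiedShuffleSplit(n_splits=5, train_size=0.5)
    for train_idx, _ in sss.split(X, y):
        X_sub, y_sub = X[train_idx], y[train_idx]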

(This would probably be a very good contribution to scikit-learn if you feel up to it.)

Answer by hernan

My subsampler version; hope this helps:

import random

def subsample_indices(y, size):
    """Return a dict mapping each class label to a balanced list of indices."""
    indices = {}
    target_values = set(y)  # the original referenced y_train here; y is what's meant
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # cap every class at the requested size or the smallest class, whichever is smaller
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])

Answer by beingzy

Below is my Python implementation for creating a balanced data copy. Assumptions: (1) the target variable (y) is binary (0 vs. 1); (2) 1 is the minority class.

from numpy import unique
from numpy import random

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling the minority class;
        the current version is developed on the assumption that the
        positive class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample the positive (minority) label up to the majority count
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
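
A minimal usage sketch, assuming a made-up 0/1-labeled array (shapes and seed are illustrative):

    import numpy as np

    X = np.arange(20).reshape(10, 2)               # 10 observations, 2 features
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # class 1 is the minority

    X_bal, y_bal = balanced_sample_maker(X, y, random_seed=42)
    print(np.bincount(y_bal))  # [7 7] -- class 1 oversampled up to the majority count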

Answer by gc5

A version for pandas Series:

import numpy as np

def balanced_subsample(y, size=None):
    """Return a list of index labels forming a balanced subsample of y."""
    subsample = []

    if size is None:
        # by default, sample every class down to the rarest class's count
        n_smp = y.value_counts().min()
    else:
        # otherwise split the requested total size evenly across classes
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
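
A usage sketch, assuming X is a DataFrame and y a Series sharing the same index (the data is made up; note the function returns index labels, not rows):

    import pandas as pd

    y = pd.Series([0] * 8 + [1] * 2)
    X = pd.DataFrame({'a': range(10), 'b': range(10)})

    idx = balanced_subsample(y)           # two index labels per class
    X_bal, y_bal = X.loc[idx], y.loc[idx]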

Answer by Kevin Mader

Here is a version of the above code that works for multiclass groups (in my tested case, groups 0, 1, 2, 3, 4):

import numpy as np

def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling every class
        with sample_size observations (with replacement).

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # resample the observations of every label to sample_size
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():  # iteritems() in the original is Python 2 only
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This version also returns the indices, so they can be used for other datasets and to keep track of how frequently each sample was used (helpful for training).

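A usage sketch with five made-up classes (note that every class, including the largest, is resampled to sample_size):

    import numpy as np

    X = np.random.randn(50, 4)
    y = np.repeat([0, 1, 2, 3, 4], 10)  # five classes, 10 observations each

    X_bal, y_bal, used_idx = balanced_sample_maker(X, y, sample_size=6, random_seed=0)
    print(np.bincount(y_bal))           # [6 6 6 6 6]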

Answer by kadu

Although it has already been answered, I stumbled upon your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:

from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array  # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n_splits=n, shuffle=True)

for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])

It's important that you add the _ because skf.split() is used to create stratified folds for K-fold cross-validation, so it returns two lists of indices: train (with (n - 1)/n of the elements) and test (with 1/n of the elements).

Please note that this is as of sklearn 0.18. In sklearn 0.17, the same function can be found in the cross_validation module instead.

Answer by Roko Mijic

A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that dataframe that has two or more values.

For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    # one sub-DataFrame per distinct value of the label column
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    # sample every stratum to the smallest (uspl=True) or largest (uspl=False)
    # stratum size, then shuffle the concatenated result with sample(frac=1)
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)), replace=(not uspl)).copy()
                      for f in datas_l], axis=0).sample(frac=1)

This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.

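A usage sketch on a made-up frame (the column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'target': [0] * 8 + [1] * 2, 'feat': range(10)})

    down = balanced_spl_by(df, 'target', uspl=True)   # 2 rows per class, sampled without replacement
    up = balanced_spl_by(df, 'target', uspl=False)    # 8 rows per class, sampled with replacement
    print(down['target'].value_counts())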

Answer by Bert Kellerman

A slight modification to the top answer by mikkom.

If you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle:

Instead of

    if len(this_xs) > use_elems:
        np.random.shuffle(this_xs)

do this

    if len(this_xs) > use_elems:
        ratio = len(this_xs) // use_elems  # integer division, so the slice step stays an int
        this_xs = this_xs[::ratio]

Answer by eickenberg

There now exists a full-blown Python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn

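For completeness, a minimal sketch of that package's random under-sampler (the API shown matches recent imbalanced-learn releases; older versions named the method fit_sample):

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler

    X = np.random.randn(100, 3)
    y = np.array([0] * 90 + [1] * 10)  # hypothetical 90/10 imbalance

    rus = RandomUnderSampler(random_state=0)
    X_bal, y_bal = rus.fit_resample(X, y)  # both classes sampled down to 10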