Resampling in scikit-learn and/or pandas

Question

Asked by user1507844

Is there a built in function in either Pandas or Scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.

For example, if my data has 75% men and 25% women, but I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50)

What I need is something that resamples my data according to specified proportions.

Answer 1

Answered by Alexander Bauer

Stratified sampling means that the class distribution is preserved. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit, as long as you have a categorical variable whose distribution you want to keep the same in each fold. Just pass that variable in place of the target variable. For example, if the categorical variable is in column i:

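skf = StratifiedKFold(n_splits=3)   # from sklearn.model_selection, which replaced the old cross_validation module
splits = skf.split(X, X[:, i])      # stratify on column i rather than on the target y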

However, if I understand you correctly, you want to resample to a specific target distribution (e.g. 50/50) of one of the categorical features. I think you would have to write your own method for that: split the dataset by the variable's value, then draw the same number of random samples from each split. If your main motivation is to balance the training set for a classifier, one trick is to adjust the sample weights instead. You can set the weights so that they balance the training set according to the desired variable:

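from sklearn import svm
from sklearn.utils.class_weight import compute_sample_weight
# balance_weights is gone from later scikit-learn releases; 'balanced' weights play the same role
sample_weights = compute_sample_weight('balanced', X[:, i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)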

For a non-uniform target distribution, you would have to adjust the sample_weights accordingly.
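
As a rough sketch of that adjustment, each row can be weighted by the ratio of the desired share of its category to the observed share. The helper name weights_for_target and the 75/25 example data are assumptions made for illustration; this is not a scikit-learn function.

```python
import numpy as np
import pandas as pd

def weights_for_target(groups, target):
    """Per-row weights so the weighted share of each category matches `target`.

    `groups` is the categorical column; `target` maps each category to its
    desired proportion. Hypothetical helper, not part of scikit-learn.
    """
    groups = pd.Series(groups)
    observed = groups.value_counts(normalize=True)  # current share of each category
    ratio = {g: target[g] / observed[g] for g in observed.index}
    return groups.map(ratio).to_numpy()

# Assumed example data: 75% men, 25% women
sex = np.array(["m"] * 75 + ["f"] * 25)
w = weights_for_target(sex, {"m": 0.5, "f": 0.5})

# Weighted share of each category after reweighting
share = pd.Series(w).groupby(sex).sum() / w.sum()
```

The resulting array can be passed as sample_weight to estimators whose fit method accepts it.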

Answer 2

Answered by user1507844

My stab at a function to do what I want is below. Hope this is helpful to someone else.

X and y are assumed to be a Pandas DataFrame and Series respectively.

import numpy as np
import pandas as pd

def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):

    # If sample_type is 'min' or 'max', sample_size is a float multiplier of the
    # smallest / largest class count. Otherwise ('abs' or not set), sample_size
    # is an absolute per-class row count and should be an int.
    if sample_type == 'min':
        sample_size_ = np.round(sample_size * y.value_counts().min()).astype(int)
    elif sample_type == 'max':
        sample_size_ = np.round(sample_size * y.value_counts().max()).astype(int)
    else:
        sample_size_ = max(int(sample_size), 1)

    if seed is not None:
        np.random.seed(seed)

    if class_weights is None:
        class_weights = dict()

    parts = []

    for yi in y.unique():
        # Number of rows to draw for this class, scaled by its optional weight
        size = np.round(sample_size_ * class_weights.get(yi, 1.)).astype(int)

        X_yi = X[y == yi]
        # np.random.choice samples with replacement by default
        sample_index = np.random.choice(X_yi.index, size=size)
        parts.append(X_yi.loc[sample_index])

    # DataFrame.append was removed in pandas 2.0, so concatenate the per-class samples
    return pd.concat(parts) if parts else pd.DataFrame()
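
For target proportions given directly (50/50, 30/70, and so on), the same split-then-sample idea can be sketched with DataFrame.sample. The function name resample_to_proportions, its parameters, and the example frame are hypothetical, and sampling is done with replacement so a small group can be drawn more often than it occurs.

```python
import pandas as pd

def resample_to_proportions(df, column, proportions, n, seed=None):
    """Draw about n rows so that `column` follows `proportions`.

    `proportions` maps each category to its desired share of the output.
    Hypothetical helper; rows are sampled with replacement within each group.
    """
    parts = []
    for value, share in proportions.items():
        group = df[df[column] == value]      # split by variable value
        size = int(round(n * share))         # rows wanted from this split
        parts.append(group.sample(n=size, replace=True, random_state=seed))
    return pd.concat(parts)

# Assumed example data: 75 men and 25 women
df = pd.DataFrame({"sex": ["m"] * 75 + ["f"] * 25, "x": range(100)})
balanced = resample_to_proportions(df, "sex", {"m": 0.5, "f": 0.5}, n=100, seed=0)
skewed = resample_to_proportions(df, "sex", {"m": 0.3, "f": 0.7}, n=100, seed=0)
```

Passing replace=False instead would avoid duplicate rows, but it then fails whenever a group has fewer rows than requested.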

Answer 3

Answered by MyopicVisage

If you are open to importing a library, I find the imbalanced-learn library useful for resampling. Here the categorical variable is the target 'y' and the data to resample is 'X'. In the example below, fish are resampled to equal the number of dogs, 3:3.

The code is slightly modified from the imbalanced-learn docs, section 2.1.1, Naive random over-sampling. You can use this method with numeric data and strings.

import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
print('target:\n', y)
X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']], dtype=object)
print('data:\n', X)

print('Original dataset shape {}'.format(Counter(y)))  # 3 dogs (0), 2 fish (1)
print(type(X)); print(X)
print(y)

# Newer imbalanced-learn releases use sampling_strategy (formerly ratio)
# and fit_resample (formerly fit_sample)
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print('Resampled dataset shape {}'.format(Counter(y_res)))  # 3 of each class
print(type(X_res)); print(X_res); print(y_res)
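
To make clear what naive random over-sampling does, here is a minimal NumPy-only sketch of the idea: every class smaller than the largest one gets extra rows drawn at random, with replacement, from its own members. The function random_oversample is a hypothetical illustration, not the imbalanced-learn implementation.

```python
import numpy as np
from collections import Counter

def random_oversample(X, y, seed=None):
    """Duplicate random rows of the smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    counts = Counter(y.tolist())
    n_max = max(counts.values())
    keep = [np.arange(len(y))]                 # start with every original row
    for label, count in counts.items():
        if count < n_max:
            idx = np.flatnonzero(y == label)   # positions of this class
            keep.append(rng.choice(idx, size=n_max - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
X = np.array([["red fish"], ["blue fish"], ["dog"], ["dog"], ["dog"]])
X_res, y_res = random_oversample(X, y, seed=42)
```

The original five rows are kept in order, and one extra fish row is appended so that both classes have three rows.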
