Original source: http://stackoverflow.com/questions/29873224/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Resampling in scikit-learn and/or pandas
Asked by user1507844
Is there a built in function in either Pandas or Scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.
For example, if my data has 75% men and 25% women, but I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50)
What I need is something that resamples my data according to specified proportions.
Answered by Alexander Bauer
Stratified sampling means that the class distribution is preserved. If that is what you are looking for, you can still use StratifiedKFold and StratifiedShuffleSplit, as long as you have a categorical variable whose distribution you want to keep the same in each fold. Just use that variable instead of the target variable. For example, if you have a categorical variable in column i:
skf = cross_validation.StratifiedKFold(X[:,i])
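In later scikit-learn releases the cross_validation module was replaced by model_selection, where the stratification labels are passed to split() rather than to the constructor. A minimal sketch under that assumption follows; the toy data and the column index i are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up): column 0 is a numeric feature, column 1 is a
# categorical feature coded 0/1 with a 75/25 split.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(size=40), np.repeat([0, 1], [30, 10])])
i = 1  # hypothetical index of the categorical column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Pass the categorical column where the target would normally go,
# then record the share of category 1 in every test fold.
labels = X[:, i].astype(int)
fold_shares = []
for train_idx, test_idx in skf.split(X, labels):
    fold_shares.append(labels[test_idx].mean())

print(fold_shares)
```

Each test fold should show the same share of category 1 as the full data set, which is the "preserved distribution" the answer describes.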
However, if I understand you correctly, you want to resample to a certain target distribution (e.g. 50/50) of one of the categorical features. I guess you would have to come up with your own method to get such a sample (split the dataset by variable value, then take the same number of random samples from each split). If your main motivation is to balance the training set for a classifier, one trick is to adjust the sample weights. You can set the weights so that they balance the training set according to the desired variable:
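import sklearn.preprocessing
from sklearn import svm
# balance_weights exists only in older scikit-learn releases; newer ones provide sklearn.utils.class_weight.compute_sample_weight
sample_weights = sklearn.preprocessing.balance_weights(X[:,i])
clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weights)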
For a non-uniform target distribution, you would have to adjust the sample_weights accordingly.
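One possible way to make that adjustment is to weight each row by the ratio of the target share to the observed share of its category. The sketch below assumes that approach; the helper name target_sample_weights and the 60/40 target are hypothetical, not part of scikit-learn.

```python
import numpy as np
import pandas as pd

def target_sample_weights(values, target_shares):
    """Weight each row by target share / observed share of its category.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    values = pd.Series(values)
    observed = values.value_counts(normalize=True)
    ratio = pd.Series(target_shares) / observed  # aligned on category label
    return values.map(ratio).to_numpy()

# Made-up example: 75% men, 25% women, reweighted towards 60/40.
gender = np.array(['m'] * 75 + ['f'] * 25)
w = target_sample_weights(gender, {'m': 0.6, 'f': 0.4})

# Weighted share of each category after reweighting.
shares = pd.Series(w).groupby(gender).sum() / w.sum()
print(shares)
```

The resulting array can be passed as sample_weight to estimators whose fit method accepts it, in the same way as in the snippet above.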
Answered by user1507844
My stab at a function to do what I want is below. Hope this is helpful to someone else.
X and y are assumed to be a Pandas DataFrame and Series respectively.
import numpy as np
import pandas as pd

def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):
    # sample_type 'abs' or None: sample_size is an absolute row count (int).
    # sample_type 'min' or 'max': sample_size is a float multiplier of the
    # smallest or largest class count.
    if sample_type == 'min':
        sample_size_ = np.round(sample_size * y.value_counts().min()).astype(int)
    elif sample_type == 'max':
        sample_size_ = np.round(sample_size * y.value_counts().max()).astype(int)
    else:
        sample_size_ = max(int(sample_size), 1)

    if seed is not None:
        np.random.seed(seed)

    if class_weights is None:
        class_weights = dict()

    # Draw rows with replacement from each class, scaled by its class weight.
    samples = []
    for yi in y.unique():
        size = np.round(sample_size_ * class_weights.get(yi, 1.)).astype(int)
        X_yi = X[y == yi]
        sample_index = np.random.choice(X_yi.index, size=size)
        samples.append(X_yi.reindex(sample_index))
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    return pd.concat(samples)
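For comparison, the same per-class sampling idea can be sketched more compactly with DataFrame.groupby and DataFrame.sample. This is only an illustration: the function name resample_to_counts, the counts dictionary, and the toy data are hypothetical.

```python
import pandas as pd

def resample_to_counts(df, column, counts, seed=None):
    """Draw counts[value] rows, with replacement, from each group of `column`.

    Hypothetical helper for illustration.
    """
    parts = [
        group.sample(n=counts[value], replace=True, random_state=seed)
        for value, group in df.groupby(column)
    ]
    return pd.concat(parts)

# Made-up data: 75 men and 25 women, resampled to 50 of each.
df = pd.DataFrame({'gender': ['m'] * 75 + ['f'] * 25, 'x': range(100)})
balanced = resample_to_counts(df, 'gender', {'m': 50, 'f': 50}, seed=0)
print(balanced['gender'].value_counts())
```

Sampling with replacement lets the minority group be drawn more times than it has rows; unequal counts such as {'m': 60, 'f': 40} cover targets other than 50/50.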
Answered by MyopicVisage
If you are open to importing a library, I find the imbalanced-learn library useful when addressing resampling. Here the categorical variable is the target 'y' and the data to resample is 'X'. In the example below, fish are oversampled to equal the number of dogs, 3:3.
The code is slightly modified from the imbalanced-learn docs: 2.1.1. Naive random over-sampling. You can use this method with numeric data and strings.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
print('target:\n', y)
X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']])
print('data:\n', X)
print('Original dataset shape {}'.format(Counter(y)))  # 2 fish, 3 dogs
print(type(X)); print(X)
print(y)
# Older imbalanced-learn releases spell these ratio='auto' and fit_sample
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))  # 3 fish, 3 dogs
print(type(X_res)); print(X_res); print(y_res)

