Stratified Sampling in Python Pandas

Notice: This page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/44114463/

Date: 2020-08-19 23:44:09  Source: igfitidea

Stratified Sampling in Pandas

Tags: python, pandas, numpy, scikit-learn

Asked by Wboy

I've looked at the Sklearn stratified sampling docs as well as the pandas docs, and also at "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but they do not address this issue.

I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.

Concrete example:

[image not included in the source]

Thank you! :)

Answered by piRSquared

Use min when passing the number to sample. Consider the dataframe df:

import pandas as pd

df = pd.DataFrame(dict(
        A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
        B=range(10)
    ))

# Sample up to 2 rows per group; groups with fewer than 2 rows are kept whole.
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

   A  B
1  1  1
2  1  2
3  2  3
6  2  6
7  3  7
9  4  9
8  4  8
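The same idea can be written as a small helper that also accepts a seed, so the sample is reproducible. This is only a sketch: sample_up_to_n and its random_state parameter are hypothetical names that do not appear in the original answer, and it loops over the groups and uses pd.concat instead of groupby.apply.

```python
import pandas as pd

def sample_up_to_n(df, col, n, random_state=None):
    """Take at most n rows from each group of `col` (hypothetical helper)."""
    parts = [
        # min() keeps the whole group when it has fewer than n rows.
        group.sample(min(len(group), n), random_state=random_state)
        for _, group in df.groupby(col)
    ]
    return pd.concat(parts)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

out = sample_up_to_n(df, 'A', 2, random_state=0)
# Rows kept per group: two each for A=1, 2 and 4, and the single row for A=3.
print(out['A'].value_counts().sort_index())
```

Which rows are picked depends on the seed, but the number of rows per group does not.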

Answered by Ilya Prokin

Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can just take n_samples rows from each class (as in the previous answer). When the minority class contains fewer than n_samples rows, we take the same number of rows from every class as the minority class has.

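def stratified_sample_df(df, col, n_samples):
    # Cap the per-class sample size at the size of the smallest class.
    n = min(n_samples, df[col].value_counts().min())
    # Draw n rows from every class.
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    # Drop the group-key level that groupby.apply adds to the index.
    df_.index = df_.index.droplevel(0)
    return df_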
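As a rough illustration of the balanced behaviour, the sketch below applies the same capping rule to the small dataframe from the first answer. It assumes pandas 1.1 or later, where GroupBy.sample is available, and uses that in place of groupby.apply. The name balanced_sample and the random_state parameter are hypothetical and not part of the original answer.

```python
import pandas as pd

def balanced_sample(df, col, n_samples, random_state=None):
    """Draw the same number of rows from every class (hypothetical helper)."""
    # Cap the per-class sample size at the size of the smallest class.
    n = min(n_samples, df[col].value_counts().min())
    # GroupBy.sample (assumed available, pandas >= 1.1) draws n rows per group.
    return df.groupby(col).sample(n=n, random_state=random_state)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# Class 3 has a single row, so every class contributes exactly one row,
# even though n_samples=2 was requested.
balanced = balanced_sample(df, 'A', n_samples=2, random_state=0)
print(balanced['A'].value_counts().sort_index())
```

The trade-off is that one very small class shrinks the sample drawn from every other class.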

Answered by irkinosor

The following samples a total of N rows in which each group appears in its original proportion (rounded to the nearest integer), then shuffles the rows and resets the index, using:

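import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
N = 10  # total sample size; an illustrative value, not from the original answer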

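Short and sweet (note that weights='A' treats the values in column A as per-row sampling weights, which is not the same as preserving group proportions):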

df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
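To see what the weights argument does here, the sketch below compares each group's share of the rows with its share of the total sampling weight when the values of column A are used as weights. It draws no sample and only computes the two sets of shares. The variable names row_share and weight_share are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

# Share of the rows that each group holds in the original data.
row_share = df['A'].value_counts(normalize=True).sort_index()

# Share of the total sampling weight each group receives when the
# values of column A are used as per-row weights (weights='A').
weight_share = df.groupby('A')['A'].sum() / df['A'].sum()

print(pd.DataFrame({'row_share': row_share, 'weight_share': weight_share}))
```

Group 4 holds 25% of the rows but 20/45 (about 44%) of the weight, so rows from that group are favoured rather than drawn in proportion to the group's size.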

Long version

df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
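The long version can be unpacked into a helper that makes the per-group quota explicit. This is a sketch under assumptions: proportional_sample and random_state are hypothetical names, and an explicit loop with pd.concat stands in for groupby.apply. Because each quota is rounded independently, the total can differ slightly from N.

```python
import numpy as np
import pandas as pd

def proportional_sample(df, col, N, random_state=None):
    """Sample roughly N rows, keeping group proportions (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(col):
        # Per-group quota: the group's share of N, rounded to the nearest integer.
        k = int(np.rint(N * len(group) / len(df)))
        parts.append(group.sample(k, random_state=random_state))
    # Shuffle the combined rows and renumber the index.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

# N=10: the quotas are 4, 3, 1 and 2 rows, which add up to 10.
print(proportional_sample(df, 'A', N=10, random_state=1)['A'].value_counts().sort_index())

# N=6: the quotas round to 2, 2, 1 and 2, so 7 rows come back instead of 6.
print(len(proportional_sample(df, 'A', N=6, random_state=1)))
```

If an exact total matters, the quotas would need adjusting after rounding, which this sketch does not attempt.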