Stratified Sampling in Python Pandas

Question

Asked by Wboy

I've looked at the Sklearn stratified sampling docs as well as the pandas docs, and also at "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but they do not address this issue.

I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.

Concrete example:

Thank you! :)

Answer 1

Answered by piRSquared

Use min when passing the number to sample. Consider the dataframe df:

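import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# Sample up to 2 rows per group; min() lets groups with fewer than 2 rows keep all their rows
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))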

   A  B
1  1  1
2  1  2
3  2  3
6  2  6
7  3  7
9  4  9
8  4  8
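
As a possible alternative to the groupby/apply call above, here is a minimal sketch that shuffles the whole frame once and then keeps the first n rows of each group. It relies on GroupBy.head returning every row of a group that has fewer than n rows. The names n, sampled and counts, and the seed random_state=0, are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10),
))

n = 2  # maximum number of rows to keep per group (illustrative value)

# Shuffle all rows once, then keep the first n rows of each group.
# head(n) returns the whole group when it has fewer than n rows.
sampled = df.sample(frac=1, random_state=0).groupby('A').head(n)

# Rows kept per group: 2 for A == 1, 2 and 4; 1 for A == 3 (its only row)
counts = sampled['A'].value_counts().sort_index()
print(counts)
```

Because the shuffle happens before grouping, no per-group Python function is called, which may help on frames with many groups.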

Answer 2

Answered by Ilya Prokin

Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can just take n_samples rows from each class (as in the previous answer). When the minority class contains fewer than n_samples rows, we take as many rows from every class as the minority class has.

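def stratified_sample_df(df, col, n_samples):
    # Cap the per-class sample size at the size of the smallest class
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    # Drop the group-key level that groupby.apply adds to the index
    df_.index = df_.index.droplevel(0)
    return df_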
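
To illustrate the balancing rule, here is a hedged sketch of the same idea written with GroupBy.sample instead of apply. It assumes a pandas version that provides GroupBy.sample (1.1 or later, as far as I know). The helper name balanced_sample and its random_state parameter are hypothetical and do not come from the answer above.

```python
import pandas as pd

def balanced_sample(df, col, n_samples, random_state=None):
    # Hypothetical helper: cap the per-class size at the smallest class
    n = int(min(n_samples, df[col].value_counts().min()))
    # Draw n rows from every class; the original row index is kept
    return df.groupby(col).sample(n=n, random_state=random_state)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10),
))

# The smallest class (A == 3) has one row, so every class contributes one row
balanced = balanced_sample(df, 'A', n_samples=2, random_state=0)
print(balanced)
```

Note the trade-off: a single very small class shrinks the sample drawn from every other class, unlike the first answer, which only caps each group at its own size.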

Answer 3

Answered by irkinosor

The following samples a total of about N rows, with each group appearing in its original proportion (rounded to the nearest integer), then shuffles the result and resets the index. Using this dataframe:

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

Short and sweet:

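# Caution: weights='A' uses the values in column A as sampling weights, so rows with a larger A are more likely to be drawn; this does not stratify by group size
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)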

Long version

import numpy as np  # needed for np.rint
df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
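
The proportional allocation in the long version can be sketched step by step as below. This is only an illustration under stated assumptions: the function name proportional_sample, the parameter names total and random_state, and the value total=10 are made up for the example. Because each group's share is rounded separately, the result is not guaranteed to contain exactly the requested number of rows in general.

```python
import numpy as np
import pandas as pd

def proportional_sample(df, col, total, random_state=None):
    # Hypothetical helper: sample each group in proportion to its size
    rng = np.random.RandomState(random_state)
    parts = []
    for _, group in df.groupby(col):
        # Target size for this group, rounded to the nearest integer
        k = int(np.rint(total * len(group) / len(df)))
        parts.append(group.sample(n=k, random_state=rng))
    # Shuffle the combined rows and reset the index
    return pd.concat(parts).sample(frac=1, random_state=rng).reset_index(drop=True)

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20),
))

# Group sizes are 7, 6, 2 and 5, so with total=10 the targets are 3.5, 3, 1 and 2.5.
# np.rint rounds halves to the nearest even integer, giving 4, 3, 1 and 2 rows.
result = proportional_sample(df, 'A', total=10, random_state=0)
print(result['A'].value_counts().sort_index())
```

Here the rounded counts happen to add up to 10, but with other group sizes the total can be off by a row or two.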
