Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/41035187/

Date: 2020-09-14 02:35:25  Source: igfitidea

Stratified samples from Pandas

python pandas

Asked by HonzaB

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client ID, a few categorical attributes, and Y, which is the probability of an event and takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group of Y (so 10 folds).

I often use this to take a stratified sample when splitting into train/test:

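# sklearn.model_selection API (scikit-learn >= 0.18)
from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # one stratified split; size is the fraction (float) or number (int) of test rows
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)

    # split() yields positional indices, hence .iloc
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test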

But I don't know how to modify it in this case.

Accepted answer by elelias

I'm not totally sure whether you mean this:

import numpy as np

strats = []
for k in range(11):
    y_val = k * 0.1
    # np.isclose avoids float pitfalls such as 3 * 0.1 != 0.3
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append(dummy_df.sample(200))

That makes a dummy dataframe consisting of only the rows with the Y value you want, and then takes a sample of 200.
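
One caveat with the loop above: sampling 200 rows raises a ValueError when a Y group holds fewer than 200 rows. Here is a minimal sketch of one way around that, assuming you are happy to take the whole group in that case. The helper name `sample_up_to` and the toy data are made up for illustration:

```python
import numpy as np
import pandas as pd

def sample_up_to(group, n, random_state=None):
    """Sample n rows without replacement, or the whole group if it is smaller."""
    return group.sample(n=min(n, len(group)), random_state=random_state)

# Toy data (hypothetical): 1000 rows, Y on the grid 0.0, 0.1, ..., 1.0
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "X1": rng.integers(1, 3, size=1000),
    "Y": rng.integers(0, 11, size=1000) / 10,
})

strats = []
for k in range(11):
    # np.isclose sidesteps float comparisons such as 3 * 0.1 == 0.3 being False
    group = toy_df[np.isclose(toy_df["Y"], k * 0.1)]
    strats.append(sample_up_to(group, 200, random_state=0))

sample = pd.concat(strats)
```

Whether capping at the group size is acceptable depends on your use case; sampling with replacement (`replace=True`) would be another option if every chunk must have exactly 200 rows.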

OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

# min_x, max_x and nbins are yours to choose; nbins + 1 edges give nbins bins
hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins.

Now the strategy is to draw a certain number of rows depending on what their value of X1 is. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
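
Because int() truncates, the counts in draws_in_bin can add up to slightly fewer than 200. Here is a hedged sketch of one common fix, largest-remainder rounding, which hands the leftover draws to the bins with the biggest fractional parts. The function name `allocate_draws` is hypothetical and the histogram counts are made up:

```python
import numpy as np

def allocate_draws(hist, n_total):
    """Split n_total draws across bins in proportion to hist, summing exactly to n_total."""
    hist = np.asarray(hist, dtype=float)
    exact = hist / hist.sum() * n_total          # ideal, possibly fractional, draws
    draws = np.floor(exact).astype(int)          # same truncation as int(i * 200)
    leftover = n_total - draws.sum()             # draws lost to truncation
    # give one extra draw to the bins with the largest fractional parts
    order = np.argsort(exact - draws)[::-1]
    draws[order[:leftover]] += 1
    return draws.tolist()

draws_in_bin = allocate_draws([3, 3, 1], 200)
```

With plain truncation the same counts would give 85 + 85 + 28 = 198 draws; the two leftover draws go to the first two bins here.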

Now we know how many observations to draw from every bin:

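strats = []
for k in range(11):
    y_val = k * 0.1

    # get a dataframe for every value of Y
    # (np.isclose avoids float pitfalls such as 3 * 0.1 != 0.3)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]

    bin_strat = []
    for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):
        # bins are half-open [left, right), as in np.histogram;
        # only the last bin also includes its right edge
        in_bin = dummy_df['X1'] >= left_edge
        if right_edge == edges[-1]:
            in_bin &= dummy_df['X1'] <= right_edge
        else:
            in_bin &= dummy_df['X1'] < right_edge
        bin_df = dummy_df[in_bin]

        # this takes the right number of draws out
        # of the X1 bin where we currently are
        # (it raises a ValueError if the bin holds fewer than n_draws rows)
        # Note that every element of bin_strat is a dataframe
        # with a number of entries that corresponds to the
        # structure of draws_in_bin
        bin_strat.append(bin_df.sample(n_draws))

    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))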
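
A quick way to check that a stratified sample kept the structure of X1 is to histogram the sample over the same edges and compare the relative frequencies with those of the full data. Here is a minimal sketch on made-up data, where `full` is a hypothetical stand-in for your_df['X1'] and `sample` for one sampled chunk:

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=10_000)                  # stand-in for your_df['X1']
edges = np.linspace(full.min(), full.max(), 6)  # 6 edges -> 5 bins

hist, _ = np.histogram(full, bins=edges)
rel = hist / hist.sum()

# Draw from each bin in proportion to its share, as in the loop above
parts = []
for left, right, share in zip(edges[:-1], edges[1:], rel):
    last = right == edges[-1]
    in_bin = (full >= left) & ((full <= right) if last else (full < right))
    n_draws = min(int(share * 200), in_bin.sum())
    parts.append(rng.choice(full[in_bin], size=n_draws, replace=False))
sample = np.concatenate(parts)

# Relative frequencies of the sample over the same bins
sample_hist, _ = np.histogram(sample, bins=edges)
sample_rel = sample_hist / sample_hist.sum()
max_gap = np.abs(sample_rel - rel).max()        # small when the structure is preserved
```

The remaining gap comes from truncating the per-bin draw counts to integers, so it shrinks as the sample size grows.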

Answer by Quickbeam2k1

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

or

df.groupby('Y').apply(lambda x: x.sample(frac=.1))

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
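
As a rough sketch of the binning idea, assume a numeric column that is first cut into bins with pd.cut and then used as a second grouping key next to Y. The column name `X1_bin`, the number of bins, and the toy data are all made up for illustration. The sketch uses `groupby(...).sample`, which needs pandas 1.1 or later and does the same job as the `apply(lambda x: x.sample(...))` form:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=5000),
    "Y": rng.integers(0, 11, size=5000) / 10,
})

# Construct a binned version of X1 (5 equal-width bins, chosen arbitrarily)
df["X1_bin"] = pd.cut(df["X1"], bins=5)

# Take 10% of every (Y, X1_bin) stratum; observed=True skips empty strata
sample = df.groupby(["Y", "X1_bin"], observed=True).sample(frac=0.1, random_state=0)
```

Each stratum is sampled independently, so the sample keeps roughly the joint distribution of Y and the binned X1.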

However, if a group is too small relative to the proportion, for example a group of size 1 with a proportion of .25, then no item will be returned for it. This is because the fractional count is converted to an integer number of rows, and in Python int(0.25) == 0.
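
To make the small-group effect concrete, here is a minimal sketch on toy data, together with one possible guard that always keeps at least one row per group. The helper name `sample_at_least_one` is hypothetical, and `groupby(...).sample` assumes pandas 1.1 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "Y": [0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    "X1": range(9),
})

def sample_at_least_one(group, frac, random_state=None):
    """Sample a fraction of the group, but never fewer than one row."""
    n = max(1, round(len(group) * frac))
    return group.sample(n=n, random_state=random_state)

# Plain frac sampling: the single-row group Y == 0.1 contributes nothing
plain = df.groupby("Y").sample(frac=0.25, random_state=0)

# Guarded sampling: every group contributes at least one row
parts = [sample_at_least_one(g, 0.25, random_state=0) for _, g in df.groupby("Y")]
guarded = pd.concat(parts)
```

Note that the guard over-represents tiny groups relative to their true share, so it trades exact proportions for coverage.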
