Stratified samples from Pandas

Question

Asked by HonzaB

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client ID, a few categorical attributes, and Y, which is the probability of an event and takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group of Y (so 10 folds).

I often use this to take a stratified sample when splitting into train/test:

from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test

But I don't know how to modify it in this case.

Answer 1

Accepted answer by elelias

I'm not totally sure whether you mean this:

import numpy as np

strats = []
for k in range(11):
    y_val = k * 0.1
    # np.isclose avoids exact float comparison (3 * 0.1 != 0.3)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append(dummy_df.sample(200))

That builds a temporary dataframe containing only the rows with the Y value you want, and then takes a sample of 200 from it.
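
One thing to watch when filtering on Y inside such a loop: k * 0.1 is floating-point arithmetic, so 3 * 0.1 is not exactly equal to 0.3 and an == filter can silently come back empty. The sketch below, on made-up data, compares with np.isclose instead. The synthetic frame, its column values, and the helper name sample_per_y are assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for your_df: Y takes the values 0.0, 0.1, ..., 1.0
rng = np.random.default_rng(0)
levels = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
your_df = pd.DataFrame({
    "cli_id": np.arange(5500),
    "X1": rng.integers(1, 3, size=5500),
    "Y": np.repeat(levels, 500),
})

def sample_per_y(df, n, seed=0):
    """Draw n rows for each of the 11 Y levels (hypothetical helper)."""
    parts = []
    for k in range(11):
        # np.isclose tolerates the rounding error in k * 0.1
        subset = df[np.isclose(df["Y"], k * 0.1)]
        parts.append(subset.sample(n=n, random_state=seed))
    return pd.concat(parts)

sample = sample_per_y(your_df, 200)
exact_hits = (your_df["Y"] == 3 * 0.1).sum()           # exact comparison
close_hits = np.isclose(your_df["Y"], 3 * 0.1).sum()   # tolerant comparison
```

Passing random_state makes the draw reproducible; drop it if you want a different sample on every run.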

OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

# nbins + 1 edges give nbins bins; min_x and max_x are the extremes of X1
hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins.
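
A detail worth checking here: np.linspace returns bin edges, not bins, so getting nbins bins takes nbins + 1 edge points. A small sketch with made-up X1 values (the array and nbins = 4 are assumptions for illustration):

```python
import numpy as np

# Made-up X1 values; in the answer these would come from your_df['X1']
x1 = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5])
nbins = 4
min_x, max_x = x1.min(), x1.max()

# linspace yields the edges, so nbins bins need nbins + 1 points
edges_in = np.linspace(min_x, max_x, nbins + 1)
hist, edges = np.histogram(x1, bins=edges_in)

n_bins_found = len(hist)     # one count per bin
n_edges_found = len(edges)   # always one more edge than bins
```

np.histogram treats every bin as half-open on the right except the last one, which also includes its right edge, so the maximum of X1 is still counted.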

Now the strategy is to draw a certain number of rows depending on their value of X1. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
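
Because int() truncates, the per-bin counts computed this way can add up to slightly fewer than 200. If the exact total matters, one option is a largest-remainder allocation, sketched below. The helper name allocate_draws and the example histogram are hypothetical, and this is a swapped-in technique rather than what the answer itself does.

```python
import numpy as np

def allocate_draws(rel, total):
    """Split `total` draws across bins in proportion to `rel` (hypothetical helper).

    Uses the largest-remainder method so the counts sum to exactly `total`,
    assuming `rel` sums to 1 up to rounding error.
    """
    exact = np.asarray(rel, dtype=float) * total
    draws = np.floor(exact).astype(int)
    shortfall = total - draws.sum()
    # hand the leftover draws to the bins with the largest fractional parts
    order = np.argsort(exact - draws)[::-1]
    draws[order[:shortfall]] += 1
    return draws.tolist()

hist = [3, 5, 2, 7, 4]                       # made-up bin counts
rel = [float(i) / sum(hist) for i in hist]
truncated = [int(i * 200) for i in rel]      # the answer's approach
balanced = allocate_draws(rel, 200)          # sums to exactly 200
```

Here the truncated counts fall two short of 200, and the two leftover draws go to the bins that lost the most to truncation.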

Now we know how many observations to draw from every bin:

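strats = []
last_bin = len(edges) - 2
for k in range(11):
    y_val = k * 0.1

    # get a dataframe for every value of Y
    # (np.isclose avoids exact float comparison)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]

    bin_strat = []
    for i, (left_edge, right_edge, n_draws) in enumerate(
            zip(edges[:-1], edges[1:], draws_in_bin)):
        # bins are half-open [left, right), except the last one, which
        # also includes its right edge (same convention as np.histogram)
        in_bin = dummy_df['X1'] >= left_edge
        if i == last_bin:
            in_bin &= dummy_df['X1'] <= right_edge
        else:
            in_bin &= dummy_df['X1'] < right_edge
        bin_df = dummy_df[in_bin]

        # this takes the right number of draws out of the X1 bin where
        # we currently are, capped at the bin size because sample()
        # cannot draw more rows than exist without replacement.
        # Note that every element of bin_strat is a dataframe with a
        # number of entries that corresponds to the structure of draws_in_bin
        bin_strat.append(bin_df.sample(min(n_draws, len(bin_df))))

    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))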

Answer 2

Answer by Quickbeam2k1

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

or

df.groupby('Y').apply(lambda x: x.sample(frac=.1))

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
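
As a rough sketch of that idea, the example below bins a numeric X1 into quartiles with pd.qcut and then samples 10% from every (Y, X1_bin) cell. The synthetic data, the column name X1_bin, and the choice of quartiles are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data: Y is the stratification target, X1 a numeric attribute
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=4000),
    "Y": rng.choice([0.1, 0.2, 0.3], size=4000),
})

# Bin X1 into quartiles so it can serve as a second grouping key
df["X1_bin"] = pd.qcut(df["X1"], q=4, labels=False)

# Sample 10% from every (Y, X1_bin) cell and stitch the pieces together
sample = pd.concat([
    cell.sample(frac=0.1, random_state=0)
    for _, cell in df.groupby(["Y", "X1_bin"])
])

# Share of each cell in the population and in the sample
pop_share = df.groupby(["Y", "X1_bin"]).size() / len(df)
sample_share = sample.groupby(["Y", "X1_bin"]).size() / len(sample)
```

Comparing pop_share with sample_share is a quick way to confirm that the sample keeps the joint structure of Y and the binned X1.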

However, if a group is too small relative to the proportion, for example a group of size 1 with proportion 0.25, no item will be returned. This is because the number of rows to draw is converted to an integer, and int(0.25) = 0 in Python.
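
If every group should contribute at least one row, one possible workaround is to compute the per-group count yourself and clamp it at 1. The helper name sample_frac_at_least_one and the toy frame below are hypothetical; this is a sketch, not something the answer prescribes.

```python
import pandas as pd

def sample_frac_at_least_one(df, by, frac, seed=0):
    """Sample a fraction of each group, but never fewer than one row
    (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(by):
        n = max(1, round(frac * len(group)))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

df = pd.DataFrame({
    "Y": [0.1] * 8 + [0.2],   # the 0.2 group has a single row
    "X1": range(9),
})

# Plain fractional sampling: the single-row group yields nothing
plain = pd.concat([g.sample(frac=0.25, random_state=0) for _, g in df.groupby("Y")])

# Clamped version: the single-row group still contributes its row
guarded = sample_frac_at_least_one(df, "Y", 0.25)
```

The trade-off is that very small groups end up over-represented relative to the requested proportion.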
