pandas: sampling a DataFrame based on a given distribution

Question

Asked by stackit

How can I sample a pandas DataFrame or GraphLab SFrame according to a given class-label distribution? For example, I want to sample a data frame that has a label/class column so that each class label is selected equally often, giving the class labels a roughly uniform distribution. Better still would be to draw samples according to whatever class distribution we want.

+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+

Rows should be extracted from the first dataframe according to the frequency distribution in the second, where the nostoextract column gives the number of rows to extract per class. The result is a sampled frame in which each class appears at most 2 times. If there are not enough rows of a class to meet the required count, the sampling should take what is available and continue. The resulting dataframe is to be used for a decision-tree-based classifier.

As a commenter put it: the sampled dataframe has to contain nostoextract different instances of the corresponding class, unless there are not enough examples of a given class, in which case you just take all the available ones.

Answer 1

Answered by Thomas Kimber

Can you split your first dataframe into class-specific sub-dataframes, and then sample at will from those?

i.e.

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....

Then, once you've created dfa, dfb and dfc by filtering, take as many rows as you need from the top (if the dataframes aren't sorted in any particular way):

 dfasamplefive = dfa[:5]

Or use the sample method as described by a previous commenter to directly take a random sample:

dfasamplefive = dfa.sample(n=5)

If that suits your needs, all that's left is to automate the process, reading the number of rows to sample for each class from your second (control) dataframe.
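One way that automation might look is sketched below. The function name `sample_by_control`, the `random_state` parameter, and the column name `col2` are my own choices for illustration and are not from the question. The sketch assumes the control frame has the `class` and `nostoextract` columns shown in the question, and it takes every available row when a class has fewer rows than requested.

```python
import pandas as pd

# Example data mirroring the question's tables.
df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

control = pd.DataFrame({'class': ['A', 'B', 'C'],
                        'nostoextract': [2, 2, 2]})


def sample_by_control(df, control, random_state=None):
    """Sample up to `nostoextract` rows per class listed in `control`."""
    parts = []
    for cls, n_wanted in zip(control['class'], control['nostoextract']):
        subset = df[df['class'] == cls]
        # Never ask for more rows than the class actually has.
        n = min(len(subset), int(n_wanted))
        if n == 0:
            continue  # class missing from df, or zero rows requested
        parts.append(subset.sample(n=n, random_state=random_state))
    # Return an empty frame with the same columns if nothing was sampled.
    return pd.concat(parts) if parts else df.iloc[0:0]


sampled = sample_by_control(df, control, random_state=0)
print(sampled)
```

Classes that appear in `df` but not in `control` are simply left out of the result.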

Answer 2

Answered by swenzel

I think this will solve your problem:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'clol2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))

Prints:

  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5

If you don't want the result to be ordered by class, you can permute it at the end.
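A minimal sketch of that permutation step, assuming the sampled frame looks like the printed output above: `DataFrame.sample(frac=1)` returns all rows in random order. The `random_state` value is only there to make the example repeatable.

```python
import pandas as pd

# Stand-in for the frame returned by bootstrap(data, freq) above.
samples = pd.DataFrame({'class': ['A', 'A', 'B', 'B', 'C', 'C'],
                        'clol2': [45, 1, 66, 432, 6, 6],
                        'cols1': [4, 321, 5, 32, 4, 5]},
                       index=[0, 4, 1, 5, 3, 2])

# frac=1 draws every row exactly once, i.e. a random permutation.
shuffled = samples.sample(frac=1, random_state=42)
print(shuffled)
```

The original row labels are kept; call `shuffled.reset_index(drop=True)` if you would rather have a fresh 0..n-1 index.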

Answer 3

Answered by papayawarrior

Here's a solution for SFrames. It's not exactly what you want, because it samples points randomly, so the result doesn't necessarily have precisely the number of rows you specify. An exact method would probably shuffle the data randomly and then take the first k rows for a given class, but this gets you pretty darn close.

import random
import graphlab as gl

## Construct data.
sf = gl.SFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                'col2': [45, 66, 6, 6, 1, 432, 3],
                'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = gl.SFrame({'class': ['A', 'B', 'C'],
                  'number': [3, 1, 0]})

## Count how many instances of each class and compute a sampling
#  probability.
grp = sf.groupby('class', gl.aggregate.COUNT)
freq = freq.join(grp, on ='class', how='left')
freq['prob'] = freq.apply(lambda x: float(x['number']) / x['Count'])

## Join the sampling probability back to the original data.
sf = sf.join(freq[['class', 'prob']], on='class', how='left')

## Sample the original data, then subset.
sf['sample_mask'] = sf.apply(lambda x: 1 if random.random() <= x['prob'] 
                             else 0)
sf2 = sf[sf['sample_mask'] == 1]

In my sample run, I happened to get the exact number of samples I specified, but again, this is not guaranteed with this solution.

>>> sf2
+-------+------+------+
| class | col1 | col2 |
+-------+------+------+
|   A   |  4   |  45  |
|   A   | 321  |  1   |
|   B   |  32  | 432  |
+-------+------+------+
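
For comparison, the exact method mentioned in this answer (shuffle, then keep the first k rows of each class) might be sketched as follows. This sketch uses pandas rather than SFrame, and the names `wanted`, `rank` and `limit` are hypothetical. It assumes the per-class counts are given as a plain dict and treats any class missing from it as a request for zero rows.

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Requested number of rows per class (same numbers as `freq` above).
wanted = {'A': 3, 'B': 1, 'C': 0}

# Shuffle all rows, then number the rows within each class: 0, 1, 2, ...
shuffled = df.sample(frac=1, random_state=0)
rank = shuffled.groupby('class').cumcount()

# Keep a row only if its within-class position is below the requested count.
limit = shuffled['class'].map(wanted).fillna(0)
exact = shuffled[rank < limit]
print(exact)
```

Class A only has two rows, so asking for three yields two. Every other class gets exactly the requested count.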
