Python: sample each group after pandas groupby

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/36390406/

Time: 2020-08-19 17:48:57  Source: igfitidea

Sample each group after pandas groupby

Tags: python, pandas, random, group-by, pandas-groupby

Asked by gongzhitaao

I know this must have been answered somewhere, but I just could not find it.

Problem: Sample each group after groupby operation.

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')

# now sample from each group, e.g., I want 30% of each group

Answered by EdChum

Apply a lambda and call sample with the frac parameter:

In [2]:
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})
grouped = df.groupby('b')
grouped.apply(lambda x: x.sample(frac=0.3))

Out[2]:
     a  b
b        
0 6  7  0
1 2  3  1
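
To see how frac translates into row counts, here is a minimal sketch of the same call with a fixed seed. The random_state=0 argument is not part of the original answer; it is added here only so the draw is repeatable. With frac=0.3, pandas rounds 0.3 * 4 and 0.3 * 3 to one row each.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# random_state=0 is an arbitrary seed (an assumption, not in the answer)
sampled = df.groupby('b').apply(lambda x: x.sample(frac=0.3, random_state=0))

# Count sampled rows per group using the first index level (the group key)
counts = sampled.groupby(level=0).size().to_dict()
print(counts)  # {0: 1, 1: 1}
```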

Answered by cs95

Sample a fraction of each group

You can use GroupBy.apply with sample. You do not need a lambda; apply accepts keyword arguments and forwards them to the function:

df.groupby('b').apply(pd.DataFrame.sample, frac=.3)
     a  b
b        
0 6  7  0
1 0  1  1
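
As a quick check that the keyword form and the lambda form are interchangeable, the sketch below runs both with the same seed and compares the results. The random_state=0 argument is an assumption added for the comparison; without a seed the two draws would generally differ.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

grouped = df.groupby('b')

# Extra keyword arguments given to apply are passed on to the function,
# so both calls invoke sample(frac=.3, random_state=0) on every group.
via_kwargs = grouped.apply(pd.DataFrame.sample, frac=.3, random_state=0)
via_lambda = grouped.apply(lambda x: x.sample(frac=.3, random_state=0))

print(via_kwargs.equals(via_lambda))  # True
```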

If the MultiIndex is not required, you may pass group_keys=False to groupby:

df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, frac=.3)

   a  b
6  7  0
2  3  1
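
The effect of group_keys is easiest to see on the index itself. This small sketch compares the two variants; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

with_keys = df.groupby('b').apply(pd.DataFrame.sample, frac=.3)
without_keys = df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, frac=.3)

print(with_keys.index.nlevels)     # 2: group key plus original row label
print(without_keys.index.nlevels)  # 1: original row label only

# The flat index still holds the original labels, so the sampled rows
# can be looked up directly in df.
print(df.loc[without_keys.index])
```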


Sample N rows from each group

apply is slow. If your use case is to sample a fixed number of rows, you can shuffle the DataFrame beforehand, then use GroupBy.head.

df.sample(frac=1).groupby('b').head(2)

   a  b
2  3  1
5  6  0
1  2  1
4  5  0
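
The one-liner above can be split into its two steps to make the mechanism visible. This is a sketch with an arbitrary seed (random_state=42, added here for repeatability) and a hypothetical name N for the per-group row count.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

N = 2  # rows wanted per group

# Step 1: frac=1 returns every row, in random order
shuffled = df.sample(frac=1, random_state=42)

# Step 2: head(N) keeps the first N rows of each group in that shuffled order
picked = shuffled.groupby('b').head(N)

# Each value of b should appear N times
print(picked['b'].value_counts().to_dict())
```

Because head only filters rows, the result keeps the original flat index and the shuffled row order, which is why the groups appear interleaved in the output.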

This is the same as df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=N), but faster:

%timeit df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2)  # 3.19 ms ± 90.5 μs
%timeit df.sample(frac=1).groupby('b').head(2)                              # 1.56 ms ± 103 μs
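
One caveat on the equivalence: the two approaches seem to differ when a group has fewer than N rows. The sketch below uses a made-up frame (small, not from the original answer) in which one group has a single row. head returns whatever rows exist, while sample without replacement raises an error.

```python
import pandas as pd

# Hypothetical data: group b=1 has one row, fewer than the 2 requested
small = pd.DataFrame({'a': [1, 2, 3, 4],
                      'b': [1, 0, 0, 0]})

# shuffle + head: the short group simply contributes the one row it has
picked = small.sample(frac=1, random_state=0).groupby('b').head(2)
print(len(picked))  # 3: two rows from b=0, one from b=1

# apply + sample: drawing 2 rows without replacement from 1 row fails
try:
    small.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2)
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```

If every group is guaranteed to have at least N rows, this difference does not arise and the faster shuffle-then-head form is the simpler choice.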