Original question: http://stackoverflow.com/questions/36390406/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Sample each group after pandas groupby
Asked by gongzhitaao
I know this must have been answered somewhere, but I just could not find it.
Problem: sample each group after a groupby operation.
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})
grouped = df.groupby('b')
# now sample from each group, e.g., I want 30% of each group
Answered by EdChum
Answered by cs95
Sample a fraction of each group
You can use GroupBy.apply with sample. You do not need to use a lambda; apply accepts keyword arguments:
df.groupby('b').apply(pd.DataFrame.sample, frac=.3)
     a  b
b
0 6  7  0
1 0  1  1
If the MultiIndex is not required, you can pass group_keys=False to groupby:
df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, frac=.3)
   a  b
6  7  0
2  3  1
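As a possible alternative to the apply call above, newer pandas versions offer a sample method directly on the GroupBy object. The sketch below assumes pandas 1.1 or later; the random_state value is an arbitrary choice added only to make the draw repeatable and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Assumption: pandas >= 1.1, where GroupBy.sample is available.
# frac=0.3 asks for roughly 30% of each group's rows.
# random_state=0 is arbitrary; it only makes the result repeatable.
sampled = df.groupby('b').sample(frac=0.3, random_state=0)

# Count how many rows were drawn from each group.
counts = sampled.groupby('b').size()

print(sampled)
print(counts)
```

The result keeps the original row index, so no group_keys=False is needed here.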
Sample N rows from each group
apply is slow. If you only need a fixed number of rows per group, you can shuffle the DataFrame first and then use GroupBy.head.
df.sample(frac=1).groupby('b').head(2)
   a  b
2  3  1
5  6  0
1  2  1
4  5  0
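The shuffle-then-head approach can be sketched step by step as follows. The names N, shuffled, top_n and capped are illustrative, and random_state=0 is an arbitrary seed added for repeatability. The last step shows that head returns whatever rows exist when a group has fewer than N rows.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

N = 2  # illustrative: number of rows wanted per group

# Step 1: shuffle all rows (frac=1 returns every row in random order).
shuffled = df.sample(frac=1, random_state=0)

# Step 2: take the first N rows of each group in the shuffled order.
top_n = shuffled.groupby('b').head(N)

# If N exceeds a group's size, head returns all rows of that group.
capped = shuffled.groupby('b').head(10)

print(top_n)
print(len(capped))
```

Calling top_n.sort_index() afterwards restores the original row order if that matters.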
This is equivalent to df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=N) (apart from row order), but faster:
%timeit df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2)
# 3.19 ms ± 90.5 μs
%timeit df.sample(frac=1).groupby('b').head(2)
# 1.56 ms ± 103 μs
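Outside IPython, a rough version of this comparison can be run with the standard timeit module. This is only a sketch: the helper names and the run count are made up for illustration, the absolute numbers depend on the machine and pandas version, and newer pandas versions may emit a deprecation warning when apply operates on the grouping column.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})


def sample_with_apply():
    # Per-group sampling through apply (the slower variant in the answer).
    return df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, n=2)


def sample_with_head():
    # Shuffle once, then take the first 2 rows of each group.
    return df.sample(frac=1).groupby('b').head(2)


runs = 50  # illustrative repeat count
t_apply = timeit.timeit(sample_with_apply, number=runs) / runs
t_head = timeit.timeit(sample_with_head, number=runs) / runs

print(f"apply: {t_apply * 1e3:.2f} ms per call")
print(f"head:  {t_head * 1e3:.2f} ms per call")
```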