Original source: http://stackoverflow.com/questions/32340604/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Python Pandas Choosing Random Sample of Groups from Groupby
Asked by sfortney
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements, is:
rand = random.sample(data, N)
If you attempt the above where data is a 'grouped' object, the elements of the resulting list are tuples for some reason.
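A minimal sketch of where the tuples come from, assuming a small made-up frame (the data and the names `pairs`, `key`, and `group` are illustrative, not from the question): iterating over a pandas GroupBy yields (key, sub-DataFrame) pairs, so sampling from it returns those pairs.

```python
import random
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'some_key': [0, 1, 2, 3, 0, 1, 2, 3],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5]})
grouped = df.groupby('some_key')

# Iterating a GroupBy yields (key, sub-DataFrame) pairs, so a sample
# drawn from list(grouped) is a list of 2-tuples.
pairs = random.sample(list(grouped), 2)
for key, group in pairs:
    print(key, group.shape)
```

Unpacking each pair as `key, group` gives access to the group's DataFrame directly.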
I found the example below for randomly selecting the elements of a single-key groupby; however, it does not work with a multi-key groupby. From "How to access pandas groupby dataframe by key":
# create groupby object
grouped = df.groupby('some_key')
# pick N dataframes and grab their indices
sampled_df_i = random.sample(grouped.indices, N)
# grab the groups using the groupby object 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
Answered by CT Zhu
You can take a random sample of the unique values from df.some_key.unique(), use that to slice df, and finally groupby on the result:
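In [337]:
import random
import pandas as pd

df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
                   'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
# Python 3: print is a function, and random.sample needs a sequence rather than a NumPy array
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
               val
some_key
0         1.000000
2         3.666667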
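The same idea can be wrapped in a small function. This is only a sketch: `sample_groups` and its `seed` parameter are hypothetical names introduced here for illustration, not part of pandas or of the original answer.

```python
import random
import pandas as pd

def sample_groups(df, key, n, seed=None):
    """Return the rows belonging to n randomly chosen values of df[key].

    Hypothetical helper; `seed` only makes the draw repeatable.
    """
    rng = random.Random(seed)
    keys = rng.sample(list(df[key].unique()), n)
    return df[df[key].isin(keys)]

df = pd.DataFrame({'some_key': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})
sampled = sample_groups(df, 'some_key', 2, seed=0)
print(sampled.groupby('some_key').mean())
```

Because the function returns the filtered rows rather than an aggregate, any groupby operation can be applied to the result afterwards.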
If there is more than one groupby key:
In [358]:
df = pd.DataFrame({'some_key1': [0,1,2,3,0,1,2,3,0,1,2,3],
                   'some_key2': [0,0,0,0,1,1,1,1,2,2,2,2],
                   'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
# .ix was removed in pandas 1.0, so use .loc; the dict keys are wrapped in list() for random.sample
print(gby.mean().loc[random.sample(list(gby.indices.keys()), 2)])
                     val
some_key1 some_key2
1         1            5
3         2            8
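If the rows of each sampled group are wanted rather than an aggregate, the `get_group` approach from the question can be adapted to several keys. A sketch, assuming a recent pandas where multi-key group keys are tuples (the names `keys` and `sampled` are illustrative):

```python
import random
import pandas as pd

df = pd.DataFrame({'some_key1': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   'some_key2': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})
gby = df.groupby(['some_key1', 'some_key2'])

# With several keys, each group key is a tuple such as (1, 1).
keys = random.sample(list(gby.groups), 2)

# Fetch each sampled group and stack them back into one DataFrame.
sampled = pd.concat([gby.get_group(k) for k in keys])
print(sampled)
```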
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
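In [372]:
# sorted() turns the set into a list, which random.sample requires in Python 3.11+
idx = random.sample(sorted(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())),
                    2)
# .ix was removed in pandas 1.0, so use .loc
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
                     val
some_key1 some_key2
2         0            3
3         1            5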
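Another possible variant, sketched here as an alternative rather than as the answer's method, samples integer group ids with `GroupBy.ngroup()`. It only ever draws key combinations that actually occur in the data, whereas a product of the key columns could in general include combinations that do not.

```python
import random
import pandas as pd

df = pd.DataFrame({'some_key1': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   'some_key2': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})
gby = df.groupby(['some_key1', 'some_key2'])

# ngroup() labels every row with the integer id of its group (0 .. ngroups-1).
group_ids = gby.ngroup()

# Draw two distinct group ids and keep the rows that belong to them.
chosen = random.sample(range(gby.ngroups), 2)
sampled = df[group_ids.isin(chosen)]
print(sampled)
```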

