
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/46026935/


Sample rows of pandas dataframe in proportion to counts in a column

Tags: python, pandas

Asked by eleanora

I have a large pandas dataframe with about 10,000,000 rows. Each one represents a feature vector. The feature vectors come in natural groups, and the group label is in a column called group_id. I would like to randomly sample, say, 10% of the rows, but in proportion to the counts of each group_id.


For example, if the group_id's are A, B, A, C, A, B, then I would like half of my sampled rows to have group_id A, two sixths to have group_id B and one sixth to have group_id C.


I can see the pandas function sample, but I am not sure how to use it to achieve this goal.


Answered by Vaishali

You can use groupby and sample


sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
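As a quick sanity check (a minimal sketch, not part of the original answer; the group sizes are illustrative), this one-liner preserves each group's share automatically, because 10% is drawn from every group independently:

```python
import numpy as np
import pandas as pd

# Toy frame: 60 rows of group A, 30 of B, 10 of C
df = pd.DataFrame({
    'group_id': np.repeat(['A', 'B', 'C'], (60, 30, 10)),
    'vals': np.random.randn(100),
})

sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))

# Each group contributes 10% of its own rows: {'A': 6, 'B': 3, 'C': 1}
print(sample_df['group_id'].value_counts().to_dict())
```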

Answered by Abdou

This is not as simple as just grouping and using .sample. You need to actually get the fractions first. Since you said that you are looking to grab 10% of the total number of rows in different proportions, you will need to calculate how many rows each group will have to take out of the main dataframe. For instance, if we use the divide you mentioned in the question, then group A will end up with 1/20 as a fraction of the total number of rows, group B will get 1/30 and group C ends up with 1/60. You can put these fractions in a dictionary and then use .groupby and pd.concat to concatenate the sampled rows from each group into a dataframe. You will be using the n parameter of the .sample method instead of the frac parameter.


fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
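If you would rather not hard-code the fractions, they can be derived from each group's share of the data (a sketch under the assumption that you want 10% of the total, split by group share; the frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group_id': np.repeat(['A', 'B', 'C'], (60, 30, 10)),
    'vals': np.random.randn(100),
})

N = len(df)
target = 0.10  # sample 10% of all rows in total

# Each group's share of the data determines its share of the sample
shares = df['group_id'].value_counts(normalize=True)
sample = pd.concat(
    dff.sample(n=int(round(shares[gid] * target * N)))
    for gid, dff in df.groupby('group_id')
)

# 10 rows total, split 6/3/1 across A/B/C
print(sample['group_id'].value_counts().to_dict())
```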

Edit:


This is to highlight the importance of fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth of the sampled rows, regardless of the original group divides.


Starting with equal portions: each group starts with 40 rows


df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
                   'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))

#     group_id      vals
# 12         A -0.175109
# 51         A -1.936231
# 81         A  2.057427
# 111        A  0.851301
# 114        A  0.669910
# 60         A  1.226954
# 73         B -0.166516
# 82         B  0.662789
# 94         B -0.863640
# 31         B  0.188097
# 101        C  1.802802
# 53         C  0.696984


print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        24         A  0.161328
#          21         A -1.399320
#          30         A -0.115725
#          114        A  0.669910
# B        34         B -0.348558
#          7          B -0.855432
#          106        B -1.163899
#          79         B  0.532049
# C        65         C -2.836438
#          95         C  1.701192
#          80         C -0.421549
#          74         C -1.089400

First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).


Second solution: 4 rows for each group (each one third of the sampled rows)




Working with differently sized groups: 40 for A, 60 for B and 20 for C


df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                   'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))

#     group_id      vals
# 29         A  0.306738
# 35         A  1.785479
# 21         A -0.119405
# 4          A  2.579824
# 5          A  1.138887
# 11         A  0.566093
# 80         B  1.207676
# 41         B -0.577513
# 44         B  0.286967
# 77         B  0.402427
# 103        C -1.760442
# 114        C  0.717776

print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  2.579824
#          32         A  0.451882
#          5          A  1.138887
#          17         A -0.614331
# B        47         B -0.308123
#          52         B -1.504321
#          42         B -0.547335
#          84         B -1.398953
#          61         B  1.679014
#          66         B  0.546688
# C        105        C  0.988320
#          107        C  0.698790

First solution: consistent. Second solution: now group B has taken 6 of the sampled rows when it is supposed to take only 4.




Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C


df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
                   'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))

#     group_id      vals
# 48         A  1.214525
# 19         A -0.237562
# 0          A  3.385037
# 11         A  1.948405
# 8          A  0.696629
# 39         A -0.422851
# 62         B  1.669020
# 94         B  0.037814
# 67         B  0.627173
# 93         B  0.696366
# 104        C  0.616140
# 113        C  0.577033

print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  0.284448
#          11         A  1.948405
#          8          A  0.696629
#          0          A  3.385037
#          31         A  0.579405
#          24         A -0.309709
# B        70         B -0.480442
#          69         B -0.317613
#          96         B -0.930522
#          80         B -1.184937
# C        101        C  0.420421
#          106        C  0.058900

This is the only time the second solution offered some consistency (out of sheer luck, I might add).


I hope this proves useful.


Answered by Rakesh Poduval

I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions of the full data.


# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))

# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))

Answered by irkinosor

The following samples a total of N rows, where each group appears in its original proportion rounded to the nearest integer, then shuffles and resets the index:


df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

Short and sweet:


df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
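Note that N must be defined first, and weights='A' draws individual rows with probability proportional to the value in column A, which only tracks group proportions because the group labels here happen to be numeric. A runnable sketch, with N = 10 as an assumed sample size:

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
N = 10  # total rows to sample (assumed value)

# Rows with larger A values are proportionally more likely to be drawn
out = df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
print(out)
```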

Long version


df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
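The long version can be checked end to end (a sketch using the same toy frame; note that np.rint rounds each group's quota separately, so the realized total can occasionally drift from N by a row). With N = 10, the groups of sizes 7, 6, 2 and 5 receive quotas of 4, 3, 1 and 2 rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
N = 10  # assumed total sample size

out = (df.groupby('A', group_keys=False)
         .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
         .sample(frac=1)          # shuffle the concatenated sample
         .reset_index(drop=True))

# Group shares survive rounding: groups 1..4 contribute 4, 3, 1, 2 rows
print(out['A'].value_counts().sort_index().to_dict())
```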