Random sample of a subset of a DataFrame in Python Pandas

Question

Asked by WGP

Say I have a DataFrame with 100,000 entries and want to split it into 100 sections of 1,000 entries each.

How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1,000 rows are the first section, the next 1,000 rows are the second section, and so on.

Many thanks.

Answer 1

Accepted answer by jpjandrade

One solution is to use the choice function from numpy.

Say you want 50 entries out of the first 1000 rows; you can use:

import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]

This does not, of course, take your block structure into account. If you want a 50-item sample from block i (counting blocks from 0), for example, you can do:

import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
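
The block-offset idea above can be wrapped in a small helper. This is only a sketch: the function name sample_from_block, its parameters, and the toy "value" column are made up for illustration, and it assumes block i is a full block of block_size rows.

```python
import numpy as np
import pandas as pd


def sample_from_block(df, i, block_size=1000, n=50, seed=None):
    """Return n distinct random rows from the i-th block (0-based) of df.

    Hypothetical helper; assumes block i contains exactly block_size rows.
    """
    rng = np.random.default_rng(seed)
    start = block_size * i
    # Draw n distinct offsets inside the block, then shift them to the block start.
    offsets = rng.choice(block_size, size=n, replace=False)
    return df.iloc[start + offsets]


# Toy data: 100,000 rows whose value equals their position.
df = pd.DataFrame({"value": np.arange(100_000)})
sample = sample_from_block(df, i=3, seed=0)
```

Passing a seed makes the draw reproducible; leaving it as None gives a different sample on each call.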

Answer 2

Answered by Andy Hayden

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*Applied to one of the section DataFrames.

Note: If you request a sample larger than the DataFrame, this raises an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
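
To apply sample to just one section of the 100,000-row frame from the question, one option is to slice the section out with iloc first. The sketch below assumes 0-based section numbering and uses a made-up "value" column and section number; random_state is set only to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's data: 100,000 ordered rows.
df = pd.DataFrame({"value": np.arange(100_000)})

i = 7  # hypothetical section number, counting from 0
block = df.iloc[1000 * i : 1000 * (i + 1)]  # rows 7000..7999

# 50 distinct rows from that section (sampling without replacement by default).
sample = block.sample(n=50, random_state=42)
```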

Answer 3

Answered by GeneralCode

This is a nice place for recursion.

import numpy as np


def main2():
    rows = 8  # example size; with real data use len(df)
    rands = []
    for _ in range(rows):
        # draw a row position in [0, rows) that has not been used yet
        rands.append(fun(rands, rows))
    print(rands)  # a random ordering of the row positions


def fun(rands, rows):
    gen = int(np.random.randint(0, rows))
    if gen in rands:
        # already drawn: retry recursively
        return fun(rands, rows)
    return gen


if __name__ == "__main__":
    main2()

Example output (the order varies between runs): [6, 0, 7, 1, 3, 5, 4, 2]
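
The recursive draw produces a random ordering of the row positions with no repeats. As a non-recursive sketch of the same result (a swapped-in technique, not this answer's method), NumPy can generate a permutation directly; the variable names and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for repeatability

rows = 8
# A random ordering of 0..rows-1, each value appearing exactly once.
rands = rng.permutation(rows)

# The first k entries form a random sample of k distinct row positions.
first_three = rands[:3]
```

This avoids repeated retries once most values have already been drawn.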
