Random Sample of a subset of a dataframe in Pandas
Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on StackOverflow (not to this page's maintainer).
Original question: http://stackoverflow.com/questions/38085547/
Asked by WGP
Say I have a dataframe with 100,000 entries and want to split it into 100 sections of 1,000 entries each.
How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered such that the first 1,000 rows are the first section, the next 1,000 rows are the second section, and so on.
Many thanks.
Accepted answer by jpjandrade
One solution is to use the choice function from numpy.
Say you want 50 entries out of the first 1,000, you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This of course does not take your block structure into account. If you want a 50-item sample from block i, for example, you can do:
import numpy as np
# i is the zero-based block number (0 to 99)
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
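The block-offset idea above can be put into a small self-contained sketch. The DataFrame contents, the helper name sample_block, and its seed parameter are illustrative assumptions, not part of the original answer. It also uses NumPy's Generator API (np.random.default_rng) in place of the legacy np.random.choice call.

```python
import numpy as np
import pandas as pd


def sample_block(df, i, block_size=1000, n=50, seed=None):
    """Return n rows drawn without replacement from the i-th block (zero-based)."""
    rng = np.random.default_rng(seed)
    # Offsets within the block, each picked at most once.
    offsets = rng.choice(block_size, size=n, replace=False)
    # Shift the offsets so they point into block i of the full frame.
    return df.iloc[block_size * i + offsets]


# Hypothetical data: 100,000 rows, already ordered into 100 blocks of 1,000.
df = pd.DataFrame({"value": np.arange(100_000)})
sample = sample_block(df, i=3, seed=0)
print(len(sample), sample.index.min(), sample.index.max())
```

With i=3, every sampled row should come from positions 3000 through 3999.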
Answer by Andy Hayden
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6
In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2
*Called on one of the section DataFrames.
Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
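As the footnote says, sample is meant to be called on one section rather than on the whole frame. A minimal sketch, assuming the sections are consecutive runs of 1,000 rows; the frame contents and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the 100,000-row frame from the question.
df = pd.DataFrame({"value": np.arange(100_000)})

block_size = 1000
i = 7  # zero-based section number, chosen arbitrarily for illustration

# Slice out section i by position, then draw 50 rows without replacement.
section = df.iloc[block_size * i : block_size * (i + 1)]
picked = section.sample(n=50, random_state=0)
```

Passing random_state makes the draw repeatable; leave it out to get a different sample on each call.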
Answer by GeneralCode
This is a nice place for recursion.
import numpy as np

def main2():
    rows = 8  # say you have 8 rows; with real data use len(df)
    rands = []
    for i in range(rows):
        gen = fun(rands, rows)
        rands.append(gen)
    print(rands)  # now range through random values

def fun(rands, rows):
    # draw a row position; retry recursively if it was already used
    gen = np.random.randint(0, rows)
    if gen in rands:
        return fun(rands, rows)
    return gen

if __name__ == "__main__":
    main2()
Example output (varies between runs): [6, 0, 7, 1, 3, 5, 4, 2]
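The recursive function builds a random ordering of the row positions by retrying until it finds an unused value. As a different technique from the answer's recursion, NumPy's permutation produces such an ordering directly, and the first 50 entries of a permutation form a sample without replacement. The sizes below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

rows = 8
# A random ordering of 0..rows-1 with no repeats and no retries.
order = rng.permutation(rows)
print(order)

# The first 50 entries of a permutation of 0..999 are 50 distinct
# positions inside a 1,000-row block.
block_positions = rng.permutation(1000)[:50]
```

This avoids the retry loop, which slows down as the list of used values fills up.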