pandas: How do I split a dataframe into multiple dataframes where each dataframe contains equal but random data
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/44031697/
Asked by Anil K
How do I split a dataframe into multiple dataframes where each dataframe contains equal but random data? It is not based on a specific column.
For instance, I have a dataframe with 100 rows and 30 columns. I want to divide this data into 5 lots. Each resulting dataframe should have 20 records with the same 30 columns, there should be no duplication across the 5 lots, and the rows should be picked at random. I don't want the random picking to be based on a single column.
One approach I thought of is to use the index and numpy: divide the index into lots and use those to split the dataframe. I wanted to see if someone has an easy, pandas-native way of doing it.
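The index-and-numpy idea described in the question could look roughly like the sketch below. The dataframe contents, the seed, and the variable names are assumptions made for illustration; only the 100 x 30 shape and the 5 lots come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 100-row, 30-column dataframe in the question.
df = pd.DataFrame(np.arange(3000).reshape(100, 30))

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
shuffled_positions = rng.permutation(len(df))  # row positions in random order

# Cut the shuffled positions into 5 chunks and select each chunk by position.
lots = [df.iloc[chunk] for chunk in np.array_split(shuffled_positions, 5)]
```

Each row position appears in exactly one chunk, so no row is repeated across lots. If the row count is not divisible by the number of lots, np.array_split makes the first few chunks one row larger.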
Answered by Patrick Hingston
If you do not care about the new dataframes potentially containing some of the same rows, you could use sample, where frac specifies the fraction of the dataframe that you want:
df1 = df.sample(frac=0.5) # df1 is now a random sample of half the dataframe
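As a minimal sketch of what frac does, and of one way to recover the rows that were not sampled, consider the following. The small dataframe and the random_state value are made up for illustration, and the drop step assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row dataframe used only for this illustration.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("abcd"))

# frac=0.5 draws half of the rows (5 of 10) without replacement.
df1 = df.sample(frac=0.5, random_state=0)

# The remaining rows, selected by dropping the sampled index labels.
# This relies on the index labels being unique.
df2 = df.drop(df1.index)
```

Here df2 keeps the original row order, so only the membership of each half is random, not the ordering within df2.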
EDIT:
If you want to avoid duplicates, you can use shuffle from sklearn:
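from sklearn.utils import shuffle
df = shuffle(df)  # returns a copy of df with the rows in random order
df1 = df.iloc[0:3]  # first three shuffled rows
df2 = df.iloc[3:6]  # next three shuffled rows, no overlap with df1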
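The shuffle-then-slice idea above can be extended to the 5 lots of 20 rows from the question. The sketch below swaps sklearn's shuffle for df.sample(frac=1), which also returns all rows in random order, so that only pandas is needed. The function name split_shuffled and its parameters are hypothetical, and the code assumes the row count divides evenly by the number of lots.

```python
import numpy as np
import pandas as pd


def split_shuffled(df, n_lots, random_state=None):
    """Shuffle all rows once, then cut them into n_lots equal slices.

    Hypothetical helper; assumes len(df) is divisible by n_lots.
    """
    lot_size = len(df) // n_lots
    shuffled = df.sample(frac=1, random_state=random_state)  # all rows, random order
    return [shuffled.iloc[i * lot_size:(i + 1) * lot_size] for i in range(n_lots)]


# Hypothetical stand-in for the 100-row, 30-column dataframe in the question.
df = pd.DataFrame(np.arange(3000).reshape(100, 30))
lots = split_shuffled(df, n_lots=5, random_state=42)
```

Because the rows are shuffled once and the slices do not overlap, every row lands in exactly one lot.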
Answered by SimplySnee
Depending on your need, you could use pandas.DataFrame.sample() to randomly sample your original data frame, df.
df1 = df.sample(n=3)
df2 = df.sample(n=3)
This gives you two subsets, each with 3 randomly chosen rows. Because the two calls are independent, the subsets can share rows.
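If the two subsets must not overlap, one possible variation is to draw all the rows in a single sample() call and then split the result. This is a sketch rather than part of the original answer; the example dataframe and random_state are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row dataframe used only for this illustration.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})

# A single call samples without replacement by default,
# so the 6 rows drawn here are all distinct.
picked = df.sample(n=6, random_state=1)

df1 = picked.iloc[:3]  # first three of the drawn rows
df2 = picked.iloc[3:]  # remaining three, disjoint from df1
```

Both subsets still have the same number of rows and are chosen at random, but no row can appear in both.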