Original question: http://stackoverflow.com/questions/38570268/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Python/Pandas - partitioning a pandas DataFrame in 10 disjoint, equally-sized subsets
Asked by Tomas
I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets.
I know I can randomly sample one tenth of the original pandas DataFrame using:
partition_1 = df.sample(frac=0.1)  # df is the original DataFrame
However, how can I obtain the other nine partitions? If I call df.sample(frac=0.1) again, there is a possibility that my subsets are not disjoint.
Thanks for the help!
Answered by Merlin
Start with this DataFrame:
import numpy as np
import pandas as pd

dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'] * 2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'] * 2})
      A      B
0   foo    one
1   bar    one
2   foo    two
3   bar  three
4   foo    two
5   bar    two
6   foo    one
7   foo  three
8   foo    one
9   bar    one
10  foo    two
11  bar  three
12  foo    two
13  bar    two
14  foo    one
15  foo  three
Usage: change 4 to 10 to get ten partitions, and index the result with [i] to get the i-th slice.
np.random.seed(32) # for reproducible results.
np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[1]
      A    B
2   foo  two
5   bar  two
10  foo  two
12  foo  two
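# Note: this call draws a new permutation, so this slice is not guaranteed to be disjoint from the one above.
np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[3]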
      A      B
13  bar    two
11  bar  three
0   foo    one
7   foo  three
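Because each call above shuffles the index again, slices taken from separate calls can overlap. A minimal sketch that shuffles once and reuses the same split, assuming the same dfm as above; the names rng, shuffled_index, index_chunks, and parts are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Same example frame as above (16 rows).
dfm = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"] * 2,
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"] * 2,
})

rng = np.random.RandomState(32)  # seeded only for reproducibility

# One permutation of the index labels, reused for every slice.
shuffled_index = rng.permutation(dfm.index.to_numpy())

# Split the shuffled labels into 4 chunks, then look up each chunk in the frame.
index_chunks = np.array_split(shuffled_index, 4)
parts = [dfm.loc[chunk] for chunk in index_chunks]

# Every original row lands in exactly one part.
all_labels = np.concatenate([part.index.to_numpy() for part in parts])
assert len(set(all_labels)) == len(dfm)
```

Here parts[i] plays the role of the [i] slice in the usage above; changing 4 to 10 would give ten partitions.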
Answered by SerialDev
Use np.random.permutation:
df.loc[np.random.permutation(df.index)]
This shuffles the rows of the DataFrame and keeps the column names; afterwards you can split the shuffled DataFrame into 10 parts.
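The split step is not spelled out in this answer. One possible way to do it, sketched under the assumption that contiguous chunks of the shuffled frame are acceptable; the example data, n_parts, bounds, and parts are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(25)})  # hypothetical data, 25 rows

np.random.seed(0)  # only for reproducibility
shuffled = df.loc[np.random.permutation(df.index)]

n_parts = 10
# Integer chunk boundaries in row positions, spread as evenly as possible.
bounds = [len(shuffled) * k // n_parts for k in range(n_parts + 1)]

# Consecutive, non-overlapping positional slices of the shuffled frame.
parts = [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

print([len(p) for p in parts])  # [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]
```

Since the slices are consecutive and do not overlap, the parts are disjoint by construction, and their sizes differ by at most one row.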
Answered by Alberto Garcia-Raboso
Say df is your DataFrame, and you want N_PARTITIONS partitions of roughly equal size (they will be of exactly equal size if len(df) is divisible by N_PARTITIONS).
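To make "roughly equal" concrete, here is a small sketch of the sizes you get when rows are dealt out one at a time across the partitions. The helper partition_sizes is hypothetical and exists only for this illustration:

```python
def partition_sizes(n_rows, n_partitions):
    """Sizes obtained when n_rows rows are dealt round-robin into n_partitions groups."""
    base, extra = divmod(n_rows, n_partitions)
    # The first `extra` partitions receive one additional row.
    return [base + 1 if i < extra else base for i in range(n_partitions)]

print(partition_sizes(100, 10))  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
print(partition_sizes(16, 10))   # [2, 2, 2, 2, 2, 2, 1, 1, 1, 1]
```

So the sizes never differ by more than one row, and they are all identical when the row count is a multiple of the number of partitions.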
Use np.random.permutation to permute the array np.arange(len(df)). Then take slices of that array with step N_PARTITIONS, each starting at a different offset, and extract the corresponding rows of your DataFrame with .iloc[].
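import numpy as np

# df is your DataFrame; N_PARTITIONS is the number of partitions you want (e.g. 10).
permuted_indices = np.random.permutation(len(df))

dfs = []
for i in range(N_PARTITIONS):
    # Take every N_PARTITIONS-th shuffled position, starting at offset i.
    dfs.append(df.iloc[permuted_indices[i::N_PARTITIONS]])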
Since you are on Python 2.7, it might be better to replace range(N_PARTITIONS) with xrange(N_PARTITIONS) to get an iterator instead of a list. (In Python 3, range is already lazy and xrange no longer exists.)
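For a quick end-to-end check of the approach in this answer, the following sketch runs it on made-up data and verifies that the partitions are disjoint and cover every row. The 23-row example frame and the name combined are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

N_PARTITIONS = 10
df = pd.DataFrame({"value": np.arange(23)})  # hypothetical data: 23 rows

np.random.seed(0)  # optional, only for reproducibility
permuted_indices = np.random.permutation(len(df))
dfs = [df.iloc[permuted_indices[i::N_PARTITIONS]] for i in range(N_PARTITIONS)]

# Sanity checks: every row appears in exactly one partition.
combined = pd.concat(dfs)
assert len(combined) == len(df)
assert not combined.index.duplicated().any()

print([len(part) for part in dfs])  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```

The list comprehension is equivalent to the explicit loop above; the partition sizes depend only on len(df) and N_PARTITIONS, not on the random seed.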