Python/Pandas - Partition a pandas DataFrame into 10 disjoint, equally-sized subsets

Question

Asked by Tomas

I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets.

I know I can randomly sample one tenth of the original pandas DataFrame using:

partition_1 = df.sample(frac=0.1)  # df is the original DataFrame

However, how can I obtain the other nine partitions? If I call df.sample(frac=0.1) again, there is a possibility that my subsets are not disjoint.

Thanks for the help!

Answer 1

Answered by Merlin

Starting with this.

import numpy as np
import pandas as pd
dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo']*2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']*2})

     A      B
0   foo    one
1   bar    one
2   foo    two
3   bar  three
4   foo    two
5   bar    two
6   foo    one
7   foo  three
8   foo    one
9   bar    one
10  foo    two
11  bar  three
12  foo    two
13  bar    two
14  foo    one
15  foo  three

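Usage:
Change 4 to 10 to get ten partitions, and use [i] to select the i-th slice. Note that every call to np.random.permutation draws a fresh shuffle, so slices taken from two separate calls are not guaranteed to be disjoint; for a true partition, shuffle once, keep the list returned by np.array_split, and index into that list.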

np.random.seed(32) # for reproducible results.
np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[1]
      A    B
2   foo  two
5   bar  two
10  foo  two
12  foo  two

np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[3]

     A      B
13  bar    two
11  bar  three
0   foo    one
7   foo  three
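
A minimal sketch of the same idea with a single shuffle, so that all slices come from one permutation and are disjoint by construction. The frame df, the seed, and the names shuffled_positions and parts are illustrative assumptions, not part of the answer above. The sketch splits the array of shuffled row positions rather than the DataFrame itself, then selects the rows with .iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; any DataFrame works the same way.
df = pd.DataFrame({"A": ["foo", "bar"] * 8, "B": range(16)})

rng = np.random.RandomState(32)  # seeded only for reproducibility
shuffled_positions = rng.permutation(len(df))  # one shuffle for all parts

# np.array_split tolerates lengths that do not divide evenly:
# the first len(df) % n_parts chunks simply get one extra row.
n_parts = 4  # change to 10 for ten partitions
parts = [df.iloc[chunk] for chunk in np.array_split(shuffled_positions, n_parts)]
```

parts[i] is then the i-th subset, and concatenating all of parts gives back every row of df exactly once.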

Answer 2

Answered by SerialDev

Use np.random.permutation:

df.loc[np.random.permutation(df.index)]

This shuffles the rows of the DataFrame while keeping the column names; afterwards you can split the shuffled DataFrame into 10 parts.
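
The splitting step could look roughly like the sketch below, which cuts the shuffled frame into contiguous blocks of equal length. The sample data and the names shuffled, chunk, parts, and leftover are assumptions made for illustration. It also assumes the index labels are unique, because df.loc with a duplicated label returns every matching row. When len(df) is not a multiple of 10, this version keeps the ten parts exactly equal in size and puts the remaining rows in leftover.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 103 rows with a unique default index.
df = pd.DataFrame({"value": np.arange(103)})

np.random.seed(0)  # only to make the example repeatable
shuffled = df.loc[np.random.permutation(df.index)]

n_parts = 10
chunk = len(shuffled) // n_parts  # rows per part (integer division)

# Contiguous, non-overlapping blocks of the shuffled frame.
parts = [shuffled.iloc[i * chunk:(i + 1) * chunk] for i in range(n_parts)]

# Rows that do not fit into ten equal parts.
leftover = shuffled.iloc[n_parts * chunk:]
```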

Answer 3

Answered by Alberto Garcia-Raboso

Say df is your dataframe, and you want N_PARTITIONS partitions of roughly equal size (they will be of exactly equal size if len(df) is divisible by N_PARTITIONS).

Use np.random.permutation to permute the array np.arange(len(df)). Then take slices of that array with step N_PARTITIONS, and extract the corresponding rows of your dataframe with .iloc[].

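import numpy as np

N_PARTITIONS = 10  # number of disjoint subsets wanted

# Shuffle the row positions 0 .. len(df)-1 once.
permuted_indices = np.random.permutation(len(df))

dfs = []
for i in range(N_PARTITIONS):
    # Every N_PARTITIONS-th shuffled position, starting at offset i.
    dfs.append(df.iloc[permuted_indices[i::N_PARTITIONS]])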

Since you are on Python 2.7, it might be better to replace range(N_PARTITIONS) with xrange(N_PARTITIONS) to get an iterator instead of a list.
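
As a rough check of the strided-slice approach, the sketch below wraps it in a helper and inspects the result on a frame whose length is not divisible by the number of partitions. The function name partition_dataframe, the seed parameter, and the sample data are hypothetical; the sketch is written for Python 3 and uses np.random.default_rng, which assumes NumPy 1.17 or later.

```python
import numpy as np
import pandas as pd


def partition_dataframe(df, n_partitions, seed=None):
    """Return n_partitions disjoint, randomly composed subsets of df."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(len(df))  # shuffled row positions
    # Slice i takes positions i, i + n_partitions, i + 2 * n_partitions, ...
    return [df.iloc[permuted[i::n_partitions]] for i in range(n_partitions)]


# Hypothetical data: 23 rows, so the partitions cannot all be the same size.
df = pd.DataFrame({"value": np.arange(23)})
dfs = partition_dataframe(df, 10, seed=0)

sizes = [len(part) for part in dfs]
all_labels = np.concatenate([part.index.to_numpy() for part in dfs])

print(sizes)                             # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
print(len(set(all_labels)) == len(df))   # True: no row is repeated or missing
```

The sizes differ by at most one row, and every row lands in exactly one partition.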
