Efficient way to do Pandas permutation on a large DataFrame

Question

Asked by Einar

Currently I have a pandas DataFrame like this:

 ID                    A1      A2       A3       B1       B2       B3
 Ku8QhfS0n_hIOABXuE    6.343   6.304    6.410    6.287    6.403    6.279
 fqPEquJRRlSVSfL.8A    6.752   6.681    6.680    6.677    6.525    6.739
 ckiehnugOno9d7vf1Q    6.297   6.248    6.524    6.382    6.316    6.453
 x57Vw5B5Fbt5JUnQkI    6.268   6.451    6.379    6.371    6.458    6.333

This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.

EDIT: The permutation is along the rows, i.e. shuffle the index for each column.

The biggest issue here is one of performance. I want to permute things in a fast way.

An example I had in mind was:

import random
import joblib

def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2) # all cores minus 1
result = pool(permute(dataframe) for item in range(100))

The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would be without running in parallel, so the results lose stability when I use the permuted data in follow-up calculations.

So my only "solution" was to precalculate all indices for all columns before running the parallel code, which slows things down considerably.

My questions are:

Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
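
On the parallel question, here is a minimal sketch (not from the original post) of one way to keep the random streams independent and reproducible: derive one child seed per permutation from a single `numpy.random.SeedSequence`, so the outcome does not depend on how worker processes initialize their global RNG state. The helper names `permuted_index_table` and `apply_index_table` are hypothetical, and integer positions are used instead of index labels as an assumption.

```python
import numpy as np


def permuted_index_table(n_rows, n_cols, seed_seq):
    """Build an (n_rows, n_cols) table of row positions, one permutation per column."""
    rng = np.random.default_rng(seed_seq)  # private generator for this task
    return np.column_stack([rng.permutation(n_rows) for _ in range(n_cols)])


def apply_index_table(values, index_table):
    """Reorder each column of a 2-D array by its own column of row positions."""
    return values[index_table, np.arange(values.shape[1])]


# One independent child seed per permutation, all derived from a single root seed.
child_seeds = np.random.SeedSequence(1234567890).spawn(3)

values = np.arange(24.0).reshape(6, 4)  # stand-in for the real data
tables = [permuted_index_table(6, 4, s) for s in child_seeds]
shuffled = [apply_index_table(values, t) for t in tables]
```

Each `(n_rows, n_cols, child_seed)` task is self-contained, so the list comprehension could be swapped for `joblib.delayed` calls like the ones above without the tasks sharing any random state.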

EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:

Ku8QhfS0n_hIOABXuE    6.268   
fqPEquJRRlSVSfL.8A    6.343
ckiehnugOno9d7vf1Q    6.752
x57Vw5B5Fbt5JUnQkI    6.297

(i.e. the row values were moving around).

EDIT2: Here's what I'm using now:

import random
import pandas

def _generate_indices(indices, columns, nperm):

    random.seed(1234567890)  # fixed seed for reproducible permutations
    genes = list(indices)    # random.sample needs a sequence
    num_genes = len(genes)

    for item in range(nperm):

        permuted = pandas.DataFrame(
            {column: random.sample(genes, num_genes) for column in columns},
            index=range(num_genes)
        )

        yield permuted

(in short, building a DataFrame of resampled indices for each column)

And later on (yes, I know it's pretty ugly):

# Data is the original DataFrame
# Indices is one of the results of that generator

permuted = dict()

for column in data.columns:

    value = data[column]
    permuted[column] = value[indices[column].values].values

permuted_table = pandas.DataFrame(permuted, index=data.index)

Answer 1

Answered by spencerlyon2

How about this:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(50000, 10))

In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df

In [4]: df.head()
Out[4]:
          0         1         2         3         4         5         6         7         8         9
0  0.329588 -0.513814 -1.267923  0.691889 -0.319635 -1.468145 -0.441789  0.004142 -0.362073 -0.555779
1  0.495670  2.460727  1.174324  1.115692  1.214057 -0.843138  0.217075  0.495385  1.568166  0.252299
2 -0.898075  0.994281 -0.281349 -0.104684 -1.686646  0.651502 -1.466679 -1.256705  1.354484  0.626840
3  1.158388 -1.227794 -0.462005 -1.790205  0.399956 -1.631035 -1.707944 -1.126572 -0.892759  1.396455
4 -0.049915  0.006599 -1.099983  0.775028 -0.694906 -1.376802 -0.152225  1.413212  0.050213 -0.209760

In [5]: shuffle(df, 1).head(5)
Out[5]:
          0         1         2         3         4         5         6         7         8         9
0  2.044131  0.072214 -0.304449  0.201148  1.462055  0.538476 -0.059249 -0.133299  2.925301  0.529678
1  0.036957  0.214003 -1.042905 -0.029864  1.616543  0.840719  0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621  0.657951  1.132175 -0.815403  0.548210 -0.029291  0.575587  0.032481 -0.261873  0.010381
3  1.396024  0.859455 -1.514801  0.353378  1.790324  0.286164 -0.765518  1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331  0.

In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop

This does what you need it to. The only question is whether or not it is fast enough.

Update

Per the comments by @Einar I have changed my solution.

In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df

In [8]: df.head()
Out[8]: 
          0         1         2         3         4         5         6         7         8         9
0 -0.175006 -0.462306  0.565517 -0.309398  1.100570  0.656627  1.207535 -0.221079 -0.933068 -0.192759
1  0.388165  0.155480 -0.015188  0.868497  1.102662 -0.571818 -0.994005  0.600943  2.205520 -0.294121
2  0.281605 -1.637529  2.238149  0.987409 -1.979691 -0.040130  1.121140  1.190092 -0.118919  0.790367
3  1.054509  0.395444  1.239756 -0.439000  0.146727 -1.705972  0.627053 -0.547096 -0.818094 -0.056983
4  0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468  0.512461  1.058758 -0.206019

In [9]: shuffle2(df, 1).head()
Out[9]: 
          0         1         2         3         4         5         6         7         8         9
0  0.054355  0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880  0.593901  0.182513 -1.981521
1  0.624562  1.097495 -0.428710 -0.133220  0.675428  0.892044  0.752593 -0.702470  0.272386 -0.193440
2  0.763551 -0.505923  0.206675  0.561456  0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3  1.149586 -0.003552  2.496176 -0.089767  0.246546 -1.333184  0.524872 -0.527519  0.492978 -0.829365
4 -1.893188  0.728737  0.361983 -0.188709 -0.809291  2.093554  0.396242  0.402482  1.884082  1.373781

In [10]: timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
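
Note that `shuffle2` reorders whole rows together. If every column needs its own independent permutation, as the question's edit describes, a minimal sketch could look like the following. It assumes NumPy 1.20 or newer for `Generator.permuted`; the function name `shuffle_columns` and the small demo frame are hypothetical, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd


def shuffle_columns(df, n_perm, seed=None):
    """Yield n_perm copies of df with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy()
    for _ in range(n_perm):
        # permuted(axis=0) shuffles each column on its own and returns a new
        # array, unlike shuffle(), which moves whole rows together in place.
        shuffled = rng.permuted(values, axis=0)
        yield pd.DataFrame(shuffled, index=df.index, columns=df.columns)


# Small demo frame standing in for the 50,000-row data.
df = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=list("ABCD"))
perms = list(shuffle_columns(df, n_perm=3, seed=0))
```

Because the original index is reattached to each result, the row labels stay in place while the values under them move, which matches the A1 example in the question.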
