Fast way to split a column into multiple rows in Pandas

Question

Asked by neversaint

I have the following data frame:

import pandas as pd
df = pd.DataFrame({ 'gene':["foo",
                            "bar // lal",
                            "qux",
                            "woz"], 'cell1':[5,9,1,7], 'cell2':[12,90,13,87]})
df = df[["gene","cell1","cell2"]]
df

That looks like this:

Out[6]:
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87

What I want to do is split the 'gene' column so that the result looks like this:

         gene  cell1  cell2
         foo      5     12
         bar      9     90
         lal      9     90
         qux      1     13
         woz      7     87

My current approach is this:

import pandas as pd
import timeit

def create():
    df = pd.DataFrame({ 'gene':["foo",
                            "bar // lal",
                            "qux",
                            "woz"], 'cell1':[5,9,1,7], 'cell2':[12,90,13,87]})
    df = df[["gene","cell1","cell2"]]

    s = df["gene"].str.split(' // ').apply(pd.Series,1).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    df.join(s)


if __name__ == '__main__':
    print(timeit.timeit("create()", setup="from __main__ import create", number=100))
    # 0.608163118362

This is very slow. In reality I have around 40K lines to check and process.

What is a fast implementation of this?

Answer 1

Answered by DSM

TBH I think we need a fast built-in way of normalizing elements like this, although since I've been out of the loop for a bit, for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:

def create(n):
    df = pd.DataFrame({ 'gene':["foo",
                                "bar // lal",
                                "qux",
                                "woz"], 
                        'cell1':[5,9,1,7], 'cell2':[12,90,13,87]})
    df = df[["gene","cell1","cell2"]]
    df = pd.concat([df]*n)
    df = df.reset_index(drop=True)
    return df

def orig(df):
    s = df["gene"].str.split(' // ').apply(pd.Series,1).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)

def faster(df):
    s = df["gene"].str.split(' // ', expand=True).stack()
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()
    df2["gene"] = s.values
    return df2

which gives me

>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87

for comparable speeds at low sizes, and

>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop

a 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is because orig is destructive (it deletes the 'gene' column from the frame it is given).
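
Newer pandas releases (0.25 and later) do ship a built-in for this kind of normalization, DataFrame.explode. The following is a minimal sketch, assuming pandas 0.25+ is installed; split_rows is a hypothetical helper name used only for illustration and does not come from any of the answers here.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

def split_rows(frame, column="gene", sep=" // "):
    """Hypothetical helper: one output row per separated value in `column`."""
    out = frame.copy()                         # leave the caller's frame untouched
    out[column] = out[column].str.split(sep)   # each cell becomes a list of pieces
    return out.explode(column)                 # one row per list element

result = split_rows(df)
print(result)
```

As with faster above, the original row labels are repeated for the split rows; chaining .reset_index(drop=True) gives a fresh 0..n-1 index if that is preferred.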

Answer 2

Answered by George Liu

We can first split the column, expand it, stack it and then join it back to the original df like below:

df.drop('gene', axis=1).join(df['gene'].str.split(' // ', expand=True).stack().reset_index(level=1, drop=True).rename('gene'))

which gives us this:

   cell1  cell2 gene
0      5     12  foo
1      9     90  bar
1      9     90  lal
2      1     13  qux
3      7     87  woz
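
The chained one-liner can be easier to follow when each step is given a name. The sketch below is one way to unroll it; the intermediate names (wide, stacked, long) are made up for illustration, and the explicit dropna() is an added assumption: older pandas versions drop missing entries inside stack() on their own, while newer versions may keep them.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# 1. Split into a wide frame: one column per piece, missing where a row has fewer pieces.
wide = df["gene"].str.split(" // ", expand=True)

# 2. Stack the columns into a Series indexed by (original row, piece number).
#    dropna() makes the removal of the missing padding explicit.
stacked = wide.stack().dropna()

# 3. Drop the piece-number level so only the original row label remains.
long = stacked.reset_index(level=1, drop=True).rename("gene")

# 4. Join back on the row label; rows with several pieces are repeated.
result = df.drop("gene", axis=1).join(long)
print(result)
```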

Answer 3

Answered by U10-Forward

Or use:

df.join(pd.DataFrame(df.gene.str.split(' // ', expand=True).stack().reset_index(level=1, drop=True)
                ,columns=['gene '])).drop('gene', axis=1).rename(columns=str.strip).reset_index(drop=True)

Output:

   cell1  cell2 gene
0      5     12  foo
1      9     90  bar
2      9     90  lal
3      1     13  qux
4      7     87  woz
