Fast way to split column into multiple rows in Pandas
Original question: http://stackoverflow.com/questions/33622470/
Content from Stack Overflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by neversaint
I have the following data frame:
import pandas as pd

df = pd.DataFrame({
    'gene': ["foo", "bar // lal", "qux", "woz"],
    'cell1': [5, 9, 1, 7],
    'cell2': [12, 90, 13, 87],
})
df = df[["gene", "cell1", "cell2"]]
df
That looks like this:
Out[6]:
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
What I want to do is split the 'gene' column so that the result looks like this:
gene  cell1  cell2
foo       5     12
bar       9     90
lal       9     90
qux       1     13
woz       7     87
My current approach is this:
import pandas as pd
import timeit

def create():
    df = pd.DataFrame({
        'gene': ["foo", "bar // lal", "qux", "woz"],
        'cell1': [5, 9, 1, 7],
        'cell2': [12, 90, 13, 87],
    })
    df = df[["gene", "cell1", "cell2"]]
    # Split each cell into a list, turn every list into a Series (the slow step),
    # then stack so that each piece gets its own row.
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    # Drop the inner index level so the pieces line up with the original rows.
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    df.join(s)

if __name__ == '__main__':
    print(timeit.timeit("create()", setup="from __main__ import create", number=100))
    # 0.608163118362
This is very slow. In reality I have around 40K lines to check and process.
What's a fast implementation of that?
Answered by DSM
TBH I think we need a fast built-in way of normalizing elements like this, although since I've been out of the loop for a bit, for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:
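def create(n):
    # Build the sample frame and repeat it n times to get a larger test case.
    df = pd.DataFrame({
        'gene': ["foo", "bar // lal", "qux", "woz"],
        'cell1': [5, 9, 1, 7],
        'cell2': [12, 90, 13, 87],
    })
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df

def orig(df):
    # The approach from the question; note that it deletes a column from df.
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)

def faster(df):
    # Let str.split build the expanded frame directly instead of apply(pd.Series).
    s = df["gene"].str.split(' // ', expand=True).stack()
    # Level 0 of the stacked index is the original row label of each piece.
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()
    df2["gene"] = s.values
    return df2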
which gives me
>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87
for comparable speeds at low sizes, and
>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop
a 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is because orig is destructive: it deletes the 'gene' column from the frame it is given.
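To illustrate what "destructive" means here, the sketch below isolates the `del df["gene"]` step. The helper name `drop_gene_in_place` is hypothetical and only mimics that one line of orig; it is not part of the answer.

```python
import pandas as pd

def drop_gene_in_place(df):
    # Hypothetical helper: mimics the `del df["gene"]` step inside orig.
    del df["gene"]
    return df

df = pd.DataFrame({"gene": ["foo", "bar // lal"], "cell1": [5, 9]})

# Passing a copy leaves the caller's frame untouched.
drop_gene_in_place(df.copy())
has_gene_after_copy = "gene" in df.columns

# Passing the frame itself removes the column for the caller as well.
drop_gene_in_place(df)
has_gene_after_direct = "gene" in df.columns

print(has_gene_after_copy, has_gene_after_direct)
```

This is why the timings pass `df.copy()`: without it, a second timing run of orig would no longer find a 'gene' column to split.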
Answered by George Liu
We can first split the column, expand it, stack it and then join it back to the original df like below:
df.drop('gene', axis=1).join(df['gene'].str.split(' // ', expand=True).stack().reset_index(level=1, drop=True).rename('gene'))
which gives us this:
   cell1  cell2 gene
0      5     12  foo
1      9     90  bar
1      9     90  lal
2      1     13  qux
3      7     87  woz
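The split, expand, stack, join chain described in this answer can be easier to follow with each intermediate named. The following is a minimal sketch on the question's sample data; the variable names `pieces`, `stacked`, and `genes` are illustrative, and the explicit `dropna()` is an added guard (an assumption on my part, not part of the answer) so that the padding from step 1 is removed regardless of how the installed pandas version's stack() treats missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# 1. Split and expand: one column per piece, padded with missing values
#    for rows that have fewer pieces.
pieces = df["gene"].str.split(" // ", expand=True)

# 2. Stack: one entry per piece, indexed by (original row, piece position).
#    dropna() discards any padding that survived the stack.
stacked = pieces.stack().dropna()

# 3. Drop the piece-position level so the index matches df's index again.
genes = stacked.reset_index(level=1, drop=True).rename("gene")

# 4. Join on the index: rows with several pieces are repeated.
result = df.drop("gene", axis=1).join(genes)
print(result)
```

The joined column lands at the end of the frame, which is why 'gene' appears last in the output shown above.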
Answered by U10-Forward
Or use:
df.join(pd.DataFrame(df.gene.str.split(' // ', expand=True).stack().reset_index(level=1, drop=True),
                     columns=['gene '])).drop('gene', axis=1).rename(columns=str.strip).reset_index(drop=True)
Output:
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
2  lal      9     90
3  qux      1     13
4  woz      7     87
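For readers on newer pandas: the answers above predate `DataFrame.explode`, which was added in pandas 0.25 and appears to be the kind of built-in that DSM's answer wishes for. The sketch below assumes pandas 0.25 or later and uses the question's sample data; it is one possible alternative, not a method taken from any of the answers.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

# Replace each cell with the list of its pieces, then emit one row per
# list element. Rows that came from the same cell keep the same index label.
result = df.assign(gene=df["gene"].str.split(" // ")).explode("gene")
print(result)
```

If a fresh 0..n-1 index is wanted, as in the output of the last answer, `result.reset_index(drop=True)` can be chained on afterwards.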