Split cell into multiple rows in pandas dataframe
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/50731229/
Asked by Nobel
I have a dataframe containing order data; each order has multiple packages stored as comma-separated strings in the package and package_code columns.
I want to split the package data and create a row for each package, including its order details.
Here is a sample input dataframe:
import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1,p2,p3","p4","p5,p6"],"package_code":["#111,#222,#333","#444","#555,#666"]})
And this is what I am trying to achieve as output:
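(The desired output is not shown here; reconstructed from the answers below, it is one row per package:)
order_id order_date package package_code
1 20/5/2018 p1 #111
1 20/5/2018 p2 #222
1 20/5/2018 p3 #333
3 22/5/2018 p4 #444
7 23/5/2018 p5 #555
7 23/5/2018 p6 #666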
How can I do that with pandas?
Accepted answer by jpp
Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for small numbers of columns; otherwise stack-based methods may fare better.
import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)
order_id order_date package package_code
0 1 20/5/2018 p1 #111
0 1 20/5/2018 p2 #222
0 1 20/5/2018 p3 #333
1 3 22/5/2018 p4 #444
2 7 23/5/2018 p5 #555
2 7 23/5/2018 p6 #666
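A small follow-up that is not part of the original answer: np.repeat keeps the source index, which is why the index above repeats (0, 0, 0, 1, 2, 2). If a clean sequential index is preferred, a minimal fix is:
# replace the repeated source index with a fresh 0..n-1 range
res = res.reset_index(drop=True)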
Answer by cs95
pandas >= 0.25
Assuming all splittable columns have the same number of comma-separated items, you can split on the comma and then use Series.explode on each column:
(df.set_index(['order_id', 'order_date'])
.apply(lambda x: x.str.split(',').explode())
.reset_index())
order_id order_date package package_code
0 1 20/5/2018 p1 #111
1 1 20/5/2018 p2 #222
2 1 20/5/2018 p3 #333
3 3 22/5/2018 p4 #444
4 7 23/5/2018 p5 #555
5 7 23/5/2018 p6 #666
Details
Set the columns not to be touched as the index,
df.set_index(['order_id', 'order_date'])
package package_code
order_id order_date
1 20/5/2018 p1,p2,p3 #111,#222,#333
3 22/5/2018 p4 #444
7 23/5/2018 p5,p6 #555,#666
The next step is a 2-step process: split on the comma to get a column of lists, then call explode to explode the list values into their own rows.
_.apply(lambda x: x.str.split(',').explode())
package package_code
order_id order_date
1 20/5/2018 p1 #111
20/5/2018 p2 #222
20/5/2018 p3 #333
3 22/5/2018 p4 #444
7 23/5/2018 p5 #555
23/5/2018 p6 #666
Finally, reset the index.
_.reset_index()
order_id order_date package package_code
0 1 20/5/2018 p1 #111
1 1 20/5/2018 p2 #222
2 1 20/5/2018 p3 #333
3 3 22/5/2018 p4 #444
4 7 23/5/2018 p5 #555
5 7 23/5/2018 p6 #666
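As a side note beyond the original answer: from pandas 1.3 onward, DataFrame.explode also accepts a list of columns, so a minimal sketch of the same idea without the apply (still assuming matching item counts per row, as in this data) would be:
# pandas >= 1.3: explode several columns in one call
out = (df.assign(package=df['package'].str.split(','),
                 package_code=df['package_code'].str.split(','))
         .explode(['package', 'package_code'])
         .reset_index(drop=True))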
pandas <= 0.24
This should work for any number of columns like this. The essence is a little stack-unstacking magic with str.split.
(df.set_index(['order_date', 'order_id'])
.stack()
.str.split(',', expand=True)
.stack()
.unstack(-2)
.reset_index(-1, drop=True)
.reset_index()
)
order_date order_id package package_code
0 20/5/2018 1 p1 #111
1 20/5/2018 1 p2 #222
2 20/5/2018 1 p3 #333
3 22/5/2018 3 p4 #444
4 23/5/2018 7 p5 #555
5 23/5/2018 7 p6 #666
There is another performant alternative involving chain, but you'd need to explicitly chain and repeat every column (a bit of a problem with a lot of columns). Choose whatever fits the description of your problem best, as there's no single answer.
Details
First, set the columns that are not to be touched as the index.
df.set_index(['order_date', 'order_id'])
package package_code
order_date order_id
20/5/2018 1 p1,p2,p3 #111,#222,#333
22/5/2018 3 p4 #444
23/5/2018 7 p5,p6 #555,#666
Next, stack the rows.
_.stack()
order_date order_id
20/5/2018 1 package p1,p2,p3
package_code #111,#222,#333
22/5/2018 3 package p4
package_code #444
23/5/2018 7 package p5,p6
package_code #555,#666
dtype: object
We have a series now. So call str.split on the comma.
_.str.split(',', expand=True)
0 1 2
order_date order_id
20/5/2018 1 package p1 p2 p3
package_code #111 #222 #333
22/5/2018 3 package p4 None None
package_code #444 None None
23/5/2018 7 package p5 p6 None
package_code #555 #666 None
We need to get rid of NULL values, so call stack again.
_.stack()
order_date order_id
20/5/2018 1 package 0 p1
1 p2
2 p3
package_code 0 #111
1 #222
2 #333
22/5/2018 3 package 0 p4
package_code 0 #444
23/5/2018 7 package 0 p5
1 p6
package_code 0 #555
1 #666
dtype: object
We're almost there. Now we want the second last level of the index to become our columns, so unstack using unstack(-2) (unstack on the second last level).
_.unstack(-2)
package package_code
order_date order_id
20/5/2018 1 0 p1 #111
1 p2 #222
2 p3 #333
22/5/2018 3 0 p4 #444
23/5/2018 7 0 p5 #555
1 p6 #666
Get rid of the superfluous last level using reset_index:
_.reset_index(-1, drop=True)
package package_code
order_date order_id
20/5/2018 1 p1 #111
1 p2 #222
1 p3 #333
22/5/2018 3 p4 #444
23/5/2018 7 p5 #555
7 p6 #666
And finally,
_.reset_index()
order_date order_id package package_code
0 20/5/2018 1 p1 #111
1 20/5/2018 1 p2 #222
2 20/5/2018 1 p3 #333
3 22/5/2018 3 p4 #444
4 23/5/2018 7 p5 #555
5 23/5/2018 7 p6 #666
Answer by Heraknos
Have a look at today's pandas release 0.25 : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows
df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}])
df.assign(var1=df.var1.str.split(',')).explode('var1').reset_index(drop=True)
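For reference (not part of the original answer), that snippet should produce something like this on pandas >= 0.25:
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2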
Answer by YOBEN_S
Close to cold's method :-)
df.set_index(['order_date','order_id']).apply(lambda x : x.str.split(',')).stack().apply(pd.Series).stack().unstack(level=2).reset_index(level=[0,1])
Out[538]:
order_date order_id package package_code
0 20/5/2018 1 p1 #111
1 20/5/2018 1 p2 #222
2 20/5/2018 1 p3 #333
0 22/5/2018 3 p4 #444
0 23/5/2018 7 p5 #555
1 23/5/2018 7 p6 #666