Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/50731229/

Date: 2020-09-14 05:39:37  Source: igfitidea

Split cell into multiple rows in pandas dataframe

python pandas dataframe

Asked by Nobel

I have a dataframe containing order data. Each order has multiple packages, stored as comma-separated strings in the package and package_code columns.

I want to split the packages data and create a row for each package including its order details

Here is a sample input dataframe:

import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1,p2,p3","p4","p5,p6"],"package_code":["#111,#222,#333","#444","#555,#666"]})

Input Dataframe

The output I am trying to achieve (shown as an image in the original post) has one row per package, with the order details repeated on each row.

How can I do that with pandas?

Accepted answer by jpp

Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for a small number of columns; otherwise, stack-based methods may fare better.

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
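
The repeat-and-chain idea above can also be written as a loop over column names, which avoids spelling out every column by hand. This is only a sketch: the names repeat_cols and split_cols are made up for illustration, and it assumes every splittable column has the same number of items per row. Passing plain NumPy arrays to np.repeat also gives the result a fresh 0..n-1 index rather than the repeated one shown above.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})

# Hypothetical names for this sketch: columns to repeat and columns to split.
repeat_cols = ["order_id", "order_date"]
split_cols = ["package", "package_code"]

# Number of items per row, taken from the first splittable column
# (assumes the other splittable columns have matching counts).
lens = df[split_cols[0]].str.split(",").str.len().to_numpy()

# Repeat the untouched columns, chain the split ones.
data = {col: np.repeat(df[col].to_numpy(), lens) for col in repeat_cols}
for col in split_cols:
    data[col] = list(chain.from_iterable(df[col].str.split(",")))

res = pd.DataFrame(data)
print(res)
```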

Answer by cs95

pandas >= 0.25

Assuming all splittable columns have the same number of comma-separated items in each row, you can split on the comma and then use Series.explode on each column:

(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: x.str.split(',').explode())
   .reset_index())                                                   

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
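
Since this approach relies on every splittable column having the same number of items in a given row, it may be worth checking that assumption first. The helper below is a hypothetical sketch (mismatched_rows is not a pandas function); it counts commas per cell and returns the index labels of rows where the counts disagree.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})


def mismatched_rows(frame, cols):
    """Return index labels of rows whose columns split into different item counts."""
    # Items per cell = number of commas + 1.
    counts = frame[cols].apply(lambda s: s.str.count(",") + 1)
    # A row is inconsistent when its columns hold more than one distinct count.
    inconsistent = (counts.nunique(axis=1) > 1).to_numpy()
    return frame.index[inconsistent].tolist()


print(mismatched_rows(df, ["package", "package_code"]))

# A deliberately broken copy: row 1 now has one package but two codes.
bad = df.copy()
bad.loc[1, "package_code"] = "#444,#445"
print(mismatched_rows(bad, ["package", "package_code"]))
```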

Details

First, set the columns that should be left untouched as the index:

df.set_index(['order_id', 'order_date'])                                                                                                             

                      package    package_code
order_id order_date                          
1        20/5/2018   p1,p2,p3  #111,#222,#333
3        22/5/2018         p4            #444
7        23/5/2018      p5,p6       #555,#666

The next step is a two-step process: split on the comma to get a column of lists, then call explode to expand the list values into their own rows.

_.apply(lambda x: x.str.split(',').explode())                                                                                                        

                    package package_code
order_id order_date                     
1        20/5/2018       p1         #111
         20/5/2018       p2         #222
         20/5/2018       p3         #333
3        22/5/2018       p4         #444
7        23/5/2018       p5         #555
         23/5/2018       p6         #666

Finally, reset the index.

_.reset_index()                                                                                                                                      

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666
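
As a possible alternative to the steps above, and assuming pandas 1.3 or later (where DataFrame.explode accepts a list of columns), the same result can be sketched without moving columns into the index. The lists in each row must have matching lengths across the exploded columns.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})

out = (
    # Turn each comma-separated string into a list.
    df.assign(
        package=df["package"].str.split(","),
        package_code=df["package_code"].str.split(","),
    )
    # Explode both list columns together; ignore_index gives a fresh 0..n-1 index.
    .explode(["package", "package_code"], ignore_index=True)
)
print(out)
```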


pandas <= 0.24

This should work for any number of columns like these. The essence is a little stack/unstack magic with str.split.

(df.set_index(['order_date', 'order_id'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index()
)

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666

There is another performant alternative involving itertools.chain, but you'd need to explicitly chain and repeat every column (a bit of a problem when there are many columns). Choose whichever fits your problem best, as there's no single answer.

Details

First, set the columns that are not to be touched as the index.

df.set_index(['order_date', 'order_id'])

                      package    package_code
order_date order_id                          
20/5/2018  1         p1,p2,p3  #111,#222,#333
22/5/2018  3               p4            #444
23/5/2018  7            p5,p6       #555,#666

Next, call stack to move the column labels into the row index.

_.stack()

order_date  order_id              
20/5/2018   1         package               p1,p2,p3
                      package_code    #111,#222,#333
22/5/2018   3         package                     p4
                      package_code              #444
23/5/2018   7         package                  p5,p6
                      package_code         #555,#666
dtype: object

We now have a Series, so call str.split on the comma.

_.str.split(',', expand=True)

                                     0     1     2
order_date order_id                               
20/5/2018  1        package         p1    p2    p3
                    package_code  #111  #222  #333
22/5/2018  3        package         p4  None  None
                    package_code  #444  None  None
23/5/2018  7        package         p5    p6  None
                    package_code  #555  #666  None

We need to get rid of the None values, so call stack again, which drops missing values by default in these pandas versions.

_.stack()

order_date  order_id                 
20/5/2018   1         package       0      p1
                                    1      p2
                                    2      p3
                      package_code  0    #111
                                    1    #222
                                    2    #333
22/5/2018   3         package       0      p4
                      package_code  0    #444
23/5/2018   7         package       0      p5
                                    1      p6
                      package_code  0    #555
                                    1    #666
dtype: object

We're almost there. Now we want the second-to-last level of the index to become our columns, so call unstack(-2) (unstack on the second-to-last level):

_.unstack(-2)

                      package package_code
order_date order_id                       
20/5/2018  1        0      p1         #111
                    1      p2         #222
                    2      p3         #333
22/5/2018  3        0      p4         #444
23/5/2018  7        0      p5         #555
                    1      p6         #666

Get rid of the superfluous last level using reset_index:

_.reset_index(-1, drop=True)

                    package package_code
order_date order_id                     
20/5/2018  1             p1         #111
           1             p2         #222
           1             p3         #333
22/5/2018  3             p4         #444
23/5/2018  7             p5         #555
           7             p6         #666

And finally, reset the index:

_.reset_index()

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
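
The walkthrough above leans on the second stack call silently discarding the padding values. Newer pandas releases are changing how stack treats missing values, so a variant with an explicit dropna may be more robust. This is a sketch under that assumption, not part of the original answer, and older versions may emit a FutureWarning on stack.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})

out = (
    df.set_index(["order_date", "order_id"])
      .stack()                          # column labels move into the row index
      .str.split(",", expand=True)      # one column per split position, padded with missing values
      .stack()                          # split positions move into the row index
      .dropna()                         # drop the padding explicitly instead of relying on stack
      .unstack(-2)                      # package / package_code become columns again
      .reset_index(-1, drop=True)       # discard the split-position level
      .reset_index()
)
print(out)
```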

Answer by Heraknos

Have a look at today's pandas 0.25 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}])
df.assign(var1=df.var1.str.split(',')).explode('var1').reset_index(drop=True)
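
To see why the trailing reset_index(drop=True) is there, the small sketch below (same toy data, with intermediate names chosen only for illustration) inspects the index right after explode: each new row keeps the index label of the row it came from.

```python
import pandas as pd

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}])

# Split into lists, then give each list element its own row.
exploded = df.assign(var1=df["var1"].str.split(",")).explode("var1")
print(exploded.index.tolist())   # index labels repeat: [0, 0, 0, 1, 1, 1]

# Renumber the rows 0..n-1 and discard the repeated labels.
out = exploded.reset_index(drop=True)
print(out)
```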

Answer by YOBEN_S

Close to cold's method :-)

df.set_index(['order_date','order_id']).apply(lambda x : x.str.split(',')).stack().apply(pd.Series).stack().unstack(level=2).reset_index(level=[0,1])
Out[538]: 
  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
0  22/5/2018         3      p4         #444
0  23/5/2018         7      p5         #555
1  23/5/2018         7      p6         #666