Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/53218931/

Time: 2020-08-19 20:16:34  Source: igfitidea

How to unnest (explode) a column in a pandas DataFrame?

Tags: python, pandas, dataframe

Asked by YOBEN_S

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

What should I do to achieve this?



Related question

pandas: When cell contents are lists, create a row for each element in the list

A good question and answer, but it only handles a single list column. (In my answer, the self-defined function works for multiple columns. Also, the accepted answer there uses apply, which is the most time-consuming option and is not recommended; for more information, see "When should I ever want to use pandas apply() in my code?")

Answered by YOBEN_S

As a user of both R and Python, I have seen this type of question a couple of times.

In R, there is a built-in function for this, unnest, from the tidyr package. But in Python (pandas) there was no built-in function for this type of question before pandas 0.25 (see Method 0).

I know object-type columns always make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the columns.

I am using pandas and Python functions for this type of question. If you are worried about the speed of these solutions, check user3483203's answer, since it uses numpy, and numpy is faster most of the time. I recommend Cpython and numba if speed matters in your case.



Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the explode function:

df.explode('B')

       A  B
    0  1  1
    0  1  2
    1  2  1
    1  2  2

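Two details of explode are easy to miss: it repeats the original row label for every list element, and the exploded column keeps object dtype. The sketch below illustrates both. It assumes pandas >= 1.1 for the ignore_index argument; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})

# explode repeats the original row label for every list element
exploded = df.explode('B')
print(exploded.index.tolist())

# ignore_index=True renumbers the rows from 0 (assumes pandas >= 1.1)
renumbered = df.explode('B', ignore_index=True)

# the exploded column keeps object dtype, so cast it when integers are needed
renumbered['B'] = renumbered['B'].astype(int)
print(renumbered)
```

If the original index carries meaning (for example, to join back to other columns later), keeping the repeated labels is usually the more useful form.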

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2


Method 2
Using repeat with the DataFrame constructor, re-create your dataframe (good performance, but awkward with multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
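
The same idea also works when the lists have different lengths, because the repeat counts come from the length of each list. The following is a minimal self-contained sketch of that case; the sample data and the names counts and flat are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [3, 4, 5], [6]]})

# number of elements in each list: 2, 3, 1
counts = df['B'].str.len()

flat = pd.DataFrame({
    # each value of A is repeated once per element of its list
    'A': df['A'].repeat(counts).to_numpy(),
    # all lists are joined end to end into one flat array
    'B': np.concatenate(df['B'].to_numpy()),
})
print(flat)
```

Converting the repeated column with to_numpy() drops the duplicated index labels, so the result gets a fresh 0..n-1 index.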

Method 2.1
For example, besides A we have A.1 ... A.n. If we still use the method above (Method 2), it is hard to re-create the columns one by one.

Solution: join or merge on the index after unnesting the single column

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',axis=1),how='left')
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B',axis=1),how='left').reindex(columns=df.columns)


Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]


Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

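One behaviour worth knowing about the repeat-based methods is what happens to empty lists. The sketch below compares the loc variant with DataFrame.explode on a frame that contains an empty list; the sample data and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [], [3]]})

# repeat-based approach: a length of 0 repeats the row zero times,
# so the row with A == 2 disappears from the result
counts = df['B'].str.len()
by_repeat = (df.loc[df.index.repeat(counts)]
               .assign(B=np.concatenate(df['B'].to_numpy())))

# DataFrame.explode (pandas >= 0.25) keeps that row and fills B with NaN
by_explode = df.explode('B')

print(by_repeat)
print(by_explode)
```

Which behaviour is preferable depends on whether rows without elements should survive the unnesting.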

Method 5
when the list only contains unique values:


df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2


Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2


Method 7
using the base functions itertools.cycle and itertools.chain: a pure Python solution, just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2


Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-defined function:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, axis=1), how='left')


unnesting(df,['B','C'])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2
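
On newer pandas versions the same result can be obtained without a helper function. The sketch below assumes pandas >= 1.3, where explode accepts a list of columns, and it assumes that the lists in B and C have the same length within each row (otherwise explode raises an error).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})

# assumes pandas >= 1.3: B and C are exploded together, element by element
out = df.explode(['B', 'C'])
print(out)
```

Unlike the helper above, this keeps the original column order (A, B, C).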


Column-wise Unnesting

All of the methods above deal with vertical unnesting (explode). If you need to expand the list horizontally, use the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]: 
   A       B       C  B_0  B_1
0  1  [1, 2]  [1, 2]    1    2
1  2  [3, 4]  [3, 4]    3    4
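
When the lists have different lengths, the constructor pads the shorter rows with missing values. A small sketch of that case follows; the sample data and the names wide and result are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2, 3], [4]]})

# one column per list position; shorter lists are padded with NaN
wide = pd.DataFrame(df['B'].tolist(), index=df.index).add_prefix('B_')

# replace the list column with the expanded columns
result = df.drop(columns='B').join(wide)
print(result)
```

Columns that receive padding become float, since NaN cannot be stored in an integer column.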

Updated function

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx

        return df1.join(df.drop(explode, axis=1), how='left')
    else :
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]: 
   B0  B1  C0  C1  A
0   1   2   1   2  1
1   3   4   3   4  2

Answered by user3483203

Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist())    
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2


Option 2

If the sublists have different lengths, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals]    
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2


Option 3

I took a shot at generalizing this to flatten N columns and tile M columns. I'll work later on making it more efficient:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile +  ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2


Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', axis=1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
       index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
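
The benchmark above relies on importing names from __main__, which only works when it is run as a script. As a rough alternative, the sketch below passes the names to timeit through its globals argument and compares a numpy-array variant of the repeat approach with DataFrame.explode. The function names, sizes and repetition count are assumptions for illustration, and the measured times will differ between machines and pandas versions.

```python
from timeit import timeit

import numpy as np
import pandas as pd


def repeat_concat(df):
    # variant of the repeat + concatenate approach, built from plain arrays
    counts = df['B'].str.len().to_numpy()
    return pd.DataFrame({'A': np.repeat(df['A'].to_numpy(), counts),
                         'B': np.concatenate(df['B'].to_numpy())})


def builtin_explode(df):
    # DataFrame.explode, available from pandas 0.25
    return df.explode('B')


base = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
timings = {}
for n in (10, 100):
    big = pd.concat([base] * n)
    for func in (repeat_concat, builtin_explode):
        # globals= hands the names to timeit without importing from __main__
        seconds = timeit('func(big)', globals={'func': func, 'big': big}, number=5)
        timings[(func.__name__, n)] = seconds

for key, seconds in timings.items():
    print(key, round(seconds, 5))
```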

Performance

[Figure: log-log plot of relative time against N for each function]

Answered by joelostblom

Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df.explode('B')

Out:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Answered by Dani Mesejo

One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:

import numpy as np
import pandas as pd


def unnest(frame, explode):
    def mesh(values):
        return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

    data = np.vstack([mesh(row) for row in frame[explode].values])
    return pd.DataFrame(data=data, columns=explode)


df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
print(unnest(df, ['A', 'B']))  # base
print()

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
print(unnest(df, ['A', 'B', 'C']))  # multiple columns
print()

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

print(unnest(df, ['A', 'B']))  # uneven length lists
print()
print(unnest(df, ['D', 'B']))  # different types
print()

Output

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

   A  B  C
0  1  1  1
1  1  2  1
2  1  1  2
3  1  2  2
4  2  3  3
5  2  4  3
6  2  3  4
7  2  4  4

   A  B
0  1  1
1  1  2
2  2  1
3  2  2
4  2  3
5  3  1

   D  B
0  A  1
1  A  2
2  B  1
3  B  2
4  B  3
5  C  1

Answered by ayorgo

My 5 cents:

df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

pd.concat([df[['A', 'B']], df[['A', 'B2']].rename(columns={'B2': 'B'})],
          ignore_index=True)

and another 5

df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', axis=1), 'B', 'A', '')
 .reset_index(level=1, drop=True)
 .reset_index())

Both result in the same output:

   A  B
0  1  1
1  2  1
2  1  2
3  2  2

Answered by Ze Tang

Normally the sublists have different lengths, and join/merge is far more computationally expensive, so I retested the methods with sublists of different lengths and more ordinary columns.

MultiIndex should also be an easier way to write this, and it has nearly the same performance as the numpy way.

Surprisingly, in my implementation the comprehension approach has the best performance.

def stack(df):
    return df.set_index(['A', 'C']).B.apply(pd.Series).stack()


def comprehension(df):
    return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])


def multiindex(df):
    return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))


def array(df):
    return pd.DataFrame(
        np.column_stack((
            np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
            np.concatenate(df.B.values)
        ))
    )


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=[
        'stack',
        'comprehension',
        'multiindex',
        'array',
    ],
    columns=[1000, 2000, 5000, 10000, 20000, 50000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
        df = pd.concat([df] * c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=20)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

Performance

[Figure: relative time of each method]

Answered by Markus Dutschke

I generalized the problem a bit to be applicable to more columns.

Summary of what my solution does:

In[74]: df
Out[74]: 
    A   B             C             columnD
0  A1  B1  [C1.1, C1.2]                D1
1  A2  B2  [C2.1, C2.2]  [D2.1, D2.2, D2.3]
2  A3  B3            C3        [D3.1, D3.2]

In[75]: dfListExplode(df,['C','columnD'])
Out[75]: 
    A   B     C columnD
0  A1  B1  C1.1    D1
1  A1  B1  C1.2    D1
2  A2  B2  C2.1    D2.1
3  A2  B2  C2.1    D2.2
4  A2  B2  C2.1    D2.3
5  A2  B2  C2.2    D2.1
6  A2  B2  C2.2    D2.2
7  A2  B2  C2.2    D2.3
8  A3  B3    C3    D3.1
9  A3  B3    C3    D3.2

Complete example:

The actual explosion is performed in 3 lines. The rest is cosmetics (multi-column explosion, handling of strings instead of lists in the explosion column, ...).

import pandas as pd
import numpy as np

df=pd.DataFrame( {'A': ['A1','A2','A3'],
                  'B': ['B1','B2','B3'],
                  'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
                  'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
                  })
print('df',df, sep='\n')

def dfListExplode(df, explodeKeys):
    if not isinstance(explodeKeys, list):
        explodeKeys=[explodeKeys]
    # recursive handling of explodeKeys
    if len(explodeKeys)==0:
        return df
    elif len(explodeKeys)==1:
        explodeKey=explodeKeys[0]
    else:
        return dfListExplode( dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
    # perform explosion/unnesting for key: explodeKey
    dfPrep=df[explodeKey].apply(lambda x: x if isinstance(x,list) else [x]) #casts all elements to a list
    dfIndExpl=pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index,dfPrep.values) for z in y ], columns=['explodedIndex',explodeKey])
    dfMerged=dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
    dfReind=dfMerged.reindex(columns=list(df))
    return dfReind

dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')

Credit goes to WeNYoBen's answer.

Answered by U10-Forward

Something I would really not recommend (but it at least works in this case):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

This combines concat + sort_index + iter + apply + next.

Now:

print(df)

Is:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

If you care about the index:

df=df.reset_index(drop=True)

Now:

print(df)

Is:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Answered by Ben Pap

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

pd.concat([df['A'], pd.DataFrame(df['B'].values.tolist())], axis = 1)\
  .melt(id_vars = 'A', value_name = 'B')\
  .dropna()\
  .drop('variable', axis = 1)

    A   B
0   1   1
1   2   1
2   1   2
3   2   2
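
In the example above every list has two elements, so dropna has nothing to remove. The sketch below shows what the same chain does with lists of different lengths; the sample data, the extra sort and the variable names are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3]]})

# the shorter list is padded with NaN when B is spread into columns
wide = pd.concat([df['A'], pd.DataFrame(df['B'].tolist())], axis=1)

long = (wide.melt(id_vars='A', value_name='B')
            .dropna()                      # removes the padding rows
            .drop(columns='variable')
            .sort_values(['A', 'B'])       # restore a row-by-row order
            .reset_index(drop=True))
print(long)
```

Note that B comes back as float here, because the padding forces the intermediate columns to hold NaN.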

Any opinions on this method I thought of? Or is doing both concat and melt considered too "expensive"?

Answered by piRSquared

Problem Setup

Assume there are multiple columns containing objects of different lengths.

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': [[1, 2], [3, 4, 5]]
})

df

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]

When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be "zipped" together.

   A       B          C
0  1  [1, 2]     [1, 2]  # Typical to assume these should be zipped [(1, 1), (2, 2)]
1  2  [3, 4]  [3, 4, 5]

However, the assumption gets challenged when we see objects of different lengths. Should we "zip"? If so, how do we handle the excess in one of the objects? Or maybe we want the product of all of the objects. This will get big fast, but might be what is wanted.

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (4, 4), (None, 5)]?

OR

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]

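The two interpretations in the comments above correspond to two functions from itertools. A minimal sketch on the second row's lists (the names b, c, zipped and crossed are only illustrative):

```python
from itertools import product, zip_longest

b = [3, 4]
c = [3, 4, 5]

# "zip" pairing: walk both lists in step and pad the shorter one with None
zipped = list(zip_longest(b, c))
print(zipped)   # [(3, 3), (4, 4), (None, 5)]

# "product" pairing: every element of b combined with every element of c
crossed = list(product(b, c))
print(crossed)  # [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]
```

The zipped result grows with the longest list, while the product grows with the product of the list lengths.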

The Function

This function gracefully handles zip or product based on a parameter and, when zipping, pads to the length of the longest object using zip_longest.

from itertools import zip_longest, product

def xplode(df, explode, zipped=True):
    method = zip_longest if zipped else product

    rest = {*df} - {*explode}

    zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
    tups = [tup + exploded
     for tup, pre in zipped
     for exploded in method(*pre)]

    return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]


Zipped

xplode(df, ['B', 'C'])

   A    B  C
0  1  1.0  1
1  1  2.0  2
2  2  3.0  3
3  2  4.0  4
4  2  NaN  5


Product

xplode(df, ['B', 'C'], zipped=False)

   A  B  C
0  1  1  1
1  1  1  2
2  1  2  1
3  1  2  2
4  2  3  3
5  2  3  4
6  2  3  5
7  2  4  3
8  2  4  4
9  2  4  5


New Setup

Varying the example a bit:

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': 'C',
    'D': [[1, 2], [3, 4, 5]],
    'E': [('X', 'Y', 'Z'), ('W',)]
})

df

   A       B  C          D          E
0  1  [1, 2]  C     [1, 2]  (X, Y, Z)
1  2  [3, 4]  C  [3, 4, 5]       (W,)


Zipped

xplode(df, ['B', 'D', 'E'])

   A    B  C    D     E
0  1  1.0  C  1.0     X
1  1  2.0  C  2.0     Y
2  1  NaN  C  NaN     Z
3  2  3.0  C  3.0     W
4  2  4.0  C  4.0  None
5  2  NaN  C  5.0  None


Product

xplode(df, ['B', 'D', 'E'], zipped=False)

    A  B  C  D  E
0   1  1  C  1  X
1   1  1  C  1  Y
2   1  1  C  1  Z
3   1  1  C  2  X
4   1  1  C  2  Y
5   1  1  C  2  Z
6   1  2  C  1  X
7   1  2  C  1  Y
8   1  2  C  1  Z
9   1  2  C  2  X
10  1  2  C  2  Y
11  1  2  C  2  Z
12  2  3  C  3  W
13  2  3  C  4  W
14  2  3  C  5  W
15  2  4  C  3  W
16  2  4  C  4  W
17  2  4  C  5  W