Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45846765/

Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

Tags: python, json, pandas, dataframe

Asked by Moh

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big, and because of that I cannot use the solutions available on the internet: they are very slow and memory-inefficient.

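One common way to get many JSON objects into a single DataFrame is newline-delimited JSON; a minimal sketch, assuming a hypothetical records.jsonl file where each line is one JSON object:

import pandas as pd

# hypothetical file: one JSON object per line
df = pd.read_json('records.jsonl', lines=True)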

Here is what my data looks like:

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After being marked as a duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer is able to explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).

Accepted answer by MaxU

import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # lengths of the lists (all list columns are assumed to have equal lengths per row)
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # all lists in the cells are non-empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list is empty: append those rows back and fill the gaps
        # (DataFrame.append was removed in pandas 2.0; use pd.concat there instead)
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens == 0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
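When some rows contain empty lists, the helper falls back to its else branch: those rows are appended back with their non-list columns and the list columns are filled with fill_value. A minimal sketch with hypothetical data (on pandas >= 2.0 the .append call inside the helper would first need to be swapped for pd.concat):

df2 = pd.DataFrame({'A': ['x1', 'x2'],
                    'B': [['v1', 'v2'], []],
                    'C': [['c1', 'c2'], []]})
# x1 is exploded into two rows; x2 is kept with B and C set to fill_value
explode(df2, lst_cols=['B', 'C'], fill_value='')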

Answered by cs95

pandas >= 0.25

Assuming the lists in all columns have matching lengths within each row, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to first set as the index all columns that must NOT be exploded, then reset the index afterwards.

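If several columns have to stay un-exploded, the same idea generalizes; a minimal sketch, assuming the first row is representative of which columns actually hold lists:

list_cols = [c for c in df.columns if isinstance(df[c].iloc[0], list)]
plain_cols = [c for c in df.columns if c not in list_cols]
# index every non-list column, explode the rest, then restore the index as columns
df.set_index(plain_cols).apply(pd.Series.explode).reset_index()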

It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', 1))


2.22 ms ± 98.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
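As a side note, pandas 1.3 and later also let DataFrame.explode accept a list of columns directly, provided the lists in each row have matching lengths; a minimal sketch:

# requires pandas >= 1.3; list columns must have equal lengths within each row
df.explode(list('BCDE'), ignore_index=True)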

Answered by Zero

Use set_index on A, then apply and stack the values of the remaining columns. All of this condensed into a one-liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', 1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
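On current pandas the positional axis argument to drop is no longer accepted, so the last step of this one-liner would be written with an explicit keyword; a minimal adjustment:

(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop(columns='level_1'))  # pandas >= 2.0: name the dropped column explicitly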