Python Pandas column of lists: create a row for each list element

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27263805/

Date: 2020-08-19 01:35:09 · Source: igfitidea

Pandas column of lists, create a row for each list element

Tags: python, pandas, list

Asked by Marius

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:


import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
    }
)

df
Out[10]: 
                 samples  subject  trial_num
0    [0.57, -0.83, 1.44]        1          1
1    [-0.01, 1.13, 0.36]        1          2
2   [1.18, -1.46, -0.94]        1          3
3  [-0.08, -4.22, -2.05]        2          1
4     [0.72, 0.79, 0.53]        2          2
5    [0.4, -0.32, -0.13]        2          3

How do I convert to long form, e.g.:


   subject  trial_num  sample  sample_num
0        1          1    0.57           0
1        1          1   -0.83           1
2        1          1    1.44           2
3        1          2   -0.01           0
4        1          2    1.13           1
5        1          2    0.36           2
6        1          3    1.18           0
# etc.

The index is not important, it's OK to set existing columns as the index and the final ordering isn't important.


Accepted answer by MaxU

lst_col = 'samples'

r = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

Result:


In [103]: r
Out[103]:
    samples  subject  trial_num
0      0.10        1          1
1     -0.20        1          1
2      0.05        1          1
3      0.25        1          2
4      1.32        1          2
5     -0.17        1          2
6      0.64        1          3
7     -0.22        1          3
8     -0.71        1          3
9     -0.03        2          1
10    -0.65        2          1
11     0.76        2          1
12     1.77        2          2
13     0.89        2          2
14     0.65        2          2
15    -0.98        2          3
16     0.65        2          3
17    -0.30        2          3

P.S. Here you may find a slightly more generic solution.

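As a rough illustration of what such a generic solution might look like, here is my own sketch (not the linked answer); it assumes every entry in the list column really is a list:

def explode_list_column(frame, lst_col):
    # repeat the scalar columns once per list element, then flatten lst_col;
    # rows whose list is empty simply drop out (their repeat count is 0)
    lens = frame[lst_col].str.len()
    flat = np.concatenate(frame[lst_col].values) if len(frame) else np.array([])
    return pd.DataFrame({
        col: np.repeat(frame[col].values, lens)
        for col in frame.columns.drop(lst_col)
    }).assign(**{lst_col: flat})[frame.columns]

# usage on the OP's frame:
# explode_list_column(df, 'samples')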



UPDATE: some explanations. IMO the easiest way to understand this code is to try to execute it step by step:


In the following line we repeat the values in one column N times, where N is the length of the corresponding list:


In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

This can be generalized to all columns containing scalar values:


In [11]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         )
Out[11]:
    trial_num  subject
0           1        1
1           1        1
2           1        1
3           2        1
4           2        1
5           2        1
6           3        1
..        ...      ...
11          1        2
12          2        2
13          2        2
14          2        2
15          3        2
16          3        2
17          3        2

[18 rows x 2 columns]

Using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:


In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

Putting all this together:


In [13]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
    trial_num  subject  samples
0           1        1    -1.04
1           1        1    -0.58
2           1        1    -1.32
3           2        1     0.82
4           2        1    -0.59
5           2        1    -0.34
6           3        1     0.25
..        ...      ...      ...
11          1        2     0.68
12          2        2     0.55
13          2        2    -0.56
14          2        2     0.65
15          3        2    -0.04
16          3        2     0.36
17          3        2    -0.31

[18 rows x 3 columns]

Using pd.DataFrame()[df.columns] guarantees that we select the columns in the original order...


Answer by Roman Pekar

A bit longer than I expected:


>>> df
                samples  subject  trial_num
0  [-0.07, -2.9, -2.44]        1          1
1   [-1.52, -0.35, 0.1]        1          2
2  [-0.17, 0.57, -0.65]        1          3
3  [-0.82, -1.06, 0.47]        2          1
4   [0.79, 1.35, -0.09]        2          2
5   [1.17, 1.14, -1.79]        2          3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
   subject  trial_num  sample
0        1          1   -0.07
0        1          1   -2.90
0        1          1   -2.44
1        1          2   -1.52
1        1          2   -0.35
1        1          2    0.10
2        1          3   -0.17
2        1          3    0.57
2        1          3   -0.65
3        2          1   -0.82
3        2          1   -1.06
3        2          1    0.47
4        2          2    0.79
4        2          2    1.35
4        2          2   -0.09
5        2          3    1.17
5        2          3    1.14
5        2          3   -1.79

If you want a sequential index, you can apply reset_index(drop=True) to the result.

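For instance, a minimal sketch reusing the s defined above:

>>> df.drop('samples', axis=1).join(s).reset_index(drop=True)

which renumbers the rows 0 through 17 instead of repeating the original index.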

Update:


>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
    subject  trial_num  sample_num  sample
0         1          1           0    1.89
1         1          1           1   -2.92
2         1          1           2    0.34
3         1          2           0    0.85
4         1          2           1    0.24
5         1          2           2    0.72
6         1          3           0   -0.96
7         1          3           1   -2.72
8         1          3           2   -0.11
9         2          1           0   -1.33
10        2          1           1    3.13
11        2          1           2   -0.65
12        2          2           0    0.10
13        2          2           1    0.65
14        2          2           2    0.15
15        2          3           0    0.64
16        2          3           1   -0.10
17        2          3           2   -0.76
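For reference, the same renaming can also be done with rename() instead of assigning to res.columns, assuming the default names that reset_index() produces here ('level_2' for the unnamed stack level and 0 for the values):

>>> res = (df.set_index(['subject', 'trial_num'])['samples']
...           .apply(pd.Series)
...           .stack()
...           .reset_index()
...           .rename(columns={'level_2': 'sample_num', 0: 'sample'}))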

Answer by Marius

Trying to work through Roman Pekar's solution step by step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution, though:


items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index

melted_items = pd.melt(items_as_cols, id_vars='orig_index', 
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)

df.merge(melted_items, left_index=True, right_index=True)

Output (obviously we can drop the original samples column now):


                 samples  subject  trial_num sample_num  sample
0    [1.84, 1.05, -0.66]        1          1          0    1.84
0    [1.84, 1.05, -0.66]        1          1          1    1.05
0    [1.84, 1.05, -0.66]        1          1          2   -0.66
1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
1    [-0.24, -0.9, 0.65]        1          2          2    0.65
2    [1.15, -0.87, -1.1]        1          3          0    1.15
2    [1.15, -0.87, -1.1]        1          3          1   -0.87
2    [1.15, -0.87, -1.1]        1          3          2   -1.10
3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
4    [0.91, -0.47, 1.43]        2          2          0    0.91
4    [0.91, -0.47, 1.43]        2          2          1   -0.47
4    [0.91, -0.47, 1.43]        2          2          2    1.43
5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
5  [-1.14, -0.24, -0.91]        2          3          2   -0.91
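A sketch of that cleanup (my own addition), dropping the list column right after the merge:

result = (df.merge(melted_items, left_index=True, right_index=True)
            .drop('samples', axis=1))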

Answer by behzad.nouri

You can also use pd.concat and pd.melt for this:


>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
   subject  trial_num     0     1     2
0        1          1 -0.49 -1.00  0.44
1        1          2 -0.28  1.48  2.01
2        1          3 -0.52 -1.84  0.02
3        2          1  1.23 -1.36 -1.06
4        2          2  0.54  0.18  0.51
5        2          3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample', 
...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
    subject  trial_num sample_num  sample
0         1          1          0   -0.49
1         1          2          0   -0.28
2         1          3          0   -0.52
3         2          1          0    1.23
4         2          2          0    0.54
5         2          3          0   -2.18
6         1          1          1   -1.00
7         1          2          1    1.48
8         1          3          1   -1.84
9         2          1          1   -1.36
10        2          2          1    0.18
11        2          3          1   -0.13
12        1          1          2    0.44
13        1          2          2    2.01
14        1          3          2    0.02
15        2          1          2   -1.06
16        2          2          2    0.51
17        2          3          2   -1.35

Last, if you need to, you can sort based on the first three columns.

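For example, a sketch of that sort, keeping the melted frame in a variable (long_df is my own name, not from the answer):

>>> long_df = pd.melt(pd.concat(objs, axis=1).drop('samples', axis=1),
...                   var_name='sample_num', value_name='sample',
...                   value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
>>> long_df.sort_values(['subject', 'trial_num', 'sample_num']).reset_index(drop=True)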

Answer by Charles Davis

For those looking for a version of Roman Pekar's answer that avoids manual column naming:


column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
          res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
          res.columns[-1]: '{}_exploded'.format(column_to_explode)})
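Applied to the OP's frame with column_to_explode = 'samples', this should yield the columns subject, trial_num, exploded_samples_index and samples_exploded; a quick sanity check:

print(res.columns.tolist())
# expected: ['subject', 'trial_num', 'exploded_samples_index', 'samples_exploded']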

Answer by Michael Silverstein

I found the easiest way was to:


  1. Convert the samples column into a DataFrame
  2. Join it with the original df
  3. Melt

Shown here:


    df.samples.apply(lambda x: pd.Series(x)).join(df).\
        melt(['subject', 'trial_num'], [0, 1, 2], var_name='sample')

        subject  trial_num sample  value
    0         1          1      0  -0.24
    1         1          2      0   0.14
    2         1          3      0  -0.67
    3         2          1      0  -1.52
    4         2          2      0  -0.00
    5         2          3      0  -1.73
    6         1          1      1  -0.70
    7         1          2      1  -0.70
    8         1          3      1  -0.29
    9         2          1      1  -0.70
    10        2          2      1  -0.72
    11        2          3      1   1.30
    12        1          1      2  -0.55
    13        1          2      2   0.10
    14        1          3      2  -0.44
    15        2          1      2   0.13
    16        2          2      2  -1.44
    17        2          3      2   0.73

It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.

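One hedged way to handle unequal list lengths with this approach (my own sketch, not from the answer) is to let pd.Series pad the short rows with NaN and drop those rows after melting:

wide = df.samples.apply(pd.Series)                     # NaN-padded where lists are shorter
long_df = (wide.join(df.drop(columns='samples'))
               .melt(['subject', 'trial_num'], var_name='sample_num', value_name='sample')
               .dropna(subset=['sample']))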

Answer by Khris

A very late answer, but I want to add this:


A fast solution using vanilla Python that also takes care of the sample_num column in the OP's example. On my own large dataset with over 10 million rows (and a result with 28 million rows) this takes only about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system, which has 128 GB of RAM.


df = df.reset_index(drop=True)
lstcol = df.lstcol.values

# Flatten the list column in plain Python, remembering for each element
# which original row it came from and its position within its list.
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii] * len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])

# Rebuild a long frame indexed by the original row numbers and merge it back.
df = pd.merge(
    df.drop("lstcol", axis=1),
    pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist}, index=indexlist),
    left_index=True, right_index=True).reset_index(drop=True)

Answer by cs95

Pandas >= 0.25


Series and DataFrame define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.


df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan], 
    'var2': [1, 2, 3, 4]
})
df
        var1  var2
0  [a, b, c]     1
1     [d, e]     2
2         []     3
3        NaN     4

df.explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
2  NaN     3  # empty list converted to NaN
3  NaN     4  # NaN entry preserved as-is

# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5  NaN     3
6  NaN     4

Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).


However, you should note that explode only works on a single column (for now).


P.S.: If you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.

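A minimal sketch of that string case, with made-up example data:

s = pd.Series(['a,b,c', 'd,e'])
s.str.split(',').explode()
# 0    a
# 0    b
# 0    c
# 1    d
# 1    e
# dtype: object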

Answer by Tapas

import pandas as pd

df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100, 123, 101, 105, 99, 94, 98]},
                   {'Product': 'Pepsi', 'Prices': [101, 104, 104, 101, 99, 99, 99]}])
print(df)
# 'Prices' already holds lists, so it can be exploded directly
# (str.split(',') is only needed when the column holds comma-separated strings)
df = df.explode('Prices')
print(df)

Try this with pandas >= 0.25.


Answer by Rémy Pétremand

Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >= 0.25: https://stackoverflow.com/a/52511166/10740287


For the example above you may write:


data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
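If you also need the sample_num column from the OP's desired output, a small variant using enumerate should work (my own sketch):

data = [(row.subject, row.trial_num, i, sample)
        for row in df.itertuples()
        for i, sample in enumerate(row.samples)]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'sample_num', 'samples'])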


Speed test:


%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])

1.33 ms ± 74.8 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()

4.9 ms ± 189 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})

1.38 ms ± 25 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
