Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me), citing the original URL. StackOverflow original: http://stackoverflow.com/questions/28442358/

Time: 2020-09-13 22:56:08  Source: igfitidea

Splitting a List inside a Pandas DataFrame

python, csv, pandas

Asked by Eric D. Brown

I have a csv file that contains a number of columns. Using pandas, I read this csv file into a dataframe and have a datetime index and five or six other columns.

One of the columns is a list of timestamps (example below, with index):

CreateDate     TimeStamps
4/1/11         [Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')]
4/2/11         [Timestamp('2014-01-31 00:00:00')]
6/8/11         [Timestamp('2012-08-31 00:00:00'), Timestamp('2012-09-30 00:00:00'), Timestamp('2012-11-07 00:00:00'), Timestamp('2013-01-10 00:00:00'), Timestamp('2013-07-25 00:00:00')]

What I'd like to do is convert the timestamp column into separate rows for each timestamp listed. For example, for row 1 it would convert to 4 rows and row 2 would convert to 1 row. I realize I'd need to reset the index to be able to do this, which is fine.

Everything I've tried ends up out in left field (taking the values and creating a list outside of pandas, etc.).

Any suggestions appreciated.

Answered by cwharland

If you want to stay in pure pandas you can throw in a tricky groupby and apply, which ends up boiling down to a one-liner if you don't count the column rename.

In [1]: import pandas as pd

In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}

In [3]: df = pd.DataFrame(d)

In [4]: df.head()
Out[4]: 
     date                                                 ts
0  4/1/11  [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1  4/2/11                              [2014-01-31 00:00:00]

In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis = 1)

In [6]: df_new.columns = ['date','ts']

In [7]: df_new.head()
Out[7]: 
     date         ts
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31

Since the goal is to take the value of a column (in this case date) and repeat it for every row you intend to create from the list, it's useful to think in terms of pandas indexing.

We want the date to become the single index for the new rows, so we use groupby, which puts the desired row value into an index. Then inside that operation I want to split only the list for this date, which is what apply will do for us.

I'm passing apply a pandas Series which consists of a single list, but I can access that list via .values[0], which turns the Series into an array with a single entry and takes that entry.

To turn the list into a set of rows that will be passed back to the indexed date I can just make it a DataFrame. This incurs the penalty of picking up an extra index but we end up dropping that. We could make this an index itself but that would preclude dupe values.

Once this is passed back out I have a multi-index but I can force this into the row format we desire by reset_index. Then we simply drop the unwanted index.
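
To make the intermediate shapes concrete, here is a minimal sketch that breaks the one-liner into steps. The variable names nested and flat are made up for illustration, and the sample data is shortened from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# Step 1: for each date, turn the single list in the group into a small DataFrame.
# The result carries a two-level index: the date, plus the position inside the list.
nested = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0]))

# Step 2: move both index levels back into columns; the unnamed inner level
# shows up as 'level_1', which we drop.
flat = nested.reset_index().drop('level_1', axis=1)

# Step 3: rename the remaining columns.
flat.columns = ['date', 'ts']
print(flat)
```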

It sounds involved, but really we're just leveraging the natural behaviors of pandas functions to avoid explicitly iterating or looping.

Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here.

Optionally if you want it to be robust to multiple dates each with a nested list:

df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))

at which point the one-liner is getting dense and you should probably throw it into a function.
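
One possible way to wrap the flattening version in a function, as a rough sketch only. The helper name split_list_column and its parameters are hypothetical and not part of pandas; it assumes every group contains at least one list item.

```python
import pandas as pd

def split_list_column(df, key, list_col):
    """Hypothetical helper: one output row per item in df[list_col], grouped by key."""
    def flatten(group):
        # group is a Series of lists for one key; flatten them all into one column.
        return pd.DataFrame([item for sublist in group.values for item in sublist])

    out = df.groupby(key)[list_col].apply(flatten).reset_index()
    out = out.drop('level_1', axis=1)  # position inside the flattened list
    out.columns = [key, list_col]
    return out

# Usage: the same date appears twice, each time with its own list.
df = pd.DataFrame({
    'date': ['4/1/11', '4/1/11', '4/2/11'],
    'ts': [[1, 2], [3], [4]],
})
result = split_list_column(df, 'date', 'ts')
print(result)
```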

Answered by ari

The way I did it was to split the list into separate columns, and then melt it to put each timestamp in a separate row.

In [48]: df = pd.DataFrame([[1,2,[1,2,4]],[4,5,[1,3]],],columns=['a','b','TimeStamp'])
    ...: df
Out[48]: 
   a  b  TimeStamp
0  1  2  [1, 2, 4]
1  4  5     [1, 3]

You can convert the column to a list and then back to a DataFrame to split it into columns:

In [53]: TScolumns = pd.DataFrame(df.TimeStamp.tolist(), )
    ...: TScolumns
Out[53]: 
   0  1   2
0  1  2   4
1  1  3 NaN

And then splice it onto the original dataframe:

In [90]: df = df.drop('TimeStamp',axis=1)
In [58]: split = pd.concat([df, TScolumns], axis=1)
    ...: split
Out[58]: 
   a  b  0  1   2
0  1  2  1  2   4
1  4  5  1  3 NaN

Finally, use melt to get it into the shape you want:

In [89]: pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')
Out[89]: 
   a  b variable  TimeStamp
0  1  2        0          1
1  4  5        0          1
2  1  2        1          2
3  4  5        1          3
4  1  2        2          4
5  4  5        2        NaN
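
The melted frame still carries a NaN row for the shorter list, plus the variable column. A possible cleanup is sketched below, under the assumption that NaN never occurs as a genuine value inside the lists; the names ts_columns and tidy are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, [1, 2, 4]], [4, 5, [1, 3]]],
                  columns=['a', 'b', 'TimeStamp'])

# Same steps as above: lists -> columns -> concat -> melt.
ts_columns = pd.DataFrame(df.TimeStamp.tolist())
split = pd.concat([df.drop('TimeStamp', axis=1), ts_columns], axis=1)
melted = pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')

# Drop the padding rows, restore the original row order, and
# discard the helper column that melt created.
tidy = (melted.dropna(subset=['TimeStamp'])
              .sort_values(['a', 'b', 'variable'])
              .drop('variable', axis=1)
              .reset_index(drop=True))
print(tidy)
```

Note that the NaN padding forces the TimeStamp column to a float dtype in this integer example.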

Answered by nick_eu

This doesn't feel very pythonic, but it works (provided your CreateDate is unique!).

apply will only return more rows than it receives when used with a groupby, so we're going to use groupby artificially (i.e. group by a column of unique values, so each group is one row).

def splitRows(x):

    # Extract the actual list of time-stamps. 
    theList = x.TimeStamps.iloc[0]

    # Each row will be a dictionary in this list.
    listOfNewRows = list()

    # Iterate over items in list of timestamps, 
    # putting each one in a dictionary to later convert to a row, 
    # then adding the dictionary to a list. 

    for i in theList:
        newRow = dict()
        newRow['CreateDate'] = x.CreateDate.iloc[0]
        newRow['TimeStamps'] = i
        listOfNewRows.append(newRow)

    # Now convert these dictionaries into rows in a new dataframe and return it. 
    return pd.DataFrame(listOfNewRows)


df.groupby('CreateDate', as_index = False, group_keys = False).apply(splitRows)

Followup: If CreateDate is NOT unique, you can just reset the index to a new column and groupby that.
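
A rough sketch of that followup, assuming the default integer index so that reset_index() adds a column named 'index'. The compact split_rows below is an illustrative stand-in for splitRows above, not the original function.

```python
import pandas as pd

# Two rows share the same CreateDate, so grouping by CreateDate would merge them.
df = pd.DataFrame({
    'CreateDate': ['4/1/11', '4/1/11'],
    'TimeStamps': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')],
        [pd.Timestamp('2014-01-31')],
    ],
})

def split_rows(x):
    # x is a one-row group; the scalar CreateDate is broadcast over the list.
    return pd.DataFrame({
        'CreateDate': x.CreateDate.iloc[0],
        'TimeStamps': x.TimeStamps.iloc[0],
    })

# Move the unique row index into a column and group by it instead.
with_id = df.reset_index()
result = (with_id.groupby('index', group_keys=False)[['CreateDate', 'TimeStamps']]
                 .apply(split_rows)
                 .reset_index(drop=True))
print(result)
```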

Answered by Vor

Probably not the best way from a performance perspective, but still, you can leverage the itertools package:

from pandas import DataFrame, Timestamp
import itertools

d = {'date': ['4/1/11', '4/2/11'], 'ts': [[Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')], [Timestamp('2014-01-31 00:00:00')]]}
df = DataFrame(d)

res = df.to_dict()
data = []
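for x in res['date'].keys():
  # zip_longest pads the one-element date list with the date itself, pairing it
  # with every timestamp (in Python 2 this function is itertools.izip_longest).
  data.append(itertools.zip_longest([res['date'][x]], res['ts'][x], fillvalue=res['date'][x]))

new_data = list(itertools.chain.from_iterable(data))
df2 = DataFrame(new_data, columns=['date', 'timestamp'])
print(df2)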

Will print:

     date  timestamp
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31
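
None of the answers above use it, but for comparison: pandas 0.25 and later ship DataFrame.explode, which performs this kind of split directly. A minimal sketch, assuming such a pandas version is available:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31'),
         pd.Timestamp('2012-04-25'), pd.Timestamp('2012-06-30')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# explode repeats the other columns for every list item; the original index
# labels are repeated too, so reset them to get a clean 0..n-1 index.
exploded = df.explode('ts').reset_index(drop=True)
print(exploded)
```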