Split nested array values from a Pandas DataFrame cell over multiple rows

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38372016/

Time: 2020-09-14 01:35:07  Source: igfitidea

Split nested array values from Pandas Dataframe cell over multiple rows

python numpy pandas dataframe

Asked by Philip O'Brien

I have a Pandas DataFrame of the following form

(image of the DataFrame; not reproduced here)

There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values, one for each day of that year. For the frame above:

  • frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
  • frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.
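Since the image is not available, the shape described above can be approximated with a small stand-in. This is only a sketch: the IDs, years, and values are made up, and only the column names and the name frame3 come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the frame in the question: one row per ID per
# year, each weather cell holding one value per day (365 made-up values).
days = 365
frame3 = pd.DataFrame({
    'ID': [11111, 11111],
    'Year': [2011, 2013],
    'Max Temp': [[20 + 0.01 * d for d in range(days)],
                 [21 + 0.01 * d for d in range(days)]],
    'Min Temp': [[5 + 0.01 * d for d in range(days)],
                 [6 + 0.01 * d for d in range(days)]],
    'Rain': [[0.0] * days, [0.0] * days],
})

# Index 0 is January 1st and index 364 is December 31st of a non-leap year.
first_day = frame3.iloc[0]['Max Temp'][0]
last_day = frame3.iloc[0]['Max Temp'][364]
```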

I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).

I want to split these nested arrays, so that instead of one row per ID per year, I have one row per ID per day. While splitting the array, however, I would also like to create a new column to capture the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.

(image not reproduced here)

I searched here for relevant answers, but only found this one, which doesn't really help me.

Answered by ptrj

You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a series

import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate result, i.e. the frame before stack, had a NaN value that stack then dropped.
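To make the intermediate step visible, the sketch below prints the wide frame before stacking. It reuses the series s from above. Some newer pandas releases change how stack treats missing values, so the sketch calls dropna() explicitly; that extra call is an addition here, not part of the answer.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One column per array position. The shorter 2011 row is padded with NaN,
# which is why the integers are upcast to float.
wide = s.apply(pd.Series)
print(wide)

# stack() moves the columns into an inner index level; dropna() removes the
# padding in case this pandas version's stack() keeps it.
long = wide.stack().dropna()
print(long)
```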

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0

The rest (the datetime index) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')

# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])

result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
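The steps above can be bundled into one function. The sketch below is only a convenience wrapper: the name unnest_daily and its parameters are hypothetical. It also adds two things that are not in the answer: an explicit format='%Y' when parsing the years, and a dropna() after stack to guard against pandas versions whose stack keeps the NaN padding.

```python
import pandas as pd


def unnest_daily(df, id_col='ID', year_col='Year'):
    """Hypothetical helper: return one row per ID per day, indexed by (ID, Date)."""
    indexed = df.set_index([id_col, year_col])

    # Unnest every array column, then line the pieces up side by side.
    parts = [indexed[col].apply(pd.Series).stack().dropna()
             for col in indexed.columns]
    result = pd.concat(parts, axis=1, keys=indexed.columns)

    # January 1st of each year plus the 0-based array position gives the date.
    years = pd.to_datetime(result.index.get_level_values(1).astype(str),
                           format='%Y')
    days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
    dates = (years + days).rename('Date')

    result.index = pd.MultiIndex.from_arrays(
        [result.index.get_level_values(0), dates])
    return result


df = pd.DataFrame({'ID': [11111, 11111],
                   'Year': [2011, 2012],
                   'A': [[0, 1, 2], [3, 4, 5, 6]],
                   'B': [[20, 21, 22], [23, 24, 25, 26]]})
daily = unnest_daily(df)
print(daily)
```

With real data, a 366-element array in a leap year maps its last element to December 31st, because the offsets are added to January 1st of that year.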