Split nested array values from Pandas Dataframe cell over multiple rows

Question

Asked by Philip O'Brien

I have a Pandas DataFrame of the following form

There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values, one for each day in that year, i.e. for the frame above:

frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.

I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).

I want to split these nested arrays, so that instead of one row per ID per year, I have one row per ID per day. While splitting the array, however, I would also like to create a new column to capture the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.

I searched here for relevant answers, but only found this one, which doesn't really help me.

Answer 1

Answered by ptrj

You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a series

import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

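The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate frame, i.e. the result of .apply(pd.Series) before stack, had a NaN value that was later dropped by stack. That NaN is also why the values come out as float64 rather than integers.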
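
To make the intermediate step visible, here is a minimal sketch that reuses the series above and prints the wide frame before stacking. The names `wide` and `long` are only illustrative. The explicit `dropna()` is an assumption added for safety: the pandas version used in this answer drops the NaN padding inside `stack()` already, but other pandas releases may keep it.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# Intermediate frame: one column per array position.
# The shorter list (2011) is padded with NaN in column 2.
wide = s.apply(pd.Series)
print(wide)

# stack() moves the column labels into an inner index level.
# dropna() removes the padding in case stack() did not already do so.
long = wide.stack().dropna()
print(long)
```

The inner index level of `long` is the position inside the original list, which is what later becomes the day offset.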

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
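
As a possible alternative (not the method used in this answer), newer pandas versions offer `DataFrame.explode`. The sketch below assumes pandas 1.3 or later, where `explode` accepts a list of columns, and assumes that the lists in `A` and `B` have the same length within each row. The column name `Day` and the variable names `exploded` and `result_alt` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9], [10, 11, 12, 13]],
                   'B': [[20, 21, 22], [23, 24, 25, 26],
                         [27, 28, 29], [30, 31, 32, 33]]})

# Explode both list columns in lockstep: one output row per list element.
exploded = df.explode(['A', 'B'], ignore_index=True)

# Zero-based position of each element within its original (ID, Year) row.
exploded['Day'] = exploded.groupby(['ID', 'Year']).cumcount()

# explode leaves object dtype behind, so cast back to integers explicitly.
result_alt = (exploded.set_index(['ID', 'Year', 'Day'])
                      .astype({'A': 'int64', 'B': 'int64'}))
print(result_alt)
```

Unlike the stack-based approach, this keeps the values as integers because no NaN padding is ever created.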

The rest (datetime index) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')

# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])

result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
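
The same year-plus-offset arithmetic can be checked against full-length years. The following is a small sketch with hypothetical (Year, Day) pairs, not data from the question: it takes the first and last valid offsets of a regular year and a leap year and confirms where they land. The explicit `format='%Y'` is a choice made here to keep the parsing unambiguous.

```python
import pandas as pd

# Hypothetical index: first and last day offsets for 2011 (365 days)
# and 2012 (366 days, leap year).
idx = pd.MultiIndex.from_tuples(
    [(2011, 0), (2011, 364), (2012, 0), (2012, 365)],
    names=['Year', 'Day'])

# January 1st of each year as a DatetimeIndex
years = pd.to_datetime(idx.get_level_values('Year').astype(str), format='%Y')
# Zero-based day offsets as a TimedeltaIndex
days = pd.to_timedelta(idx.get_level_values('Day'), unit='D')

dates = years + days
print(dates)
```

Offset 364 maps to December 31st in 2011, and offset 365 maps to December 31st in 2012, so arrays of 365 and 366 elements both end on the last day of their year.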
