Python: Constructing a 3D Pandas DataFrame

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24290495/

Time: 2020-08-19 04:22:53  Source: igfitidea

Constructing 3D Pandas DataFrame

python pandas

Asked by tlnagy

I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this:

A               B               C
start    end    start    end    start    end ...
7        20     42       52     90       101
11       21                     213      34
56       74                     9        45
45       12

Where A, B, etc. are the top-level descriptors and start and end are subdescriptors. The numbers that follow are in pairs, and there aren't the same number of pairs for A, B, etc. Observe that A has four such pairs, B has only 1, and C has 3.

I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the desired output:

import numpy as np
import pandas as pd

A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
B = np.array(['start', 'end']*3)
C = [np.random.randint(10, 99, 6)]*6
df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)
df

yielded:

                C
 A          B   
 one        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 two        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 three      start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]

Is there any way of breaking up the lists in C into their own columns?

EDIT: The structure of my C is important. It looks like the following:

 C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]

And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A, B, C are the different sequences). Depending on the sequence itself, there are a differing number of subsequences that satisfy a given condition I'm looking for. As a result, there are a differing number of start:end pairs for A, B, etc.

Accepted answer by chrisb

First, I think you need to pad the sublists of C with NaN so that they all have the same length and the missing values are represented explicitly:

In [341]: max_len = max(len(sublist) for sublist in C)
In [344]: for sublist in C:
     ...:     sublist.extend([np.nan] * (max_len - len(sublist)))

In [345]: C
Out[345]: 
[[7, 11, 56, 45],
 [20, 21, 74, 12],
 [42, nan, nan, nan],
 [52, nan, nan, nan],
 [90, 213, 9, nan],
 [101, 34, 45, nan]]

Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.

In [288]: C = np.array(C)
In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(list(zip(A, B))))

In [349]: df
Out[349]: 
     one         two       three     
   start  end  start  end  start  end
0      7   20     42   52     90  101
1     11   21    NaN  NaN    213   34
2     56   74    NaN  NaN      9   45
3     45   12    NaN  NaN    NaN  NaN
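
The two steps above can also be combined in a standalone script. The following is only a sketch of one alternative, not the answer's own method: it swaps the in-place `extend` loop for `itertools.zip_longest`, which pads and transposes in one pass, and it assumes the column labels `A`, `B`, `C` from the question's desired layout rather than `one`, `two`, `three`.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Ragged input from the question: alternating start/end lists for A, B and C
C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]

# zip_longest transposes the lists into rows and pads short ones with NaN
rows = list(zip_longest(*C, fillvalue=np.nan))

# Two-level columns: (A, start), (A, end), (B, start), ... in the same order as C
columns = pd.MultiIndex.from_product([["A", "B", "C"], ["start", "end"]])
df = pd.DataFrame(rows, columns=columns)

print(df)
print(df["C"].dropna())  # only the real start/end pairs for C
```

Selecting a top-level label such as `df["C"]` returns an ordinary two-column frame, so `dropna()` recovers just the pairs that exist for that sequence.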

Answer by user3684792

Can't you just use a panel?

import numpy as np
import pandas as pd

A = ['one', 'two' ,'three']
B = ['start','end']
C = [np.random.randint(10, 99, 2)]*6
df = pd.DataFrame(C,columns=B  )
p={}
for a in A:
    p[a]=df
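# Note: pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this needs an older pandas
panel = pd.Panel(p)
print(panel['one'])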
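
On pandas versions without `pd.Panel`, a similar "dict of 2D frames" layout can be approximated with `pd.concat`. This is a hedged sketch rather than a drop-in replacement for the answer: the names `frames` and `wide` are made up for illustration, and the data is random placeholder data as in the snippet above.

```python
import numpy as np
import pandas as pd

A = ['one', 'two', 'three']
B = ['start', 'end']

# One 2D frame per top-level label (random placeholder data)
rng = np.random.default_rng(0)
frames = {a: pd.DataFrame(rng.integers(10, 99, size=(6, 2)), columns=B) for a in A}

# Concatenating a dict along axis=1 uses the dict keys as the outer column level
wide = pd.concat(frames, axis=1)

print(wide['one'])  # plays the role of panel['one']
```

The result is a single DataFrame with two-level columns, so `wide['one']` returns the same kind of 2D slice that `panel['one']` did.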

Answer by scottclowe

As @Aaron mentioned in a comment above, panels have been deprecated. Also, @tlnagy mentioned his dataset would be likely to expand to more than 3 dimensions in the future.

This sounds like a good use-case for the xarray package, which provides semantically labelled arrays of arbitrarily many dimensions. Pandas and xarray have strong conversion support, and panels have been deprecated in favour of using xarray.

Initial setup of the problem.

import numpy as np

A = np.array([[7,11,56,45], [20,21,74,12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90,213,9], [101, 34, 45]]).T

You can then create a three-dimensional xarray.DataArray object like so:

import xarray

output_as_dataarray = xarray.concat(
    [xarray.DataArray(X, 
                      dims=['record', 'edge'],
                      coords={'record': range(X.shape[0]),
                              'edge': ['start', 'end']},
                     ) for X in (A, B, C)],
    dim='descriptor',
).assign_coords(descriptor=['A', 'B', 'C'])

We turn our three 2D numpy arrays into xarray.DataArray objects, and then concatenate them together along a new dimension.

Our output looks like so:

<xarray.DataArray (descriptor: 3, record: 4, edge: 2)>
array([[[  7.,  20.],
        [ 11.,  21.],
        [ 56.,  74.],
        [ 45.,  12.]],

       [[ 42.,  52.],
        [ nan,  nan],
        [ nan,  nan],
        [ nan,  nan]],

       [[ 90., 101.],
        [213.,  34.],
        [  9.,  45.],
        [ nan,  nan]]])
Coordinates:
  * record      (record) int64 0 1 2 3
  * edge        (edge) <U5 'start' 'end'
  * descriptor  (descriptor) <U1 'A' 'B' 'C'
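
As a rough illustration of the label-based access and pandas conversion mentioned above, the sketch below rebuilds the same array by hand (so that it runs on its own, assuming xarray is installed), selects the pairs for one descriptor, and converts the result back to a wide pandas DataFrame. The variable names `arr`, `c_pairs` and `wide` are illustrative only.

```python
import numpy as np
import xarray as xr

nan = np.nan
# Same values as the DataArray printed above, with the NaN padding written out
data = np.array([
    [[7, 20], [11, 21], [56, 74], [45, 12]],
    [[42, 52], [nan, nan], [nan, nan], [nan, nan]],
    [[90, 101], [213, 34], [9, 45], [nan, nan]],
])
arr = xr.DataArray(
    data,
    dims=["descriptor", "record", "edge"],
    coords={
        "descriptor": ["A", "B", "C"],
        "record": np.arange(4),
        "edge": ["start", "end"],
    },
)

# Label-based selection: the start/end pairs of C, with padding rows dropped
c_pairs = arr.sel(descriptor="C").dropna("record")
print(c_pairs.values)

# Back to pandas: a Series with a 3-level MultiIndex, unstacked into wide form
wide = arr.to_series().unstack(["descriptor", "edge"])
print(wide)
```

The `wide` frame has `record` as its index and (descriptor, edge) pairs as two-level columns, which is the layout the question asked for.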