Python: Constructing a 3D Pandas DataFrame

Question

Asked by tlnagy

I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this:

A               B               C
start    end    start    end    start    end ...
7        20     42       52     90       101
11       21                     213      34
56       74                     9        45
45       12

Here A, B, etc. are the top-level descriptors, and start and end are subdescriptors. The numbers that follow come in pairs, and A, B, etc. do not all have the same number of pairs. Observe that A has four such pairs, B has only 1, and C has 3.

I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the desired output:

import numpy as np
import pandas as pd

A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
B = np.array(['start', 'end']*3)
C = [np.random.randint(10, 99, 6)]*6
df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)
df

yielded:

                C
 A          B   
 one        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 two        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 three      start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]

Is there any way of breaking up the lists in C into their own columns?

EDIT: The structure of my C is important. It looks like the following:

 C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]

And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A, B, C are the different sequences). Depending on the sequence itself, a differing number of subsequences satisfy the condition I'm looking for. As a result, there are a differing number of start:end pairs for A, B, etc.

Answer 1

Accepted answer by chrisb

First, I think you need to pad C so that missing values are represented explicitly:

In [341]: max_len = max(len(sublist) for sublist in C)
In [344]: for sublist in C:
     ...:     sublist.extend([np.nan] * (max_len - len(sublist)))

In [345]: C
Out[345]: 
[[7, 11, 56, 45],
 [20, 21, 74, 12],
 [42, nan, nan, nan],
 [52, nan, nan, nan],
 [90, 213, 9, nan],
 [101, 34, 45, nan]]
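
The padding step can also be done without modifying C in place. The following is only a sketch of an alternative, not the answerer's method: it assumes the standard-library itertools.zip_longest is acceptable here, and the names rows and padded are made up for illustration.

```python
from itertools import zip_longest

import numpy as np

C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]

# zip_longest walks all sublists in parallel and fills the shorter ones
# with NaN, so each tuple it yields is already one row of the final table
# (i.e. the result is padded and transposed in one step).
rows = list(zip_longest(*C, fillvalue=np.nan))

# 4 rows (the longest sublist) by 6 columns (one per sublist); C is untouched.
padded = np.array(rows, dtype=float)
```

Because the result is already transposed, it could be passed to the DataFrame constructor directly instead of C.T.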

Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.

In [288]: C = np.array(C)
In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(list(zip(A, B))))

In [349]: df
Out[349]: 
     one         two       three     
   start  end  start  end  start  end
0      7   20     42   52     90  101
1     11   21    NaN  NaN    213   34
2     56   74    NaN  NaN      9   45
3     45   12    NaN  NaN    NaN  NaN
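
For reference, the two steps above can be put together as one script. This is a minimal sketch under some assumptions: the top-level labels are taken to be 'A', 'B', 'C' as in the question's desired output (the transcript above uses 'one', 'two', 'three'), and the variable names names, padded, a_pairs, starts and b_pairs are illustrative.

```python
import numpy as np
import pandas as pd

names = ['A', 'B', 'C']  # assumed top-level descriptors
C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]

# Pad every sublist with NaN up to the length of the longest one.
max_len = max(len(sublist) for sublist in C)
padded = [sublist + [np.nan] * (max_len - len(sublist)) for sublist in C]

# Two-level column index: (A, start), (A, end), (B, start), ...
columns = pd.MultiIndex.from_product([names, ['start', 'end']])
df = pd.DataFrame(np.array(padded).T, columns=columns)

# Selecting on the top level returns the start/end columns of one sequence.
a_pairs = df['A']

# A cross-section on the second level returns every 'start' column.
starts = df.xs('start', axis=1, level=1)

# Dropping NaN rows recovers only the pairs that actually exist for B.
b_pairs = df['B'].dropna()
```

Selecting with df['A'] or df.xs(...) is what makes the two-level column index behave like a third dimension.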

Answer 2

Answered by user3684792

Can't you just use a panel?

import numpy as np
import pandas as pd

A = ['one', 'two' ,'three']
B = ['start','end']
C = [np.random.randint(10, 99, 2)]*6
df = pd.DataFrame(C,columns=B  )
p={}
for a in A:
    p[a]=df
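# pd.Panel was deprecated in pandas 0.20 and removed in 0.25,
# so this only runs on older pandas versions.
panel = pd.Panel(p)
print(panel['one'])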
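
On pandas versions without Panel, a similar dict of DataFrames can be combined with pd.concat instead. This is a sketch of a substitute technique, not what this answer proposed; the frames dict and its values (taken from the question's example) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-sequence frames, one per top-level descriptor.
frames = {
    'A': pd.DataFrame({'start': [7, 11, 56, 45], 'end': [20, 21, 74, 12]}),
    'B': pd.DataFrame({'start': [42], 'end': [52]}),
    'C': pd.DataFrame({'start': [90, 213, 9], 'end': [101, 34, 45]}),
}

# Concatenating a dict along axis=1 uses its keys as the outer column level.
# The frames are aligned on their row index, and rows that a shorter frame
# lacks are filled with NaN.
wide = pd.concat(frames, axis=1)

# Indexing by key works much like panel['one'] did.
print(wide['A'])
```

The frames do not need to have the same number of rows, which matches the unequal number of pairs in the question.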

Answer 3

Answered by scottclowe

As @Aaron mentioned in a comment above, panels have been deprecated. Also, @tlnagy mentioned his dataset would be likely to expand to more than 3 dimensions in the future.

This sounds like a good use-case for the xarray package, which provides semantically labelled arrays of arbitrarily many dimensions. Pandas and xarray have strong conversion support, and panels have been deprecated in favour of using xarray.

Initial setup of the problem:

import numpy as np

A = np.array([[7,11,56,45], [20,21,74,12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90,213,9], [101, 34, 45]]).T

You can then create a three dimensional xarray.DataArray object like so:

import xarray

output_as_dataarray = xarray.concat(
    [xarray.DataArray(X, 
                      dims=['record', 'edge'],
                      coords={'record': range(X.shape[0]),
                              'edge': ['start', 'end']},
                     ) for X in (A, B, C)],
    dim='descriptor',
).assign_coords(descriptor=['A', 'B', 'C'])

We turn our three 2D numpy arrays into xarray.DataArray objects, and then concatenate them together along a new dimension.

Our output looks like so:

<xarray.DataArray (descriptor: 3, record: 4, edge: 2)>
array([[[  7.,  20.],
        [ 11.,  21.],
        [ 56.,  74.],
        [ 45.,  12.]],

       [[ 42.,  52.],
        [ nan,  nan],
        [ nan,  nan],
        [ nan,  nan]],

       [[ 90., 101.],
        [213.,  34.],
        [  9.,  45.],
        [ nan,  nan]]])
Coordinates:
  * record      (record) int64 0 1 2 3
  * edge        (edge) <U5 'start' 'end'
  * descriptor  (descriptor) <U1 'A' 'B' 'C'
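
As a rough illustration of the conversion support mentioned above, the DataArray can be queried by label and turned back into a pandas DataFrame shaped like the one in the question. This is a sketch, not part of the original answer: the names da, c_starts and df are illustrative, and join='outer' is passed explicitly on the assumption that NaN-filling of the shorter arrays is the behaviour wanted.

```python
import numpy as np
import xarray

A = np.array([[7, 11, 56, 45], [20, 21, 74, 12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90, 213, 9], [101, 34, 45]]).T

# Same construction as above; join='outer' keeps every record label and
# fills the records missing from the shorter arrays with NaN.
da = xarray.concat(
    [xarray.DataArray(X,
                      dims=['record', 'edge'],
                      coords={'record': range(X.shape[0]),
                              'edge': ['start', 'end']})
     for X in (A, B, C)],
    dim='descriptor',
    join='outer',
).assign_coords(descriptor=['A', 'B', 'C'])

# Label-based selection: all start points of sequence C.
c_starts = da.sel(descriptor='C', edge='start')

# Flatten to a pandas Series indexed by (descriptor, record, edge), then
# move descriptor and edge into the columns to get one row per record.
df = da.to_series().unstack(['descriptor', 'edge'])
```

The resulting df has two-level columns such as ('A', 'start'), so it can be used like the DataFrame in the accepted answer, while da remains available if more dimensions are added later.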
