Python: Constructing a 3D Pandas DataFrame
Original source: http://stackoverflow.com/questions/24290495/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Constructing 3D Pandas DataFrame
Asked by tlnagy
I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this:
A             B             C
start  end    start  end    start  end   ...
7      20     42     52     90     101
11     21                   213    34
56     74                   9      45
45     12
Where A, B, etc. are the top-level descriptors and start and end are subdescriptors. The numbers that follow come in pairs, and there aren't the same number of pairs for A, B, etc. Observe that A has four such pairs, B has only one, and C has three.
I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the desired output:
import numpy as np
import pandas as pd
A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
B = np.array(['start', 'end']*3)
C = [np.random.randint(10, 99, 6)]*6
df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)
df
yielded:
                                    C
A     B
one   start  [22, 19, 16, 20, 63, 54]
      end    [22, 19, 16, 20, 63, 54]
two   start  [22, 19, 16, 20, 63, 54]
      end    [22, 19, 16, 20, 63, 54]
three start  [22, 19, 16, 20, 63, 54]
      end    [22, 19, 16, 20, 63, 54]
Is there any way of breaking up the lists in C into their own columns?
EDIT: The structure of my C is important. It looks like the following:
C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]
And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A, B, C are the different sequences). Depending on the sequence itself, a different number of subsequences satisfy the condition I'm looking for. As a result, there are a differing number of start:end pairs for A, B, etc.
Accepted answer by chrisb
First, I think you need to pad C so that missing values are represented explicitly and every sublist has the same length:
In [341]: max_len = max(len(sublist) for sublist in C)
In [344]: for sublist in C:
     ...:     sublist.extend([np.nan] * (max_len - len(sublist)))
In [345]: C
Out[345]:
[[7, 11, 56, 45],
[20, 21, 74, 12],
[42, nan, nan, nan],
[52, nan, nan, nan],
[90, 213, 9, nan],
[101, 34, 45, nan]]
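Note that the loop above extends the sublists of C in place. If the original ragged lists are still needed afterwards, a non-mutating variant might look like the sketch below. The helper name pad_ragged is hypothetical and is not part of numpy or pandas.

```python
import numpy as np


def pad_ragged(rows, fill=np.nan):
    """Return new lists padded with `fill` to the length of the longest row.

    `pad_ragged` is a hypothetical helper name used for illustration only.
    """
    max_len = max(len(row) for row in rows)
    return [list(row) + [fill] * (max_len - len(row)) for row in rows]


C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]
padded = pad_ragged(C)

# The original sublists are untouched; only the copies are padded.
print(len(C[2]), len(padded[2]))

# NaN is a float, so the resulting array has a floating-point dtype.
arr = np.array(padded)
print(arr.shape, arr.dtype)
```

Because NaN forces a float dtype, the integer start and end positions come back as floats once the padded data is turned into an array.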
Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.
In [288]: C = np.array(C)
In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(list(zip(A, B))))
In [349]: df
Out[349]:
     one        two        three
   start  end  start  end  start  end
0      7   20     42   52     90  101
1     11   21    NaN  NaN    213   34
2     56   74    NaN  NaN      9   45
3     45   12    NaN  NaN    NaN  NaN
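For reference, the two steps above can be sketched as one self-contained script. This is a variant, not the answer's exact code: it uses itertools.zip_longest to pad and transpose in a single pass, and it labels the columns A, B, C as in the question's desired output. It assumes the sublists of C are ordered (A, start), (A, end), (B, start), and so on.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

C = [[7, 11, 56, 45], [20, 21, 74, 12], [42], [52], [90, 213, 9], [101, 34, 45]]

# zip_longest pads the shorter sublists with NaN and transposes in one pass:
# each tuple it yields is one row of the final table.
rows = list(zip_longest(*C, fillvalue=np.nan))

# Assumption: the sublists of C are ordered (A, start), (A, end), (B, start), ...
columns = pd.MultiIndex.from_product([["A", "B", "C"], ["start", "end"]])

df = pd.DataFrame(rows, columns=columns)

# Selecting a top-level label returns an ordinary 2D frame for that sequence.
print(df["B"].dropna())

# A cross-section pulls one sub-label across all sequences.
print(df.xs("start", axis=1, level=1))
```

Selecting df["B"] and dropping the NaN rows recovers the single start:end pair for B, so the padding does not have to leak into later processing.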
Answer by user3684792
Can't you just use a panel?
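import numpy as np
import pandas as pd

# Note: pd.Panel was deprecated in pandas 0.20 and removed in 0.25,
# so this only runs on older pandas versions.
A = ['one', 'two', 'three']
B = ['start', 'end']
C = [np.random.randint(10, 99, 2)] * 6
df = pd.DataFrame(C, columns=B)
p = {}
for a in A:
    p[a] = df
panel = pd.Panel(p)
print(panel['one'])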
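On pandas versions where pd.Panel is no longer available, a similar "dict of 2D frames" idea can be sketched with pd.concat. This is not the answer's method, only a possible substitute, and it uses the values from the question instead of random data.

```python
import pandas as pd

# One small frame per sequence; the frames need not have the same number of rows.
frames = {
    "A": pd.DataFrame({"start": [7, 11, 56, 45], "end": [20, 21, 74, 12]}),
    "B": pd.DataFrame({"start": [42], "end": [52]}),
    "C": pd.DataFrame({"start": [90, 213, 9], "end": [101, 34, 45]}),
}

# Concatenating a dict along axis=1 uses the dict keys as the outer column level
# and fills rows missing from the shorter frames with NaN.
wide = pd.concat(frames, axis=1)

# Similar to indexing a panel by item: returns the 2D frame for one sequence.
print(wide["A"])
```

The result is a regular DataFrame with two-level columns, so it handles an unequal number of pairs per sequence without manual padding.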
Answer by scottclowe
As @Aaron mentioned in a comment above, panels have been deprecated. Also, @tlnagy mentioned his dataset would be likely to expand to more than 3 dimensions in the future.
This sounds like a good use-case for the xarray package, which provides semantically labelled arrays of arbitrarily many dimensions. Pandas and xarray have strong conversion support, and panels have been deprecated in favour of using xarray.
Initial setup of the problem.
import numpy as np
A = np.array([[7,11,56,45], [20,21,74,12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90,213,9], [101, 34, 45]]).T
You can then create a three-dimensional xarray.DataArray object like so:
import xarray
output_as_dataarray = xarray.concat(
    [xarray.DataArray(X,
                      dims=['record', 'edge'],
                      coords={'record': range(X.shape[0]),
                              'edge': ['start', 'end']},
                      ) for X in (A, B, C)],
    dim='descriptor',
).assign_coords(descriptor=['A', 'B', 'C'])
We turn our three 2D numpy arrays into xarray.DataArray objects, and then concatenate them together along a new dimension.
Our output looks like so:
<xarray.DataArray (descriptor: 3, record: 4, edge: 2)>
array([[[  7.,  20.],
        [ 11.,  21.],
        [ 56.,  74.],
        [ 45.,  12.]],

       [[ 42.,  52.],
        [ nan,  nan],
        [ nan,  nan],
        [ nan,  nan]],

       [[ 90., 101.],
        [213.,  34.],
        [  9.,  45.],
        [ nan,  nan]]])
Coordinates:
  * record      (record) int64 0 1 2 3
  * edge        (edge) <U5 'start' 'end'
  * descriptor  (descriptor) <U1 'A' 'B' 'C'
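Once the data is in a DataArray, it can be queried by label and converted back to pandas. The following is a minimal sketch that rebuilds the same array (passing join="outer" explicitly, which is an assumption about the alignment the example above relies on) and then reshapes it into a DataFrame with two-level columns. The variable names da, b_pairs and df are illustrative.

```python
import numpy as np
import xarray

A = np.array([[7, 11, 56, 45], [20, 21, 74, 12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90, 213, 9], [101, 34, 45]]).T

# Same construction as above. join="outer" is spelled out so that the 'record'
# coordinates are unioned and the shorter sequences are filled with NaN.
da = xarray.concat(
    [
        xarray.DataArray(
            X,
            dims=["record", "edge"],
            coords={"record": range(X.shape[0]), "edge": ["start", "end"]},
        )
        for X in (A, B, C)
    ],
    dim="descriptor",
    join="outer",
).assign_coords(descriptor=["A", "B", "C"])

# Label-based selection: the start/end pairs of sequence B, without the padding.
b_pairs = da.sel(descriptor="B").dropna(dim="record")
print(b_pairs.values)

# Back to pandas: a Series indexed by (descriptor, record, edge), reshaped so
# that descriptor and edge become two-level columns.
df = da.to_series().unstack(["descriptor", "edge"])
print(df)
```

The resulting DataFrame carries the same information as the MultiIndex-column frame in the accepted answer, so the two approaches can be mixed as the number of dimensions grows.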