Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/23478297/
Pandas Dataframe or Panel to 3d numpy array
Asked by user2805751
Setup:
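import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
# Use .loc rather than chained indexing so the assignments reliably modify pdf
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']   # rows 2-3 share one value of a
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']   # rows 0-1 share another value of a
pdf.set_index(['a', 'b'])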
output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a 3d numpy array that reflects the a groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to be of the variable form:
array([[[-1.38655912, -0.90145951, -0.95106951, 0.76570984],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576]],
[[ 0.0768149 , -0.7566995 , -2.57770951, 0.70834656],
[-0.99097395, -0.81592084, -1.21075386, 0.12361382]]])
Is there a native pandas way to do this? Note that the number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?
Accepted answer by Jeff
panel.values
will return a numpy array directly. By necessity this will be of the highest acceptable dtype, since everything is smushed into a single 3-d numpy array. It will be a new array and not a view of the pandas data (no matter the dtype).
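Because Panel was removed in pandas 0.25, the statement above can no longer be run as written. As a rough stand-in, the sketch below stacks two equally shaped DataFrames with np.stack (a substitution for illustration, not the answer's method; the frame names are made up) to show the same two effects: the values are upcast to a common dtype, and the result is a new array rather than a view.

```python
import numpy as np
import pandas as pd

# Two hypothetical "items" with the same shape but different dtypes.
ints = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})                # integer columns
floats = pd.DataFrame({"x": [0.5, 1.5, 2.5], "y": [3.5, 4.5, 5.5]})  # float columns

# np.stack copies both 2-d blocks into one new 3-d array (items, rows, columns).
cube = np.stack([ints.to_numpy(), floats.to_numpy()])

print(cube.shape)   # (2, 3, 2)
print(cube.dtype)   # float64
# The stacked array owns its own memory; it is not a view of either frame.
print(np.shares_memory(cube, ints.to_numpy()))
```

If one of the frames held strings, the common dtype would be object, which is usually a sign that the data should not be forced into a single array.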
Answer by Leo
I just had an extremely similar problem and solved it like this:
# DataFrame.as_matrix was removed in pandas 1.0; to_numpy() is its replacement
a3d = np.array([group.to_numpy() for _, group in pdf.groupby('a')])
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
To verify that it is 3d: a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first), use:
# to_numpy() replaces as_matrix, which was removed in pandas 1.0
a3d = np.dstack([group.to_numpy() for _, group in pdf.groupby('a')])
which has a shape of (2, 5, 2)
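Here is a small self-contained check of the two layouts, assuming pandas 1.0 or later (where to_numpy is available) and a made-up frame with equal-sized groups. It also shows that np.dstack on the per-group list gives the same result as moving the first axis of the stacked array to the end with np.moveaxis.

```python
import numpy as np
import pandas as pd

# Illustrative data: two groups in 'a', two rows each.
df = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, 2.0],
    "b": [0.1, 0.2, 0.3, 0.4],
    "c": [10.0, 20.0, 30.0, 40.0],
})

# One 2-d block per group; iterating a groupby keeps the grouping column.
blocks = [group.to_numpy() for _, group in df.groupby("a")]

group_first = np.array(blocks)   # shape (n_groups, rows, columns)
group_last = np.dstack(blocks)   # shape (rows, columns, n_groups)

print(group_first.shape)  # (2, 2, 3)
print(group_last.shape)   # (2, 3, 2)
print(np.array_equal(group_last, np.moveaxis(group_first, 0, -1)))  # True
```

Which layout is preferable depends on how the array is consumed; keeping the group axis first means a3d[i] is simply the i-th group's block.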
For cases where the data is ragged (as brought up by CharlesG in the comments), you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy for dealing with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
# Rows 2-4 form a group of three, rows 0-1 a group of two
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])
dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
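n_max = pdf.groupby('a').size().max()
# Zero-pad each group's rows up to n_max so that all groups have the same shape
# (to_numpy() replaces as_matrix, which was removed in pandas 1.0)
a3d = np.array([np.pad(group.to_numpy(), ((0, n_max - len(group)), (0, 0)), 'constant')
                for _, group in pdf.groupby('a')])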
a3d.shape gives (2, 3, 5)
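For frames with hundreds of thousands of rows, padding group by group in Python may become slow. The following is one possible vectorized sketch of the same zero-padding idea, not part of the original answer: pad_groups_to_3d and its parameters are hypothetical names, and it assumes every column is numeric and the key column has no missing values.

```python
import numpy as np
import pandas as pd

def pad_groups_to_3d(df, key, fill_value=0.0):
    """Hypothetical helper: stack the groups of `df` (grouped by `key`) into an
    array of shape (n_groups, n_max, n_columns), padding short groups."""
    # Integer code per row identifying its group; sort=True mirrors groupby's ordering.
    codes, uniques = pd.factorize(df[key], sort=True)
    # Position of each row within its own group (0, 1, 2, ...).
    pos = df.groupby(key).cumcount().to_numpy()
    n_max = pos.max() + 1
    out = np.full((len(uniques), n_max, df.shape[1]), fill_value, dtype=float)
    # One fancy-indexed assignment places every real row; padding keeps fill_value.
    out[codes, pos] = df.to_numpy(dtype=float)
    return out

# Illustrative ragged data: group 1.0 has two rows, group 2.0 has three.
df = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, 2.0, 2.0],
    "b": [0.1, 0.2, 0.3, 0.4, 0.5],
    "c": [10.0, 20.0, 30.0, 40.0, 50.0],
})
a3d = pad_groups_to_3d(df, "a")
print(a3d.shape)  # (2, 3, 3)
print(a3d[0, 2])  # the padded row of the shorter group: all zeros
```

Passing fill_value=np.nan instead of zero keeps padded rows distinguishable from real data, which may matter if zero is a legitimate value.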

