Create 2D array from Pandas dataframe
Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33753323/
Asked by mgutsche
Probably a very simple question, but I couldn't come up with a solution. I have a data frame with 9 columns and ~100000 rows. The data was extracted from an image, so two of the columns ('row' and 'col') give the pixel position of each entry. How can I create a numpy array A such that indexing it by row and column returns the value from another column, e.g. 'grumpiness'?
A[row, col]
# 0.1232
I want to avoid a for loop or something similar.
Answered by Divakar
You could do something like this -
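import numpy as np

# Extract row and column indices (assumed to be non-negative integers)
rowIDs = df['row'].values
colIDs = df['col'].values
# Set up the image array and write the "grumpiness" values into it;
# positions that do not appear in df stay 0
A = np.zeros((rowIDs.max() + 1, colIDs.max() + 1))
A[rowIDs, colIDs] = df['grumpiness'].values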
Sample run -
>>> df
   row  col  grumpiness
0    5    0    0.846412
1    0    1    0.703981
2    3    1    0.212358
3    0    2    0.101585
4    5    1    0.424694
5    5    2    0.473286
>>> A
array([[ 0.        ,  0.70398113,  0.10158488],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.21235838,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.84641194,  0.42469369,  0.47328598]])
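For anyone who wants to try the indexing approach end to end, here is a minimal self-contained sketch built on the sample data above. The helper name `frame_to_image` and its `fill` parameter are hypothetical (they are not part of the original answer), and `.to_numpy()` assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# Sample data copied from the run above
df = pd.DataFrame({
    "row": [5, 0, 3, 0, 5, 5],
    "col": [0, 1, 1, 2, 1, 2],
    "grumpiness": [0.846412, 0.703981, 0.212358, 0.101585, 0.424694, 0.473286],
})


def frame_to_image(frame, value_col, fill=0.0):
    """Hypothetical helper: scatter frame[value_col] into a dense 2D array.

    Assumes 'row' and 'col' hold non-negative integer pixel positions.
    """
    rows = frame["row"].to_numpy()
    cols = frame["col"].to_numpy()
    # Size the array so the largest row/col index is still in bounds
    image = np.full((rows.max() + 1, cols.max() + 1), fill, dtype=float)
    # Vectorised assignment: one write per data-frame row, no Python loop
    image[rows, cols] = frame[value_col].to_numpy()
    return image


A = frame_to_image(df, "grumpiness")
print(A.shape)  # (6, 3)
print(A[5, 0])  # 0.846412
```

Passing `fill=np.nan` instead of the default may be preferable when a real value of 0 has to be told apart from a pixel that has no entry in the data frame.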
Answered by jakevdp
One very quick and straightforward way to do this is to use a pivot_table:
>>> df
   row  col  grumpiness
0    5    0    0.846412
1    0    1    0.703981
2    3    1    0.212358
3    0    2    0.101585
4    5    1    0.424694
5    5    2    0.473286
>>> df.pivot_table('grumpiness', 'row', 'col', fill_value=0)
col         0         1         2
row
0    0.000000  0.703981  0.101585
3    0.000000  0.212358  0.000000
5    0.846412  0.424694  0.473286
Note that if any full rows/cols are missing from the data, the result will leave them out (rows 1, 2 and 4 in the example above), and if any row/col pair is repeated, it will average the values. That said, this will generally be much faster for larger datasets than an indexing-based approach.
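The two caveats in the note above can be checked with a short sketch. It uses the same sample data and then reindexes the pivoted table onto the full pixel grid so that the missing rows come back as zeros. The `reindex` step and the extra duplicated row are additions for illustration only; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the answer above
df = pd.DataFrame({
    "row": [5, 0, 3, 0, 5, 5],
    "col": [0, 1, 1, 2, 1, 2],
    "grumpiness": [0.846412, 0.703981, 0.212358, 0.101585, 0.424694, 0.473286],
})

table = df.pivot_table(values="grumpiness", index="row", columns="col", fill_value=0)

# Rows 1, 2 and 4 never occur in df, so the pivoted table drops them.
# Reindexing onto every pixel coordinate restores them, filled with zeros.
full_rows = np.arange(df["row"].max() + 1)
full_cols = np.arange(df["col"].max() + 1)
A = table.reindex(index=full_rows, columns=full_cols, fill_value=0).to_numpy()
print(A.shape)  # (6, 3)

# A repeated (row, col) pair is averaged, because aggfunc defaults to "mean"
extra = pd.DataFrame({"row": [5], "col": [0], "grumpiness": [0.0]})
dup = pd.concat([df, extra], ignore_index=True)
averaged = dup.pivot_table(values="grumpiness", index="row", columns="col", fill_value=0)
print(averaged.loc[5, 0])  # mean of 0.846412 and 0.0
```

If averaging duplicates is not wanted, `pivot_table` accepts an `aggfunc` argument (for example `"max"` or `"sum"`) that selects a different way of combining them.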