Original question: http://stackoverflow.com/questions/49734441/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Converting pandas dataframe to numpy array with headers and dtypes
Asked by GivenX
I have been trying to convert a pandas dataframe into a numpy array, carrying over the dtypes and header names for ease of reference. I need to do this because processing in pandas is WAY too slow; numpy is 10-fold quicker. I have this code from SO that gives me what I need, except that the result does not look like a standard numpy array, i.e. the shape does not include the number of columns.
[In]:
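import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 3), columns=['Acol', 'Ccol', 'Bcol'])
# df.as_matrix() was removed in pandas 1.0; df.to_numpy() is its replacement
arr_ip = [tuple(i) for i in df.to_numpy()]
dtyp = np.dtype(list(zip(df.dtypes.index, df.dtypes)))
dfnp = np.array(arr_ip, dtype=dtyp)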
print(dfnp.shape)
dfnp
[Out]:
(10,) #expecting (10,3)
array([(-1.0645345 , 0.34590193, 0.15063829),
( 1.5010928 , 0.63312454, 2.38309797),
(-0.10203999, -0.40589525, 0.63262773),
( 0.92725915, 1.07961763, 0.60425353),
( 0.18905164, -0.90602597, -0.27692396),
(-0.48671514, 0.14182815, -0.64240004),
( 0.05012859, -0.01969079, -0.74910076),
( 0.71681329, -0.38473052, -0.57692395),
( 0.60363249, -0.0169229 , -0.16330232),
( 0.04078263, 0.55943898, -0.05783683)],
dtype=[('Acol', '<f8'), ('Ccol', '<f8'), ('Bcol', '<f8')])
Am I missing something or is there another way of doing this? I have many df's to convert and their dtypes and column names vary so I need this automated approach. I also need it to be efficient due to the large number of df's.
Answered by jpp
Use df.to_records() to convert your dataframe to a structured array.
You can pass index=False to remove the index from your result.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,3),columns=['Acol','Ccol','Bcol'])
res = df.to_records(index=False)
# rec.array([(0.12448699852020828, 0.7621451848466592, 0.0958529943831431),
# (0.14534869167076214, 0.695297214355628, 0.3753874117495527),
# (0.09890006207909052, 0.46364777245941025, 0.10216301104094272),
# (0.3467673672203968, 0.4264108141950761, 0.1475998692158026),
# (0.9272619907467186, 0.3116253419608288, 0.5681628329642517),
# (0.34509767424461246, 0.5533523959180552, 0.02145207648054681),
# (0.7982313824847291, 0.563383955627413, 0.35286630304880684),
# (0.9574060540226251, 0.21296949881671157, 0.8882413119348652),
# (0.0892793829627454, 0.6157843461905468, 0.8310360916075473),
# (0.4691016244437851, 0.7007146447236033, 0.6672404967622088)],
# dtype=[('Acol', '<f8'), ('Ccol', '<f8'), ('Bcol', '<f8')])
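To illustrate how the header names and dtypes carry over, here is a minimal sketch using a small hypothetical frame with mixed column types (the values and the variable names col_by_key and col_by_attr are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with mixed dtypes, for illustration only.
df = pd.DataFrame({
    'Acol': np.array([1.5, 2.5, 3.5]),               # float64 column
    'Bcol': np.array([10, 20, 30], dtype=np.int64),  # int64 column
})

res = df.to_records(index=False)

# Column names become field names, and each field keeps its own dtype.
print(res.dtype.names)    # ('Acol', 'Bcol')
print(res.dtype['Bcol'])  # int64

# Fields can be read by name, either as a key or as an attribute.
col_by_key = res['Acol']
col_by_attr = res.Bcol
```

Each field comes back as an ordinary 1-D array, so per-column operations can be written against the header name.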
The result of to_records() will always be one-dimensional: each row is stored as a single record, and the fields inside a record do not count as an axis. That can't be changed.
But if every column has the same dtype (float64 here), you can get the 2-D shape via:
res.view(np.float64).reshape(len(res), -1).shape # (10, 3)
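The view-based line above relies on every field being float64. As an alternative sketch, NumPy 1.16 and later provide numpy.lib.recfunctions.structured_to_unstructured, which converts a structured array into a plain 2-D array. The frame and the variable names below are assumptions for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

df = pd.DataFrame(np.random.rand(10, 3), columns=['Acol', 'Ccol', 'Bcol'])
res = df.to_records(index=False)

# Work on a plain structured ndarray rather than the recarray subclass.
structured = np.asarray(res)

# Convert the 1-D structured array into an ordinary 2-D float array.
plain = rfn.structured_to_unstructured(structured)
print(plain.shape)  # (10, 3)
```

The field names are lost in the 2-D result, so keep res.dtype.names around if you still need to look columns up by header.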
For performance, if you are manipulating data, you are better off using a plain numpy.array via df.values and recording your column names in a dictionary with integer keys.
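A minimal sketch of that suggestion might look like the following. The variable names (arr, col_names, col_index) and the reverse lookup dictionary are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=['Acol', 'Ccol', 'Bcol'])

# Plain 2-D array: the shape is (rows, columns), as the question expected.
arr = df.values
print(arr.shape)  # (10, 3)

# Map integer column positions to header names, as the answer suggests.
col_names = dict(enumerate(df.columns))  # {0: 'Acol', 1: 'Ccol', 2: 'Bcol'}

# Optional reverse lookup (added here for convenience): name -> position.
col_index = {name: i for i, name in col_names.items()}

# Select a column by header name through its integer position.
ccol = arr[:, col_index['Ccol']]
```

Note that df.values returns an array with a single dtype, so a frame with mixed column types is upcast to a common dtype (often object) and the per-column dtypes are not preserved.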