Original question: http://stackoverflow.com/questions/36760414/ (Stack Overflow content, licensed under CC BY-SA 4.0; credit belongs to the original authors.)
How to create pandas dataframes with more than 2 dimensions?
Asked by O.rka
I want to be able to create n-dimensional dataframes. I've heard of a method for 3D dataframes using panels in pandas, but, if possible, I would like to extend the dimensions past 3 by combining different datasets into a super dataframe.
I tried this, but I cannot figure out how to use these methods with my test dataset -> Constructing 3D Pandas DataFrame
Also, this did not help in my case -> Pandas Dataframe or Panel to 3d numpy array
I made a random test dataset with arbitrary axis data, trying to mimic a real situation; there are 3 axes (i.e. patients, years, and samples). I tried adding a bunch of dataframes to a list and then making a dataframe from that, but it didn't work :( I even tried a panel as in the 2nd link above, but I couldn't get that to work either.
Does anybody know how to create an N-dimensional pandas dataframe w/ labels?
The first way I tried:
import numpy as np
import pandas as pd
#Reproducibility
np.random.seed(1618033)
#Set 3 axis labels/dims
axis_1 = np.arange(2000,2010) #Years
axis_2 = np.arange(0,20) #Samples
axis_3 = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((axis_1.size, axis_2.size, axis_3.size)) #(10, 20, 3)
#Create empty list to store 2D dataframes (axis_2=rows, axis_3=columns) along axis_1
list_of_dataframes=[]
#Iterate through all of the year indices
for i in range(axis_1.size):
    #Create dataframe of (samples, patients)
    DF_slice = pd.DataFrame(A_3D[i,:,:],index=axis_2,columns=axis_3)
    list_of_dataframes.append(DF_slice)
# print(DF_slice) #preview of the 2D dataframes "slice" of the 3D array
#     patient_0  patient_1  patient_2
# 0    0.727753   0.154701   0.205916
# 1    0.796355   0.597207   0.897153
# 2    0.603955   0.469707   0.580368
# 3    0.365432   0.852758   0.293725
# 4    0.906906   0.355509   0.994513
# 5    0.576911   0.336848   0.265967
# ...
# 19   0.583495   0.400417   0.020099
# DF_3D = pd.DataFrame(list_of_dataframes,index=axis_2, columns=axis_1)
# Error
# Shape of passed values is (1, 10), indices imply (10, 20)
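The attempt above fails, as far as I can tell, because the DataFrame constructor expects 2D data and does not stack a list of DataFrames along a new axis. One possible way to combine the per-year slices is pd.concat with its keys argument, which puts the years into an outer index level. This is only a sketch: the level names "year" and "sample" are my own choice and do not come from the question.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
axis_1 = np.arange(2000, 2010)                            # Years
axis_2 = np.arange(0, 20)                                 # Samples
axis_3 = np.array(["patient_%d" % i for i in range(3)])   # Patients
A_3D = np.random.random((axis_1.size, axis_2.size, axis_3.size))

# One (samples x patients) DataFrame per year, as in the loop above.
list_of_dataframes = [
    pd.DataFrame(A_3D[i], index=axis_2, columns=axis_3)
    for i in range(axis_1.size)
]

# Stack the slices vertically; `keys` becomes the outer index level.
# The level names are illustrative assumptions.
DF_3D = pd.concat(
    list_of_dataframes, keys=axis_1.tolist(), names=["year", "sample"]
)

print(DF_3D.shape)  # (200, 3): 10 years * 20 samples rows, 3 patient columns
# Label-based lookup of one year, one sample, one patient.
print(DF_3D.loc[(2007, 15), "patient_1"])
```

The result is still a 2D DataFrame, but the (year, sample) MultiIndex on the rows carries the third dimension.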
The second way I tried:
DF = pd.DataFrame(axis_3,columns=axis_2)
#Error:
#Shape of passed values is (1, 3), indices imply (20, 3)
# p={}
# for i in axis_1:
#     p[i]=DF
# panel= pd.Panel(p)
I could do something like this, I guess, but I really like pandas and would rather use one of its methods if one exists:
#Set data for query
query_year = 2007
query_sample = 15
query_patient = "patient_1"
#Index based on query
A_3D[
    (axis_1 == query_year).argmax(),
    (axis_2 == query_sample).argmax(),
    (axis_3 == query_patient).argmax()
]
#0.1231212416981845
It would be awesome to access the data in this way:
DF_3D[query_year][query_sample][query_patient]
#Where DF_3D[query_year] would give a list of 2D arrays (row=sample, col=patient)
# DF_3D[query_year][query_sample] would give a 1D vector/list of patient data for a particular year, of a particular sample.
# and DF_3D[query_year][query_sample][query_patient] would be a particular sample of a particular patient of a particular year
Answered by Alexander
Rather than using an n-dimensional Panel, you are probably better off using a two-dimensional representation of the data, with MultiIndexes for the index, the columns, or both.
For example:
import numpy as np
import pandas as pd
np.random.seed(1618033)
#Set 3 axis labels/dims
years = np.arange(2000,2010) #Years
samples = np.arange(0,20) #Samples
patients = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((years.size, samples.size, len(patients))) #(10, 20, 3)
# Create the MultiIndex from years, samples and patients.
midx = pd.MultiIndex.from_product([years, samples, patients])
# Create sample data for each patient, and add the MultiIndex.
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index = midx)
>>> patient_data.head()
                         0         1         2
2000 0 patient_0 -0.128005  0.371413 -0.078591
       patient_1 -0.378728 -2.003226 -0.024424
       patient_2  1.339083  0.408708  1.724094
     1 patient_0 -0.997879 -0.251789 -0.976275
       patient_1  0.131380 -0.901092  1.456144
Once you have data in this form, it is relatively easy to juggle it around. For example:
>>> patient_data.unstack(level=0).head()  # Years.
                    0                                                                                            ...         2
                 2000      2001      2002      2003      2004      2005      2006      2007      2008      2009  ...      2000      2001      2002      2003      2004      2005      2006      2007      2008      2009
0 patient_0 -0.128005  0.051558  1.251120  0.666061 -1.048103  0.259231  1.535370  0.156281 -0.609149  0.360219  ... -0.078591 -2.305314 -2.253770  0.865997  0.458720  1.479144 -0.214834 -0.791904  0.800452  0.235016
  patient_1 -0.378728 -0.117470 -0.306892  0.810256  2.702960 -0.748132 -1.449984 -0.195038  1.151445  0.301487  ... -0.024424  0.114843  0.143700  1.732072  0.602326  1.465946 -1.215020  0.648420  0.844932 -1.261558
  patient_2  1.339083 -0.915771  0.246077  0.820608 -0.935617 -0.449514 -1.105256 -0.051772 -0.671971  0.213349  ...  1.724094  0.835418  0.000819  1.149556 -0.318513 -0.450519 -0.694412 -1.535343  1.035295  0.627757
1 patient_0 -0.997879 -0.242597  1.028464  2.093807  1.380361  0.691210 -2.420800  1.593001  0.925579  0.540447  ... -0.976275  1.928454 -0.626332 -0.049824 -0.912860  0.225834  0.277991  0.326982 -0.520260  0.788685
  patient_1  0.131380  0.398155 -1.671873 -1.329554 -0.298208 -0.525148  0.897745 -0.125233 -0.450068 -0.688240  ...  1.456144 -0.503815 -1.329334  0.475751 -0.201466  0.604806 -0.640869 -1.381123  0.524899  0.041983
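As a further illustration of juggling the data, the sketch below names the index levels so they can be referred to by name, then unstacks one level and aggregates over another. The level names ("year", "sample", "patient") and the variable names by_year and mean_over_samples are my own assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(3)]

# Naming the levels (names chosen for this sketch) lets later calls
# refer to them by name instead of by position.
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index=midx)

# Move the year level from the rows into the columns.
by_year = patient_data.unstack(level="year")
print(by_year.shape)             # (60, 30)

# Collapse one "dimension": mean over samples for each (year, patient).
mean_over_samples = patient_data.groupby(level=["year", "patient"]).mean()
print(mean_over_samples.shape)   # (30, 3)
```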
To select data, please refer to the docs for MultiIndexing.
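A few selection patterns, as a rough sketch of what those docs cover. The level names and the variable names (row, year_2007, one_patient, subset) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(3)]
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index=midx)

# Full key: one row, returned as a Series holding the 3 data columns.
row = patient_data.loc[(2007, 15, "patient_1")]

# Partial key: every (sample, patient) row for one year.
year_2007 = patient_data.loc[2007]

# Cross-section on an inner level: all years and samples for one patient.
one_patient = patient_data.xs("patient_1", level="patient")

# Slice several levels at once. Label slices are inclusive, and this kind
# of slicing needs a lexsorted index, which from_product gives here
# because the inputs are already sorted.
idx = pd.IndexSlice
subset = patient_data.loc[idx[2005:2007, 0:4, :], :]

print(row.shape, year_2007.shape, one_patient.shape, subset.shape)
# (3,) (60, 3) (200, 3) (45, 3)
```

This is close to the DF_3D[query_year][query_sample][query_patient] access the question asks for, with a tuple key in place of chained brackets.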
Answered by Charlie
An alternative approach (to Alexander's) that is derived from the structure of the input data is:
import numpy as np
import pandas as pd
np.random.seed(1618033)
#Set 3 axis labels/dims
years = np.arange(2000,2010) #Years
samples = np.arange(0,20) #Samples
patients = np.array(["patient_%d" % i for i in range(0,3)]) #Patients
#Create random 3D array to simulate data from dims above
A_3D = np.random.random((years.size, samples.size, len(patients))) #(10, 20, 3)
# Reshape data to 2 dimensions
maj_dim = 1
for dim in A_3D.shape[:-1]:
    maj_dim = maj_dim*dim
new_dims = (maj_dim, A_3D.shape[-1])
A_3D = A_3D.reshape(new_dims)
# Create the MultiIndex from years, samples and patients.
midx = pd.MultiIndex.from_product([years, samples])
# Note that Cartesian product order is the same as the
# C-order used by default in ``reshape``.
# Create sample data for each patient, and add the MultiIndex.
patient_data = pd.DataFrame(data = A_3D,
                            index = midx,
                            columns = patients)
>>> patient_data.head()
        patient_0  patient_1  patient_2
2000 0   0.727753   0.154701   0.205916
     1   0.796355   0.597207   0.897153
     2   0.603955   0.469707   0.580368
     3   0.365432   0.852758   0.293725
     4   0.906906   0.355509   0.994513
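To check that the labels really line up with the original array, a lookup by label can be compared against positional indexing. The sketch below does that and also reshapes the frame back to 3D. It keeps the original array intact by reshaping into a new variable, A_2D, and uses reshape(-1, ...) in place of the explicit loop; both are my substitutions rather than part of the answer, as are the level names.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(3)])
A_3D = np.random.random((years.size, samples.size, len(patients)))

# Collapse all leading axes into one; -1 lets NumPy infer the row count.
A_2D = A_3D.reshape(-1, A_3D.shape[-1])
midx = pd.MultiIndex.from_product([years, samples], names=["year", "sample"])
patient_data = pd.DataFrame(A_2D, index=midx, columns=patients)

# The labelled lookup agrees with positional indexing into the 3D array:
# year 2007 is position 7, sample 15 is position 15, patient_1 is position 1.
value = patient_data.loc[(2007, 15), "patient_1"]
assert value == A_3D[7, 15, 1]

# Going back to 3D is a plain reshape, since row order was preserved.
restored = patient_data.to_numpy().reshape(A_3D.shape)
assert np.array_equal(restored, A_3D)
```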