pandas: Unable to save DataFrame to HDF5 ("object header message is too large")
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/16639503/

Unable to save DataFrame to HDF5 ("object header message is too large")
Asked by Amelio Vazquez-Reina
I have a DataFrame in Pandas:
In [7]: my_df
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)
When I try to save this to disk:
store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)
I get:
File "H5A.c", line 254, in H5Acreate2
unable to create attribute
File "H5A.c", line 503, in H5A_create
unable to create attribute in object header
File "H5Oattribute.c", line 347, in H5O_attr_create
unable to create new attribute in header
File "H5Omessage.c", line 224, in H5O_msg_append_real
unable to create new message
File "H5Omessage.c", line 1945, in H5O_msg_alloc
unable to allocate space for message
File "H5Oalloc.c", line 1142, in H5O_alloc
object header message is too large
End of HDF5 error back trace
Can't set attribute 'non_index_axes' in node:
/my_df(Group) u''.
Why?
Note: In case it matters, the DataFrame column names are simple, short strings:
In [12]: max([len(x) for x in list(my_df.columns)])
Out[12]: 47
This is all with Pandas 0.11 and the latest stable version of IPython, Python and HDF5.
Answered by BW0
HDF5 has a 64 KB header limit for all of the columns' metadata. This includes names, types, etc. Once you get to roughly 2000 columns, you will run out of space to store all the metadata. This is a fundamental limitation of pytables, and I don't think they will work around it on their side any time soon. You will either have to split the table up or choose another storage format.
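For illustration, a minimal sketch of the "split the table up" option (the chunk size of 1000 and the 'chunk0', 'chunk1', ... key names are arbitrary choices, not part of the original answer; a more complete helper follows in the next answer):

import pandas as pd

def save_wide_df(path, df, chunk_size=1000):
    # write the DataFrame column-wise in chunks, one table per chunk, so that
    # no single table's column metadata exceeds the 64 KB header limit
    store = pd.HDFStore(path)
    for i in range(0, df.shape[1], chunk_size):
        store.put('chunk{}'.format(i // chunk_size),
                  df.iloc[:, i:i + chunk_size], format='table')
    store.close()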
Answered by Volker L.
Although this thread is more than 5 years old, the problem is still relevant. It's still not possible to save a DataFrame with more than 2000 columns as one table in an HDFStore. Using format='fixed' isn't an option if one wants to choose which columns to read from the HDFStore later.
Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series recording which table each column belongs to is put into the HDFStore.
import pandas as pd


def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns : list
        a list of columns for storing. If set to `None`, all
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of
        a table in the HDFStore.
    """
    import numpy as np
    from collections import ChainMap
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = columns.shape[0]
    if colSize > maxColSize:
        numOfSplits = np.ceil(colSize / maxColSize).astype(int)
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        # map every column name to the table ('data0', 'data1', ...) it goes into
        _colsTabNum = ChainMap(*[
            dict(zip(cols, ['data{}'.format(num)] * len(cols)))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()
DataFrames stored into a HDFStore with the function above can be read with the following function.
def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from a HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.
    """
    store = pd.HDFStore(filename)
    data = []
    if 'colsTabNum' in store:
        # mapping of column name -> table name, written by wideDf_to_hdf
        colsTabNum = store.select('colsTabNum')
        if columns is not None:
            tabNums = colsTabNum[columns]
            for table in tabNums.unique():
                cols = list(tabNums[tabNums == table].index)
                data.append(store.select(table, columns=cols, **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
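A usage sketch for the two helpers above (the file name, column names and random data are made up for illustration):

import numpy as np
import pandas as pd

# a DataFrame wider than the default maxColSize of 2000
wide = pd.DataFrame(np.random.rand(10, 3000),
                    columns=['col{}'.format(i) for i in range(3000)])

wideDf_to_hdf('wide_store.h5', wide, mode='w')

# load only a few columns; the stored 'colsTabNum' Series tells the reader
# which table each requested column lives in
subset = read_hdf_wideDf('wide_store.h5', columns=['col0', 'col1500', 'col2500'])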
Answered by Alleo
As of 2014, the HDF5 documentation has been updated:
If you are using HDF5 1.8.0 or previous releases, there is a limit on the number of fields you can have in a compound datatype. This is due to the 64K limit on object header messages, into which datatypes are encoded. (However, you can create a lot of fields before it will fail. One user was able to create up to 1260 fields in a compound datatype before it failed.)
As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; format 'table' still raises the same error as in this topic.
I've also tried h5py, and got the "header too large" error as well (even though my version was > 1.8.0).
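A quick sketch of the format='fixed' route (the file name, key and column count here are arbitrary; note that the fixed format does not support selecting columns on read, so the whole frame has to be loaded):

import numpy as np
import pandas as pd

# a frame far wider than the ~2000-column limit of a single 'table' node
wide = pd.DataFrame(np.random.rand(5, 10000),
                    columns=['c{}'.format(i) for i in range(10000)])

# 'fixed' stores the column index as an array rather than as header
# attributes, so the 64 KB object-header limit is not hit
wide.to_hdf('wide_fixed.h5', key='wide', mode='w', format='fixed')

restored = pd.read_hdf('wide_fixed.h5', key='wide')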
Answered by Anurag Gupta
### USE get_weights AND set_weights TO SAVE AND LOAD THE MODEL, RESPECTIVELY.
##############################################################################
import pickle
from keras.models import Sequential            # or tensorflow.keras.models
from keras.layers import Conv2D, Activation, Flatten, Dense

# Assuming that this is your model architecture. However, you may use
# whatever architecture you want to (big or small; any).
def mymodel():
    inputShape = (28, 28, 3)
    model = Sequential()
    model.add(Conv2D(20, 5, padding="same", input_shape=inputShape))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dense(2, activation="softmax"))
    return model

model = mymodel()
model.fit(....)  # parameters to start training your model
##############################################################################
##############################################################################
# Once your model has been trained, you want to save it on your PC.
# Use the get_weights() command to get your model weights.
weigh = model.get_weights()

# Now, use pickle to save your model weights, instead of .h5.
# For heavy model architectures, the .h5 file is unsupported.
pklfile = "D:/modelweights.pkl"
try:
    fpkl = open(pklfile, 'wb')  # Python 3
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
except:
    fpkl = open(pklfile, 'w')  # Python 2
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
##############################################################################
##############################################################################
# In the future, you may want to load your model back.
# Use pickle to load the model weights.
pklfile = "D:/modelweights.pkl"
try:
    f = open(pklfile)  # Python 2
    weigh = pickle.load(f)
    f.close()
except:
    f = open(pklfile, 'rb')  # Python 3
    weigh = pickle.load(f)
    f.close()

restoredmodel = mymodel()
# Use set_weights to load the model weights into the model architecture.
restoredmodel.set_weights(weigh)
##############################################################################
##############################################################################
# Now, you can do your testing and evaluation - predictions.
y_pred = restoredmodel.predict(X)

