pandas: Unable to save DataFrame to HDF5 ("object header message is too large")
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/16639503/

Unable to save DataFrame to HDF5 ("object header message is too large")
Asked by Amelio Vazquez-Reina
I have a DataFrame in Pandas:
In [7]: my_df
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)
When I try to save this to disk:
store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)
I get:
File "H5A.c", line 254, in H5Acreate2
unable to create attribute
File "H5A.c", line 503, in H5A_create
unable to create attribute in object header
File "H5Oattribute.c", line 347, in H5O_attr_create
unable to create new attribute in header
File "H5Omessage.c", line 224, in H5O_msg_append_real
unable to create new message
File "H5Omessage.c", line 1945, in H5O_msg_alloc
unable to allocate space for message
File "H5Oalloc.c", line 1142, in H5O_alloc
object header message is too large
End of HDF5 error back trace
Can't set attribute 'non_index_axes' in node:
/my_df(Group) u''.
Why?
Note: In case it matters, the DataFrame column names are simple, short strings:
In [12]: max([len(x) for x in list(my_df.columns)])
Out[12]: 47
This is all with Pandas 0.11 and the latest stable version of IPython, Python and HDF5.
Answered by BW0
HDF5 has a 64 KB header limit for all of the columns' metadata. This includes names, types, etc. Once you get to roughly 2000 columns, you will run out of space to store all the metadata. This is a fundamental limitation of pytables, and I don't think they will work around it on their side any time soon. You will either have to split the table up or choose another storage format.
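For illustration, a minimal sketch of the "split the table up" option (the chunk size of 1000 and the 'chunk0', 'chunk1', ... key names are arbitrary choices, not part of the original answer; a more complete helper follows in the next answer):

import pandas as pd

def save_wide_df(path, df, chunk_size=1000):
    # write the DataFrame column-wise in chunks, one table per chunk, so that
    # no single table's column metadata exceeds the 64 KB header limit
    store = pd.HDFStore(path)
    for i in range(0, df.shape[1], chunk_size):
        store.put('chunk{}'.format(i // chunk_size),
                  df.iloc[:, i:i + chunk_size], format='table')
    store.close()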
Answered by Volker L.
Although this thread is more than 5 years old, the problem is still relevant. It's still not possible to save a DataFrame with more than 2000 columns as one table in an HDFStore. Using format='fixed' isn't an option if one wants to choose which columns to read from the HDFStore later.
Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series recording which table each column belongs to is put into the HDFStore.
import pandas as pd


def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns : list
        a list of columns for storing. If set to `None`, all
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of
        a table in the HDFStore.
    """
    import numpy as np
    from collections import ChainMap
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = columns.shape[0]
    if colSize > maxColSize:
        numOfSplits = np.ceil(colSize / maxColSize).astype(int)
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        # map every column name to the table ('data0', 'data1', ...) it goes into
        _colsTabNum = ChainMap(*[
            dict(zip(cols, ['data{}'.format(num)] * len(cols)))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()
DataFrames stored into a HDFStore with the function above can be read with the following function.
def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from a HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.
    """
    store = pd.HDFStore(filename)
    data = []
    if 'colsTabNum' in store:
        # mapping of column name -> table name, written by wideDf_to_hdf
        colsTabNum = store.select('colsTabNum')
        if columns is not None:
            tabNums = colsTabNum[columns]
            for table in tabNums.unique():
                cols = list(tabNums[tabNums == table].index)
                data.append(store.select(table, columns=cols, **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
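A usage sketch for the two helpers above (the file name, column names and random data are made up for illustration):

import numpy as np
import pandas as pd

# a DataFrame wider than the default maxColSize of 2000
wide = pd.DataFrame(np.random.rand(10, 3000),
                    columns=['col{}'.format(i) for i in range(3000)])

wideDf_to_hdf('wide_store.h5', wide, mode='w')

# load only a few columns; the stored 'colsTabNum' Series tells the reader
# which table each requested column lives in
subset = read_hdf_wideDf('wide_store.h5', columns=['col0', 'col1500', 'col2500'])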
Answered by Alleo
As of 2014, the HDF5 documentation has been updated:
If you are using HDF5 1.8.0 or previous releases, there is a limit on the number of fields you can have in a compound datatype. This is due to the 64K limit on object header messages, into which datatypes are encoded. (However, you can create a lot of fields before it will fail. One user was able to create up to 1260 fields in a compound datatype before it failed.)
As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; format 'table' still raises the same error as in this topic.
I've also tried h5py, and got the "header too large" error as well (even though my version was > 1.8.0).
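A quick sketch of the format='fixed' route (the file name, key and column count here are arbitrary; note that the fixed format does not support selecting columns on read, so the whole frame has to be loaded):

import numpy as np
import pandas as pd

# a frame far wider than the ~2000-column limit of a single 'table' node
wide = pd.DataFrame(np.random.rand(5, 10000),
                    columns=['c{}'.format(i) for i in range(10000)])

# 'fixed' stores the column index as an array rather than as header
# attributes, so the 64 KB object-header limit is not hit
wide.to_hdf('wide_fixed.h5', key='wide', mode='w', format='fixed')

restored = pd.read_hdf('wide_fixed.h5', key='wide')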
Answered by Anurag Gupta
### USE get_weights AND set_weights TO SAVE AND LOAD THE MODEL, RESPECTIVELY.
##############################################################################
import pickle
from keras.models import Sequential            # or tensorflow.keras.models
from keras.layers import Conv2D, Activation, Flatten, Dense

# Assuming that this is your model architecture. However, you may use
# whatever architecture you want to (big or small; any).
def mymodel():
    inputShape = (28, 28, 3)
    model = Sequential()
    model.add(Conv2D(20, 5, padding="same", input_shape=inputShape))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dense(2, activation="softmax"))
    return model

model = mymodel()
model.fit(....)  # parameters to start training your model
##############################################################################
##############################################################################
# Once your model has been trained, you want to save it on your PC.
# Use the get_weights() command to get your model weights.
weigh = model.get_weights()

# Now, use pickle to save your model weights, instead of .h5.
# For heavy model architectures, the .h5 file is unsupported.
pklfile = "D:/modelweights.pkl"
try:
    fpkl = open(pklfile, 'wb')  # Python 3
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
except:
    fpkl = open(pklfile, 'w')  # Python 2
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)
    fpkl.close()
##############################################################################
##############################################################################
# In the future, you may want to load your model back.
# Use pickle to load the model weights.
pklfile = "D:/modelweights.pkl"
try:
    f = open(pklfile)  # Python 2
    weigh = pickle.load(f)
    f.close()
except:
    f = open(pklfile, 'rb')  # Python 3
    weigh = pickle.load(f)
    f.close()

restoredmodel = mymodel()
# Use set_weights to load the model weights into the model architecture.
restoredmodel.set_weights(weigh)
##############################################################################
##############################################################################
# Now, you can do your testing and evaluation - predictions.
y_pred = restoredmodel.predict(X)

