How to append data to one specific dataset in an HDF5 file with h5py
Original question: http://stackoverflow.com/questions/47072859/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Asked by Midas.Inc
I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).
A short intro to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage during the transformation of the data to NumPy arrays, I needed to split the "transformation" into a few data chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an HDF5 file, then load the next 100 images and append them to the existing .h5 file, and so on (a sketch of this workflow follows below).
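A minimal sketch of that chunked workflow (the helper load_and_preprocess() and the total of 1000 images are assumptions for illustration, not part of the original question):

chunk_size = 100
total_images = 1000  # assumed total, for illustration only

for start in range(0, total_images, chunk_size):
    # hypothetical helper standing in for the actual image pipeline
    X_chunk, Y_chunk = load_and_preprocess(start, chunk_size)
    # the first chunk is written to a fresh .h5 file; every later chunk
    # must be appended to the existing datasets, which is the question here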
Now, I tried to store the first 100 transformed NumPy arrays as follows:
import h5py
from LoadIPV import LoadIPV
X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()
with h5py.File('.\PreprocessedData.h5', 'w') as hf:
    # maxshape=(None, ...) leaves axis 0 unlimited, so the datasets
    # can be resized (appended to) later
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))
As can be seen, the transformed NumPy arrays are split into four different "groups" that are stored in the four HDF5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.
My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to, for example, the existing X_train dataset of shape [100, 512, 512, 9], so that X_train becomes of shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.
Answered by Midas.Inc
I have found a solution that seems to work!
Have a look at this: incremental writes to hdf5 with h5py!
In order to append data to a specific dataset, it is necessary to first resize the dataset along the corresponding axis and then write the new data at the end of the "old" array.
Thus, the solution looks like this:
with h5py.File('.\PreprocessedData.h5', 'a') as hf:
    # grow each dataset along axis 0 by the number of new samples,
    # then write the new data into the freshly added region
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data
However, note that you should create the dataset with maxshape=(None,) (more generally, with None in each axis along which you want to be able to extend it), for example
h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,))
otherwise the dataset cannot be extended.
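Putting both pieces together, the whole append step can be wrapped in a small helper. This is only a sketch based on the answer above; the function append_to_dataset() is a hypothetical name, not part of h5py:

import h5py
from LoadIPV import LoadIPV  # preprocessing step from the question

def append_to_dataset(dset, arr):
    # grow dset along axis 0 by len(arr), then write arr into the new slots
    dset.resize(dset.shape[0] + arr.shape[0], axis=0)
    dset[-arr.shape[0]:] = arr

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

with h5py.File('.\PreprocessedData.h5', 'a') as hf:
    append_to_dataset(hf["X_train"], X_train_data)
    append_to_dataset(hf["X_test"], X_test_data)
    append_to_dataset(hf["Y_train"], Y_train_data)
    append_to_dataset(hf["Y_test"], Y_test_data)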