How to append data to one specific dataset in an HDF5 file with h5py
Original question: http://stackoverflow.com/questions/47072859/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Asked by Midas.Inc
I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).
A short intro to my project: I am trying to train a CNN on medical image data. Because of the huge amount of data and the heavy memory usage during the transformation of the data to NumPy arrays, I needed to split the "transformation" into a few data chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an HDF5 file, then load the next 100 images and append them to the existing .h5 file, and so on (a sketch of this workflow follows below).
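A minimal sketch of that chunked workflow (the helper load_and_preprocess() and the total of 1000 images are assumptions for illustration, not part of the original question):

chunk_size = 100
total_images = 1000  # assumed total, for illustration only

for start in range(0, total_images, chunk_size):
    # hypothetical helper standing in for the actual image pipeline
    X_chunk, Y_chunk = load_and_preprocess(start, chunk_size)
    # the first chunk is written to a fresh .h5 file; every later chunk
    # must be appended to the existing datasets, which is the question here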
Now, I tried to store the first 100 transformed NumPy arrays as follows:
import h5py
from LoadIPV import LoadIPV
X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()
with h5py.File('.\PreprocessedData.h5', 'w') as hf:
    # maxshape=(None, ...) leaves axis 0 unlimited, so the datasets
    # can be resized (appended to) later
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))
As can be seen, the transformed NumPy arrays are split into four different "groups" that are stored in the four HDF5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.
My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to, for example, the existing X_train dataset of shape [100, 512, 512, 9], so that X_train becomes of shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.
Answered by Midas.Inc
I have found a solution that seems to work!
Have a look at this: incremental writes to hdf5 with h5py!
In order to append data to a specific dataset, it is necessary to first resize the dataset along the corresponding axis and then write the new data at the end of the "old" array.
Thus, the solution looks like this:
with h5py.File('.\PreprocessedData.h5', 'a') as hf:
    # grow each dataset along axis 0 by the number of new samples,
    # then write the new data into the freshly added region
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data
However, note that you should create the dataset with maxshape=(None,) (more generally, with None in each axis along which you want to be able to extend it), for example
h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,))
otherwise the dataset cannot be extended.
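Putting both pieces together, the whole append step can be wrapped in a small helper. This is only a sketch based on the answer above; the function append_to_dataset() is a hypothetical name, not part of h5py:

import h5py
from LoadIPV import LoadIPV  # preprocessing step from the question

def append_to_dataset(dset, arr):
    # grow dset along axis 0 by len(arr), then write arr into the new slots
    dset.resize(dset.shape[0] + arr.shape[0], axis=0)
    dset[-arr.shape[0]:] = arr

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

with h5py.File('.\PreprocessedData.h5', 'a') as hf:
    append_to_dataset(hf["X_train"], X_train_data)
    append_to_dataset(hf["X_test"], X_test_data)
    append_to_dataset(hf["Y_train"], Y_train_data)
    append_to_dataset(hf["Y_test"], Y_test_data)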