Combining HDF5 files in Python

Notice: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/18492273/

Time: 2020-08-19 10:53:02 · Source: igfitidea

Combining hdf5 files

Tags: python, hdf5, h5py

Asked by Bitwise

I have a number of HDF5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all of the datasets separately (i.e., not to concatenate the datasets into a single dataset).

One way to do this is to create an HDF5 file and then copy the datasets one by one. This would be slow and complicated because it would need to be a buffered copy.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.

Accepted answer by hBy2Py

One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:

In [1]: import h5py as h5

In [2]: hf1 = h5.File("f1.h5")

In [3]: hf2 = h5.File("f2.h5")

In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">

In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>

In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">

In [7]: hf1.flush()

In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")

In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")

In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]

In [11]: hf2.get("newval").value
Out[11]: 35

In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]

In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'

The above was generated with h5py version 2.0.1-2+b1 and iPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.

The hf1.flush() in command [7] is crucial, as the low-level interface apparently will always draw from the version of the .h5 file stored on disk, not the one cached in memory. Copying datasets to/from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
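
For instance, a hypothetical sketch of copying a dataset out of a non-root group, reusing the hf1/hf2 handles from the session above (the bytes names follow the Python 3 note earlier):

h5.h5o.copy(hf1.get("g1").id, b"val2", hf2.id, b"newval2")  # copies /g1/val2 to /newval2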

hf1.flush()命令[7]是至关重要的,因为低层次的接口显然将始终从版本吸取.h5存储在磁盘上,而不是缓存在内存中的文件。不是的根从复制组数据集/File可以通过使用,例如提供该组的ID来实现,hf1.get("g1").id

Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
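
A minimal way to guard against that, again with the handles above (the membership check is my own suggestion, not part of the original answer):

if "newval" not in hf2:  # skip names that already exist in the destination
    h5.h5o.copy(hf1.id, b"val", hf2.id, b"newval")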

Answer by Bitwise

I found a non-Python solution by using h5copy from the official HDF5 tools. h5copy can copy individual specified datasets from an HDF5 file into another existing HDF5 file.
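
For example, a minimal sketch of driving h5copy from plain Python via subprocess (the file and dataset names are illustrative assumptions):

import subprocess

# copy the dataset "/data" out of each part file into its own name in merged.h5
for part in ["part1.h5", "part2.h5"]:
    subprocess.run(["h5copy", "-i", part, "-o", "merged.h5",
                    "-s", "/data", "-d", "/" + part[:-3] + "_data"], check=True)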

If someone finds a python/h5py-based solution I would be glad to hear about it.

Answer by Yossarian

This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:

External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:

Here's how to do it in h5py:

myfile = h5py.File('foo.hdf5','a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")

Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.

This would be very much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent - that is, h5py would see all the datasets as residing in foo.hdf5.
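
Extending this to the question's many-files case, a minimal sketch that links every dataset from a list of source files into one container file (the file names and the naming scheme are illustrative assumptions):

import h5py

# build a container whose members are external links into the source files
source_files = ["part1.h5", "part2.h5"]
with h5py.File("combined.h5", "a") as container:
    for fname in source_files:
        with h5py.File(fname, "r") as src:
            for name in src.keys():
                # expose each dataset as "<file stem>_<dataset name>" in the container
                container[fname[:-3] + "_" + name] = h5py.ExternalLink(fname, "/" + name)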

Answer by G M

I usually use ipython and the h5copy tool together; this is much faster than a pure Python solution. Once h5copy is installed:

Console solution (minimal working example)

#PLEASE NOTE: THIS IS IPYTHON CONSOLE CODE, NOT PURE PYTHON

import h5py
#for every dataset Dn.h5 you want to merge into Output.h5
f = h5py.File('D1.h5','r') #file to be merged (read-only access is enough here)
h5_keys = list(f.keys()) #materialize the keys before closing the file (drop any you don't need)
f.close() #close the file
for i in h5_keys:
    !h5copy -i 'D1.h5' -o 'Output.h5' -s {i} -d {i}

Automated console solution

To completely automate the process, supposing you are working in the folder where the files to be merged are stored:

import os
import h5py

d_names = os.listdir(os.getcwd()) #assumes the working folder holds only the .h5 files to merge
d_struct = {} #Here we will store the database structure
for i in d_names:
    f = h5py.File(i,'r')
    d_struct[i] = list(f.keys()) #materialize the keys before closing the file
    f.close()

# A) copy every dataset into the root of output.h5, without any grouping
for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}

Create a new group for every .h5 file added

If you want to keep the previous datasets separate inside output.h5, you first have to create the group using the flag -p:

# B) Create a new group in the output.h5 file for every input .h5 file
for i in d_names:
    dataset = d_struct[i][0]
    newgroup = '%s/%s' %(i[:-3],dataset)
    !h5copy -i '{i}' -o 'output.h5' -s {dataset} -d {newgroup} -p  #-p creates the missing parent group
    for j in d_struct[i][1:]:
        newgroup = '%s/%s' %(i[:-3],j)
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {newgroup}

Answer by fedepad

To update on this: HDF5 version 1.10 comes with a new feature that might be useful in this context, called "Virtual Datasets".
Here you find a brief tutorial and some explanations: Virtual Datasets.
Here are more complete and detailed explanations and documentation for the feature:
Virtual Datasets extra doc.
And here is the merged pull request in h5py that includes the virtual datasets API in h5py:
h5py Virtual Datasets PR. I don't know if it's already available in the current h5py version or will come later.
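
For reference, a minimal VDS sketch against the h5py API as it was eventually released (the file names, dataset name, shape, and dtype are illustrative assumptions):

import h5py

# map four 1-D source datasets onto the rows of one virtual 2-D dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype="f8")
for k in range(4):
    layout[k] = h5py.VirtualSource("part%d.h5" % k, "data", shape=(100,))
with h5py.File("vds.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)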

Answer by zilba25

To use Python (and not IPython) and h5copy to merge HDF5 files, we can build on GM's answer:

import h5py
import os

d_names = os.listdir(os.getcwd()) #assumes the working folder holds only the .h5 files to merge
d_struct = {} #Here we will store the database structure
for i in d_names:
    f = h5py.File(i,'r')
    d_struct[i] = list(f.keys()) #materialize the keys before closing the file
    f.close()

for i in d_names:
    for j in d_struct[i]:
        os.system('h5copy -i %s -o output.h5 -s %s -d %s' % (i, j, j))