Read HDF5 file into numpy array
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46733052/
Asked by e9e9s
I have the following code to read an HDF5 file as a numpy array:
import numpy as np
import h5py
hf = h5py.File('path/to/file', 'r')
n1 = hf.get('dataset_name')
n2 = np.array(n1)
and when I print n2 I get this:
Out[15]:
array([[<HDF5 object reference>, <HDF5 object reference>,
<HDF5 object reference>, <HDF5 object reference>...
How can I read the HDF5 object reference to view the data stored in it?
Answered by bnaecker
The easiest thing is to use the .value attribute of the HDF5 dataset. (Note: .value was deprecated and has been removed in h5py 3.x; on current versions, slice the dataset instead, as shown below.)
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value # `data` is now an ndarray.
You can also slice the dataset, which produces an actual ndarray with the requested data:
>>> hf['dataset_name'][:10] # produces ndarray as well
But keep in mind that in many ways the h5py dataset acts like an ndarray, so you can pass the dataset itself unchanged to most, if not all, NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
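A minimal sketch of that behaviour, with the file path and dataset name as placeholders:
>>> import numpy as np
>>> import h5py
>>> hf = h5py.File('/path/to/file', 'r')
>>> ds = hf['dataset_name']   # still an h5py Dataset; the data stays on disk
>>> np.mean(ds)               # NumPy reads from the dataset directly
>>> np.sum(ds[:100])          # slicing first also works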
EDIT:
I misunderstood the question originally. The problem isn't loading the numerical data, it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.
First, let's create a file and a temporary dataset:
>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))
Next, create a reference to it and store a few of them in a dataset.
>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)
Then you can read one of these back, in a circuitous way, by getting its name and then reading from the actual dataset that is referenced.
>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)
It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.
Note:
The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example). Getting the name and reading from that is much easier.
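As a side note, recent h5py versions also let you dereference through the high-level interface by indexing the file object with the reference itself, which avoids the name lookup. A minimal sketch continuing the example above:
>>> out = f[ref_ds[0]]   # index the File with the stored reference
>>> out.name
'/data'
>>> out[:5]
array([0., 0., 0., 0., 0.])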
Answered by spate
Here is a direct approach to read an HDF5 file as a numpy array:
import numpy as np
import h5py
hf = h5py.File('path/to/file.h5', 'r')
n1 = np.array(hf["dataset_name"][:])  # dataset_name is the name of the HDF5 dataset in the file
print(n1)
Answered by ArcherEX
h5py provides an intrinsic method for such tasks: read_direct()
hf = h5py.File('path/to/file', 'r')
n1 = np.zeros(shape, dtype=numpy_type)
hf['dataset_name'].read_direct(n1)
hf.close()
The combined steps are still faster than n1 = np.array(hf['dataset_name']) if you %timeit them. The only drawback is that one needs to know the shape of the dataset beforehand, which can be assigned as an attribute by the data provider.
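If the shape is not provided as an attribute, it can also be read from the dataset's own metadata before allocating the target array. A small sketch of that variant, with the file path and dataset name as placeholders:
import numpy as np
import h5py
hf = h5py.File('path/to/file', 'r')
dset = hf['dataset_name']
n1 = np.empty(dset.shape, dtype=dset.dtype)  # allocate using the dataset's shape and dtype
dset.read_direct(n1)
hf.close()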
Answered by Pierre de Buyl
HDF5 has a simple object model for storing datasets (roughly speaking, the equivalent of an "on-file array") and organizing those into groups (think of directories). On top of these two object types, there are much more powerful features that require layers of understanding.
The one at hand is a "Reference". It is an internal address in the storage model of HDF5.
h5py will do all the work for you without any calls to obscure routines, as it tries to follow a dict-like interface as much as possible (but for references, it is a bit more complex to make it transparent).
The place to look in the docs is Object and Region References. It states that to access an object pointed to by reference ref, you do
my_object = my_file[ref]
In your problem, there are two steps: 1. Get the reference. 2. Get the dataset.
# Open the file
hf = h5py.File('path/to/file', 'r')
# Obtain the dataset of references
n1 = hf['dataset_name']
# Obtain the dataset pointed to by the first reference
ds = hf[n1[0]]
# Obtain the data in ds
data = ds[:]
If the dataset containing references is 2D, for instance, you must use
ds = hf[n1[0,0]]
If the dataset is scalar, you must use
data = ds[()]
To obtain all the datasets at once:
all_data = [hf[ref] for ref in n1[:]]
assuming a 1D dataset for n1. For 2D, the idea holds, but I don't see a short way to write it.
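For what it's worth, a nested comprehension seems to cover the 2D case. A minimal sketch, assuming n1 is a 2D dataset of object references:
# one inner list per row of references, each entry read into an ndarray
all_data = [[hf[ref][:] for ref in row] for row in n1[:]]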
To get a full idea of how to round-trip data with references, I wrote a short "writer" program and a short "reader" program:
import numpy as np
import h5py
# Open file
myfile = h5py.File('myfile.hdf5', 'w')
# Create dataset
ds_0 = myfile.create_dataset('dataset_0', data=np.arange(10))
ds_1 = myfile.create_dataset('dataset_1', data=9-np.arange(10))
# Create a dataset of references
ref_dtype = h5py.special_dtype(ref=h5py.Reference)
ds_refs = myfile.create_dataset('ref_to_dataset', shape=(2,), dtype=ref_dtype)
ds_refs[0] = ds_0.ref
ds_refs[1] = ds_1.ref
myfile.close()
and
import numpy as np
import h5py
# Open file
myfile = h5py.File('myfile.hdf5', 'r')
# Read the references
ref_to_ds_0 = myfile['ref_to_dataset'][0]
ref_to_ds_1 = myfile['ref_to_dataset'][1]
# Read the dataset
ds_0 = myfile[ref_to_ds_0]
ds_1 = myfile[ref_to_ds_1]
# Read the value in the dataset
data_0 = ds_0[:]
data_1 = ds_1[:]
myfile.close()
print(data_0)
print(data_1)
You will notice that you cannot use the standard, convenient NumPy-like syntax for reference datasets. This is because HDF5 references are not representable with NumPy datatypes. They must be read and written one at a time.
Answered by Yannick Guéhenneux
Hi, this is the way I use to read hdf5 data; hope it could be useful to you:
with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]
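If the dataset name is not known in advance, the top-level keys can be listed first. A small sketch along the same lines, with the file and dataset names as placeholders:
with h5py.File('name-of-file.h5', 'r') as hf:
    print(list(hf.keys()))           # show the top-level dataset/group names
    data = hf['name-of-dataset'][:]  # then read the one you want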
Answered by Vinod Kumar
I tried all the answers suggested previously, but none of them worked for me. For example, the read_direct() method gives the error 'Operation not defined for data type class'. The .value method also does not work. After a lot of struggling, I could get around it by using the reference itself to get the numpy array.
import numpy as np
import h5py
f = h5py.File('file.mat', 'r')
data2get = f.get('data2get')[:]        # 2D array of HDF5 object references
data = np.zeros([data2get.shape[1]])
for i in range(data2get.shape[1]):
    # dereference each entry and take its (scalar) value
    data[i] = np.array(f[data2get[0][i]])[0][0]
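A slightly more general variant of the same idea, keeping each referenced array whole instead of only its first element (the names reuse the example above):
# dereference every entry in the first row and keep the full arrays
arrays = [np.array(f[ref]) for ref in data2get[0]]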