Python: incremental writes to HDF5 with h5py

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/25655588/

Incremental writes to hdf5 with h5py

Tags: python, hdf5, h5py

Asked by user116293

I have got a question about how best to write to hdf5 files with python / h5py.

I have data like:

-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
| 178       | 10       | 12       | ...
-----------------------------------------
| 179       | 12       | 11       | ...
-----------------------------------------
| 185       | 9        | 12       | ...
-----------------------------------------
| 187       | 15       | 12       | ...
                    ...

with about 10^4 columns, and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100GB with 1 byte ints).

With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.

I think a good hdf5 structure would thus be to have each column in the table above be an hdf5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The hdf5 structure isn't yet defined though, so it can be anything.

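For concreteness, here is a minimal sketch of that layout, storing each column as its own extendable 1-D dataset (the dataset names and dtypes are illustrative, not from the question). Reading two columns then touches only two datasets:

import h5py

# One extendable 1-D dataset per column: a column read never
# pulls in the other columns. Names here are hypothetical.
with h5py.File('/tmp/columns.h5', 'w') as f:
    f.create_dataset('timepoint', shape=(0,), maxshape=(None,),
                     dtype='i8', chunks=(10**4,))
    for name in ('voltage1', 'voltage254'):
        f.create_dataset(name, shape=(0,), maxshape=(None,),
                         dtype='i1', chunks=(10**4,))  # 1-byte ints, as in the question

# Typical read: grab two columns, load both into memory, do statistics.
with h5py.File('/tmp/columns.h5', 'r') as f:
    t = f['timepoint'][:]
    v = f['voltage254'][:]
    # ... fancy statistics on t and v ...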

Now the question: I receive the data ~10^4 rows at a time (and not exactly the same numbers of rows each time), and need to write it incrementally to the hdf5 file. How do I write that file?

I'm considering python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.

dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))

and then when another block of 10^4 rows arrives, replace the dataset?

Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).

I can bail on hdf5 if it's not the right tool for the job, though I think once the awkward writes are done, it'll be wonderful.

Accepted answer by unutbu

Per the FAQ, you can expand the dataset using dset.resize. For example,

import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):  # start fresh; a bare os.remove fails on the first run
    os.remove(path)

with h5py.File(path, "a") as f:
    # Extendable dataset: maxshape=(None,) allows growth along axis 0
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)  # floats are cast to i8 on assignment
    print(dset.shape)
    # (100000,)

    for i in range(3):
        # Grow by one block, then fill the newly added tail
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)
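
Since the incoming blocks are not all exactly 10^4 rows, the same resize-then-assign pattern generalizes to a small helper that grows the dataset by whatever arrives. This is a sketch building on the answer above; append_block and the block sizes are made up:

import h5py
import numpy as np

def append_block(dset, block):
    # Grow dset along axis 0, then write block into the new tail.
    # Works for a block of any length, so the final number of rows
    # never needs to be known in advance.
    n_old = dset.shape[0]
    dset.resize(n_old + len(block), axis=0)
    dset[n_old:] = block

with h5py.File('/tmp/blocks.h5', 'w') as f:
    dset = f.create_dataset('voltage284', shape=(0,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    for block in (np.arange(10**4), np.arange(12345)):  # uneven blocks
        append_block(dset, block)
    print(dset.shape)
    # (22345,)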

Answered by daniel

As @unutbu pointed out, dset.resize is an excellent option. It may be worthwhile to look at pandas and its HDF5 support, which may be useful given your workflow. It sounds like HDF5 is a reasonable choice given your needs, but it is possible that your problem may be expressed better using an additional layer on top.

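As a sketch of that pandas route (the store key and column names are hypothetical), HDFStore.append grows a table-format store block by block, and selected columns can be read back later:

import numpy as np
import pandas as pd

cols = ['voltage%d' % i for i in range(1, 5)]
block = pd.DataFrame(np.random.randint(0, 16, size=(10**4, 4)),
                     columns=cols)

with pd.HDFStore('/tmp/store.h5', mode='a') as store:
    # 'table' format (the default for append) supports growing the
    # store incrementally as each block of rows arrives.
    store.append('readings', block)

# Later: read back only the columns of interest.
df = pd.read_hdf('/tmp/store.h5', 'readings', columns=['voltage1'])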

One big thing to consider is the orientation of the data. If you're primarily interested in reads, and you are primarily fetching data by column, then it sounds like you may want to transpose the data such that the reads can happen by row, since HDF5 stores data in row-major order.

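A sketch of that transposed layout (sizes, names, and dtypes are made up): the 10^4 columns become axis 0, the dataset grows along axis 1 as rows arrive, and chunks of shape (1, 10**4) keep each column's values contiguous on disk, so a single-column read is cheap. The tradeoff is that each incoming row-block write now touches one chunk per column:

import h5py
import numpy as np

n_cols = 10**4  # original columns become rows in the transposed layout

with h5py.File('/tmp/transposed.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(n_cols, 0),
                            maxshape=(n_cols, None),
                            dtype='i1', chunks=(1, 10**4))

    # Appending a block of rows: transpose it and grow along axis 1.
    block = np.random.randint(0, 16, size=(5000, n_cols), dtype='i1')
    n_old = dset.shape[1]
    dset.resize(n_old + block.shape[0], axis=1)
    dset[:, n_old:] = block.T

    # Column 254 of the original table is now one contiguous row.
    col254 = dset[254, :]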