Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

Note: this page is based on a popular StackOverflow question and its accepted answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), with a link back to the original question.

Original question: http://stackoverflow.com/questions/27710245/

Asked by Caleb
I am processing large 3D arrays, which I often need to slice in various ways to do a variety of data analysis. A typical "cube" can be ~100GB (and will likely get larger in the future).
It seems that the typically recommended file format for large datasets in Python is HDF5 (via either h5py or pytables). My question is: is there any speed or memory usage benefit to using HDF5 to store and analyze these cubes over storing them in simple flat binary files? Is HDF5 more appropriate for tabular data, as opposed to large arrays like what I am working with? I see that HDF5 can provide nice compression, but I am more interested in processing speed and dealing with memory overflow.
I frequently want to analyze only one large subset of the cube. One drawback of both pytables and h5py, it seems, is that when I take a slice of the array, I always get a numpy array back, using up memory. However, if I slice a numpy memmap of a flat binary file, I can get a view, which keeps the data on disk. So, it seems that I can more easily analyze specific sectors of my data without overrunning my memory.
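Roughly what I mean, as a minimal sketch (the filename and shape here are just placeholders):

import numpy as np

# Placeholder file/shape; assumes a C-ordered float32 cube written as raw bytes.
shape = (100, 200, 300)
cube = np.memmap('cube.dat', dtype=np.float32, mode='r', shape=shape)

sub = cube[10:20, :, :]   # still a numpy.memmap "view"; the data stays on disk
block = np.array(sub)     # only this explicit copy pulls the subset into RAM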
I have explored both pytables and h5py, and haven't seen the benefit of either so far for my purpose.
Accepted answer by Joe Kington
HDF5 Advantages: Organization, flexibility, interoperability
Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.
Another advantage of HDF is that the datasets can be either fixed-size or flexibly sized. Therefore, it's easy to append data to a large dataset without having to create an entire new copy.
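For instance, a minimal sketch of appending along a resizable axis with h5py (the file and dataset names here are made up for illustration):

import numpy as np
import h5py

# maxshape=(None, 64, 64) makes the first axis resizable, so new slabs can be
# appended in place without rewriting the file.
first = np.random.random((10, 64, 64))
more = np.random.random((5, 64, 64))

with h5py.File('growing.h5', 'w') as f:
    dset = f.create_dataset('cube', data=first,
                            maxshape=(None, 64, 64), chunks=True)
    old_n = dset.shape[0]
    dset.resize(old_n + more.shape[0], axis=0)
    dset[old_n:] = more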
Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it's not too hard with a big binary array, too, as long as you're aware of the C vs. F ordering and know the shape, dtype, etc of the stored array.)
HDF advantages for a large array: Faster I/O of an arbitrary slice
Just as the TL/DR: For an ~8GB 3D array, reading a "full" slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best case) to over three hours (worst case) for a memmapped array of the same data.
Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.
* (HDF5 doesn't have to be a chunked data format. It supports chunking, but doesn't require it. In fact, the default for creating a dataset in h5py is not to chunk, if I recall correctly.)
Basically, your best-case disk-read speed and your worst-case disk-read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best case is faster, but the worst case is much worse.
One caveat: if you have an SSD, you likely won't notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads (i.e. a regular hard drive has a long seek time). HDF still has an advantage on an SSD, but it's more due to its other features (e.g. metadata, organization, etc.) than to raw speed.
First off, to clear up confusion: accessing an h5py dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it's sliced. (Similar to memmap, but not identical.) Have a look at the h5py introduction for more information.
Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you'll need it in memory anyway.
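As a tiny illustration of that laziness (reusing the file and dataset names from the example further down):

import h5py

with h5py.File('test.hdf', 'r') as f:
    dset = f['a_descriptive_name']   # an h5py Dataset; no data has been read yet
    print(dset.shape, dset.dtype)    # shape/dtype come from the file's metadata
    block = dset[10:20, :, :]        # only this slice is read into a numpy array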
If you do want to do out-of-core computations, you can do so fairly easily for tabular data with pandas or pytables. It is possible with h5py (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.
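A rough sketch of what "handling the iteration yourself" looks like with h5py (again reusing the names from the example below), accumulating a sum one slab at a time so the full dataset never has to fit in memory:

import h5py

total = 0.0
with h5py.File('test.hdf', 'r') as f:
    dset = f['a_descriptive_name']
    step = 10                             # slab thickness along the first axis
    for start in range(0, dset.shape[0], step):
        slab = dset[start:start + step]   # only this slab is in memory at once
        total += slab.sum()
print(total)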
However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.
The "unchunked" case
First off, consider a 3D C-ordered array written to disk (I'll simulate it by calling arr.ravel() and printing the result, to make things more visible):
In [1]: import numpy as np
In [2]: arr = np.arange(4*6*6).reshape(4,6,6)
In [3]: arr
Out[3]:
array([[[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[ 12, 13, 14, 15, 16, 17],
[ 18, 19, 20, 21, 22, 23],
[ 24, 25, 26, 27, 28, 29],
[ 30, 31, 32, 33, 34, 35]],
[[ 36, 37, 38, 39, 40, 41],
[ 42, 43, 44, 45, 46, 47],
[ 48, 49, 50, 51, 52, 53],
[ 54, 55, 56, 57, 58, 59],
[ 60, 61, 62, 63, 64, 65],
[ 66, 67, 68, 69, 70, 71]],
[[ 72, 73, 74, 75, 76, 77],
[ 78, 79, 80, 81, 82, 83],
[ 84, 85, 86, 87, 88, 89],
[ 90, 91, 92, 93, 94, 95],
[ 96, 97, 98, 99, 100, 101],
[102, 103, 104, 105, 106, 107]],
[[108, 109, 110, 111, 112, 113],
[114, 115, 116, 117, 118, 119],
[120, 121, 122, 123, 124, 125],
[126, 127, 128, 129, 130, 131],
[132, 133, 134, 135, 136, 137],
[138, 139, 140, 141, 142, 143]]])
The values would be stored on-disk sequentially as shown on line 4 below. (Let's ignore filesystem details and fragmentation for the moment.)
In [4]: arr.ravel(order='C')
Out[4]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])
In the best-case scenario, let's take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (one seek, one read)
In [5]: arr[0,:,:]
Out[5]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one seek operation. If all we're going to be reading is various slices along this axis, then this is the perfect file structure.
However, let's consider the worst-case scenario: A slice along the last axis.
In [6]: arr[:,:,0]
Out[6]:
array([[ 0, 6, 12, 18, 24, 30],
[ 36, 42, 48, 54, 60, 66],
[ 72, 78, 84, 90, 96, 102],
[108, 114, 120, 126, 132, 138]])
To read this slice in, we need 24 seeks and 24 reads, as all of the values are separated on disk. None of them are adjacent!
This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a large-ish (~10Gb) 3D array stored in this way and read in via memmap, reading a full slice along the "worst" axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I'm only showing "full" slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.
Incidentally, there are several file formats that take advantage of this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in an intermediate ordering between the two. (An example of this is Geoprobe's D3D format, though I'm not sure it's documented anywhere.) Who cares if the final file size is 4TB; storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!
The simple "chunked" case
Let's say we store 2x2x2 "chunks" of the 3D array as contiguous blocks on disk. In other words, something like:
nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])
So the data on disk would look like chunked:
array([ 0, 1, 6, 7, 36, 37, 42, 43, 2, 3, 8, 9, 38,
39, 44, 45, 4, 5, 10, 11, 40, 41, 46, 47, 12, 13,
18, 19, 48, 49, 54, 55, 14, 15, 20, 21, 50, 51, 56,
57, 16, 17, 22, 23, 52, 53, 58, 59, 24, 25, 30, 31,
60, 61, 66, 67, 26, 27, 32, 33, 62, 63, 68, 69, 28,
29, 34, 35, 64, 65, 70, 71, 72, 73, 78, 79, 108, 109,
114, 115, 74, 75, 80, 81, 110, 111, 116, 117, 76, 77, 82,
83, 112, 113, 118, 119, 84, 85, 90, 91, 120, 121, 126, 127,
86, 87, 92, 93, 122, 123, 128, 129, 88, 89, 94, 95, 124,
125, 130, 131, 96, 97, 102, 103, 132, 133, 138, 139, 98, 99,
104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])
And just to show that they're 2x2x2 blocks of arr, notice that these are the first 8 values of chunked:
In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0, 1],
[ 6, 7]],
[[36, 37],
[42, 43]]])
To read in any slice along an axis, we'd read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That's a worst-case maximum of 9 seeks, vs a maximum of 24 seeks for the non-chunked version. (But the best case is still 6 seeks vs 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.
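You can check the 6-vs-9 chunk counts for the toy array directly; a small sketch using the same 2x2x2 chunking as above:

# Which 2x2x2 chunks does each "full" slice of the 4x6x6 array touch?
nx, ny, nz, c = 4, 6, 6, 2
z_chunks = {(i // c, j // c, 0) for i in range(nx) for j in range(ny)}  # arr[:, :, 0]
x_chunks = {(0, j // c, k // c) for j in range(ny) for k in range(nz)}  # arr[0, :, :]
print(len(z_chunks), len(x_chunks))   # 6 and 9 chunks, respectively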
HDF5 takes this a few steps farther. The chunks don't have to be stored contiguously, and they're indexed by a B-Tree. Furthermore, they don't have to be the same size on disk, so compression can be applied to each chunk.
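For example, a minimal sketch of per-chunk compression with h5py (the file/dataset names and chunk shape here are arbitrary):

import numpy as np
import h5py

data = np.random.random((100, 100, 100))
with h5py.File('compressed.h5', 'w') as f:
    # Each 20x20x20 chunk is gzip-compressed independently on disk.
    dset = f.create_dataset('cube', data=data, chunks=(20, 20, 20),
                            compression='gzip', compression_opts=4)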
Chunked arrays with h5py
By default, h5py doesn't create chunked HDF files on disk (I think pytables does, by contrast). If you specify chunks=True when creating the dataset, however, you'll get a chunked array on disk.
As a quick, minimal example:
import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'
Note that chunks=True tells h5py to automatically pick a chunk size for us. If you know more about your most common use-case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. (2,2,2) in the simple example above). This allows you to make reads along a particular axis more efficient or optimize for reads/writes of a certain size.
I/O Performance comparison
Just to emphasize the point, let's compare reading in slices from a chunked HDF5 dataset and a large (~8GB), Fortran-ordered 3D array containing the same exact data.
I've cleared all OS caches between each run, so we're seeing the "cold" performance.
For each file type, we'll test reading in a "full" x-slice along the first axis and a "full" z-slice along the last axis. For the Fortran-ordered memmapped array, the "x" slice is the worst case, and the "z" slice is the best case.
The code used is in a gist (including creating the hdf file). I can't easily share the data used here, but you could simulate it with an array of zeros of the same shape (621, 4991, 2600) and type np.uint8.
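If you don't want to dig through the gist, something along these lines should produce an equivalent test file (this is a rough reconstruction, not the original script; the path and dataset name match the benchmark code below):

import numpy as np
import h5py

shape = (621, 4991, 2600)   # ~8GB of uint8
with h5py.File('/tmp/test.hdf5', 'w') as f:
    dset = f.create_dataset('seismic_volume', shape=shape, dtype=np.uint8,
                            chunks=True)
    # Write one x-slab at a time so the full volume is never held in memory.
    for i in range(shape[0]):
        dset[i] = np.zeros(shape[1:], dtype=np.uint8)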
The chunked_hdf.py looks like this:
import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()
memmapped_array.py is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn't be an apples-to-apples comparison).
import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()
Let's have a look at the HDF performance first:
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python chunked_hdf.py z
python chunked_hdf.py z 0.64s user 0.28s system 3% cpu 23.800 total
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python chunked_hdf.py x
python chunked_hdf.py x 0.12s user 0.30s system 1% cpu 21.856 total
A "full" x-slice and a "full" z-slice take about the same amount of time (~20 sec). Considering this is an 8GB array, that's not too bad.
And if we compare this to the memmapped array times (it's Fortran-ordered: A "z-slice" is the best case and an "x-slice" is the worst case.):
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python memmapped_array.py z
python memmapped_array.py z 0.07s user 0.04s system 28% cpu 0.385 total
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python memmapped_array.py x
python memmapped_array.py x 2.46s user 37.24s system 0% cpu 3:35:26.85 total
Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 hours for the other.
The time to slice in the "x" direction is far longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)
However, if we're always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)
With a memmapped array, you're stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunksize such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.
In summary
Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over "raw" memmaps, but I don't have room to expand on all of them here. Compression can speed some things up (the data I work with doesn't benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with "raw" memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.
Overall, try it and see if it works well for your use case. I think you might be surprised.