Fastest save and load options for a numpy array

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30329726/


Tags: python, arrays, performance, numpy, io

Asked by dbliss

I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.

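For reference, the baseline described in the question is just the plain np.save / np.load round trip (a minimal sketch; the file name and the smaller shape are illustrative):

import numpy as np

data = np.random.random((1000, 1000))   # the question uses shape on the order of (1e3, 1e6)

np.save('array.npy', data)              # write a binary .npy file to disk
loaded = np.load('array.npy')           # read it back; dtype and shape are preserved exactly

assert np.array_equal(data, loaded)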

Accepted answer by Jiby

For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:


  • NumPy.memmap, maps big arrays to binary form
    • Pros:
      • No dependency other than NumPy
      • Transparent replacement of ndarray (any class accepting an ndarray accepts a memmap)
    • Cons:
      • Chunks of your array are limited to 2.5G
      • Still limited by NumPy throughput
  • Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py

    • Pros:
      • The format supports compression, indexing, and other super nice features
      • Apparently the ultimate petabyte-scale file format
    • Cons:
      • Learning curve of having a hierarchical format?
      • Have to define what your performance needs are (see later)
  • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed; see the sketch after this list)

    • Pros:
      • It's Pythonic! (haha)
      • Supports all sorts of objects
    • Cons:
      • Probably slower than others (because aimed at arbitrary objects, not arrays)
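
For completeness, the pickle route mentioned in the last bullet is just the standard-library dump/load pattern (a sketch; the file name is illustrative):

import pickle
import numpy as np

data = np.random.random((1000, 1000))

with open('array.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)  # the most efficient binary pickle format

with open('array.pkl', 'rb') as f:
    loaded = pickle.load(f)

assert np.array_equal(data, loaded)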


Numpy.memmap


From the docs of NumPy.memmap:


Create a memory-map to an array stored in a binary file on disk.

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.

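A minimal sketch of that pattern (file name, dtype, and shape are illustrative):

import numpy as np

shape = (1000, 1000)

# Create the backing file on disk and write into it.
fp = np.memmap('array.dat', dtype='float64', mode='w+', shape=shape)
fp[:] = np.random.random(shape)
fp.flush()

# Later: map the same file again; only the parts you touch are read from disk.
fp2 = np.memmap('array.dat', dtype='float64', mode='r', shape=shape)
print(fp2[0, :5])
print(isinstance(fp2, np.ndarray))  # True, as the quoted docs say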



HDF5 arrays


From the h5py docs:


Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.


The format supports compressing the data in various ways (more bits loaded for the same I/O read), but this means the data becomes less easy to query individually; in your case, though (purely loading / dumping arrays), it might be efficient.

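A minimal h5py sketch of that dump/load pattern (file and dataset names are illustrative; the compression argument is optional):

import h5py
import numpy as np

data = np.random.random((1000, 1000))

with h5py.File('array.h5', 'w') as f:
    f.create_dataset('data', data=data, compression='gzip')

with h5py.File('array.h5', 'r') as f:
    loaded = f['data'][:]         # read the whole dataset into memory
    first_rows = f['data'][:10]   # or slice it without reading everything

assert np.array_equal(data, loaded)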

Answer by Mike Müller

Here is a comparison with PyTables.


I cannot get up to (int(1e3), int(1e6)) due to memory restrictions. Therefore, I used a smaller array:


data = np.random.random((int(1e3), int(1e5)))

NumPy save:


%timeit np.save('array.npy', data)
1 loops, best of 3: 4.26 s per loop

NumPy load:


%timeit data2 = np.load('array.npy')
1 loops, best of 3: 3.43 s per loop

PyTables writing:


%%timeit
with tables.open_file('array.tbl', 'w') as h5_file:
    h5_file.create_array('/', 'data', data)

1 loops, best of 3: 4.16 s per loop

PyTables reading:


%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()

1 loops, best of 3: 3.51 s per loop

The numbers are very similar. So, no real gain with PyTables here. But we are pretty close to the maximum write and read rates of my SSD.


Writing:


Maximum write speed: 241.6 MB/s
PyTables write speed: 183.4 MB/s

Reading:


Maximum read speed: 250.2 MB/s
PyTables read speed: 217.4 MB/s
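
As a quick sanity check (not part of the original answer), these rates follow directly from the array size and the timings above, assuming the MB figures are mebibytes (2**20 bytes):

data_bytes = int(1e3) * int(1e5) * 8   # the float64 array used in the timings
size_mib = data_bytes / 1024**2        # ~762.9 MiB
print(size_mib / 4.16)                 # ~183 MiB/s  (PyTables write took 4.16 s)
print(size_mib / 3.51)                 # ~217 MiB/s  (PyTables read took 3.51 s)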

Compression does not really help due to the randomness of the data:


%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'data', obj=data)
1 loops, best of 3: 4.08 s per loop

Reading of the compressed data becomes a bit slower:


%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()

1 loops, best of 3: 4.01 s per loop

This is different for regular data:


reg_data = np.ones((int(1e3), int(1e5)))

Writing is significantly faster:


%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'reg_data', obj=reg_data)

1 loops, best of 3: 849 ms per loop


The same holds true for reading:


%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    reg_data2 = h5_file.root.reg_data.read()

1 loops, best of 3: 1.7 s per loop

Conclusion: The more regular your data is, the faster it should get using PyTables.


Answer by Clock ZHONG

In my experience, np.save() and np.load() are so far the fastest options for transferring data between hard disk and memory. Before I reached this conclusion, I relied heavily on databases and HDFS for data loading. My tests show that loading data from a database (hard disk to memory) runs at around 50 MB/s (bytes/second), while the np.load() bandwidth is almost the same as my hard disk's maximum bandwidth: 2 GB/s (bytes/second). Both test environments use the simplest data structure.


And I don't think it's a problem to take a few seconds to load an array of shape (1e3, 1e6). E.g., if your array shape is (1000, 1000000) and its data type is float128, then the pure data size is (128/8) * 1000 * 1,000,000 = 16,000,000,000 bytes = 16 GB, and if loading it takes 4 seconds, your data-loading bandwidth is 16 GB / 4 s = 4 GB/s. The SATA3 maximum bandwidth is 600 MB/s = 0.6 GB/s; your data-loading bandwidth is already more than 6 times that, and your data-loading performance almost competes with DDR's maximum bandwidth. What else do you want?

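Spelled out as a small script (purely illustrative; the 4-second load time is the assumed figure from the paragraph above):

itemsize = 128 // 8                               # bytes per element for float128
shape = (1000, 1_000_000)
size_gb = shape[0] * shape[1] * itemsize / 1e9    # 16.0 GB
load_seconds = 4
print(size_gb / load_seconds)                     # 4.0 GB/s, vs. 0.6 GB/s for SATA3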

So my final conclusion is:


Don't use Python's pickle, don't use any database, and don't use any big-data system to store your data on hard disk if you can use np.save() and np.load(). These two functions are so far the fastest solution for transferring data between hard disk and memory.


I've also tested HDF5 and found it much slower than np.load() and np.save(), so use np.save() and np.load() if you have enough DDR memory in your platform.


Answer by Nico Schlömer

I've compared a few methods using perfplot (one of my projects). Here are the results:


Writing


[Plot: write timings for npy, hdf5, pickle, pytables, and zarr vs. array size]


For large arrays, all methods are about equally fast. The file sizes are also equal, which is to be expected since the input arrays are random doubles and hence hardly compressible.


Code to reproduce the plot:


import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def npy_write(data):
    numpy.save("npy.npy", data)


def hdf5_write(data):
    f = h5py.File("hdf5.h5", "w")
    f.create_dataset("data", data=data)
    f.close()  # close the file so the data is flushed and the file can be reopened


def pickle_write(data):
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)


def pytables_write(data):
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()


def zarr_write(data):
    zarr.save("out.zarr", data)


perfplot.save(
    "write.png",
    setup=numpy.random.rand,
    kernels=[npy_write, hdf5_write, pickle_write, pytables_write, zarr_write],
    n_range=[2 ** k for k in range(28)],
    xlabel="len(data)",
    logx=True,
    logy=True,
    equality_check=None,
)

Reading

[Plot: read timings for npy, hdf5, pickle, pytables, and zarr vs. array size]


pickles, pytables and hdf5 are roughly equally fast; pickles and zarr are slower for large arrays.


Code to reproduce the plot:


import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def setup(n):
    data = numpy.random.rand(n)
    # write all files
    #
    numpy.save("out.npy", data)
    #
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()
    #
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)
    #
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()
    #
    zarr.save("out.zip", data)


def npy_read(data):
    return numpy.load("out.npy")


def hdf5_read(data):
    f = h5py.File("out.h5", "r")
    out = f["data"][()]
    f.close()
    return out


def pickle_read(data):
    with open("test.pkl", "rb") as f:
        out = pickle.load(f)
    return out


def pytables_read(data):
    f = tables.open_file("pytables.h5", mode="r")
    out = f.root.columns.data[()]
    f.close()
    return out


def zarr_read(data):
    return zarr.load("out.zip")


perfplot.show(
    setup=setup,
    kernels=[
        npy_read,
        hdf5_read,
        pickle_read,
        pytables_read,
        zarr_read,
    ],
    n_range=[2 ** k for k in range(28)],
    xlabel="len(data)",
    logx=True,
    logy=True,
)