Python: compress numpy arrays efficiently

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22400652/

Compress numpy arrays efficiently

Tags: python, arrays, numpy, compression, lossless-compression

Asked by Basj

I have tried various methods of data compression when saving some numpy arrays to disk.

These 1D arrays contain data sampled at a certain sampling rate (this can be sound recorded with a microphone, or any other measurement from any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is now discrete data).

I tried with HDF5 (h5py):

f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)

but this is quite slow, and the compression ratio is not the best we can expect.

I also tried with

numpy.savez_compressed()

but once again it may not be the best compression algorithm for such data (described before).

What would you choose to get a better compression ratio on a numpy array containing such data?

(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm to numpy data?)

Answered by SiggyF

You might want to try blz. It can compress binary data very efficiently.

import blz

# myarray is the 1D numpy array from the question
# this stores the array compressed in memory
barr_mem = blz.barray(myarray)
# this stores the array compressed on disk
barr_disk = blz.barray(myarray, rootdir='arrays')

It stores arrays either on file or compressed in memory. Compression is based on blosc. See the scipy video for a bit of context.

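If you want to use the underlying blosc compressor directly, the python-blosc bindings can also pack a numpy array. A minimal sketch, assuming python-blosc is installed and myarray stands in for the sampled data from the question:

import numpy
import blosc

myarray = numpy.cumsum(numpy.random.randn(1000000))  # placeholder continuous-ish signal

# pack_array / unpack_array round-trip the array losslessly
packed = blosc.pack_array(myarray, clevel=9, shuffle=blosc.SHUFFLE)
restored = blosc.unpack_array(packed)
assert numpy.array_equal(restored, myarray)
print(len(packed), myarray.nbytes)  # compressed vs. raw size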

Answered by Eelco Hoogendoorn

What constitutes the best compression (if any) depends heavily on the nature of the data. Many kinds of measurement data are virtually incompressible, if loss-free compression is indeed required.

The pytables docs contain a lot of useful guidelines on data compression. They also detail speed tradeoffs and so on; as it turns out, higher compression levels are usually a waste of time.

http://pytables.github.io/usersguide/optimization.html

Note that this is probably as good as it will get. For integer measurements, a combination of a shuffle filter with a simple zip-type compression usually works reasonably well. This filter very efficiently exploits the common situation where the most significant byte is usually 0, and is only included to guard against overflow.

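A rough PyTables sketch of that shuffle-plus-zlib combination (the file name, node name, and sample data below are made up for illustration, not taken from the answer):

import numpy
import tables

arr = numpy.arange(1000000, dtype=numpy.int32)  # placeholder integer measurements

# shuffle=True applies the byte shuffle filter before zlib compression
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
with tables.open_file('measurements.h5', mode='w') as f:
    f.create_carray(f.root, 'data', obj=arr, filters=filters)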

Answered by Alex I

  1. Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.

  2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.

  3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1] (a minimal sketch follows this list).

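A minimal sketch of the delta-compress step from item 3, assuming a hypothetical 1D int32 signal called data (the denoise and bandwidth-limit stages are omitted here):

import numpy
import zlib

# hypothetical smooth int32 signal
data = (numpy.sin(numpy.linspace(0, 1000, 1000000)) * (1 << 20)).astype(numpy.int32)

# delta compress: y[n] = x[n] - x[n-1] (prepend 0 so that y[0] == x[0])
delta = numpy.diff(data, prepend=0)

# the small, slowly varying deltas compress much better than the raw samples
compressed = zlib.compress(delta.tobytes(), 9)
print(len(compressed), data.nbytes)

# reconstruction is exact: a cumulative sum undoes the delta step
decoded = numpy.frombuffer(zlib.decompress(compressed), dtype=numpy.int32)
restored = numpy.cumsum(decoded, dtype=numpy.int32)
assert numpy.array_equal(restored, data)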

EDIT: An illustration:

from pylab import *
import numpy
import numpy.random
import os.path
import subprocess

# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (sin( 2 * pi * linspace(0,N,N) / 100 ) * (1<<23) + \
    numpy.random.randn(N) * (1<<7)).astype(int32)

numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size

subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise

data_quantized = data // (1<<8)  # integer (floor) division, as in the original Python 2 code
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it

Answered by Mike

First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically with roughly continuous datasets. It very cleverly rearranges the bits to be compressed so that (for continuous data) the bits change slowly, which means they can be compressed better. In my experience it slows the compression down a very little bit, but can substantially improve the compression ratios. It is not lossy, so you really do get the same data out as you put in.

If you don't care about the accuracy so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision, rather than a relative precision. For example, if you pass scaleoffset=8, but your data points are smaller than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1, and don't think you can hear differences smaller than a part in a million, you can pass scaleoffset=6 and get great compression without much work.

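A rough h5py sketch combining both of these options (the file name, dataset name, and sample data are only illustrative; scaleoffset=6 keeps about six decimal digits of absolute precision and is the only lossy step):

import numpy
import h5py

myarray = numpy.sin(numpy.linspace(0, 100, 1000000))  # placeholder data scaled to about +/-1

with h5py.File('signal.h5', 'w') as f:
    f.create_dataset('myarray1', data=myarray,
                     compression='gzip', compression_opts=9,
                     shuffle=True, scaleoffset=6)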

But for audio specifically, I expect that you are right in wanting to use FLAC, because its developers have put in huge amounts of thought, balancing compression with preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.

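If you prefer to skip the WAV intermediate step, one option (an assumption on my part, not something the answer mentions) is the soundfile package, which writes FLAC directly from a numpy array; quantizing to a 16- or 24-bit integer subtype is the only lossy step:

import numpy
import soundfile as sf

samplerate = 44100
# placeholder 10-second signal scaled into [-1, 1]
myarray = numpy.sin(2 * numpy.pi * 440 * numpy.arange(samplerate * 10) / samplerate)

sf.write('signal.flac', myarray, samplerate, subtype='PCM_24')
restored, sr = sf.read('signal.flac', dtype='float64')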

Answered by Albert

What I do now:

import gzip
import numpy

f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()
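
Reading it back works the same way, since numpy.load accepts a file object (a sketch, assuming the file written above):

import gzip
import numpy

f = gzip.GzipFile("my_array.npy.gz", "r")
my_array = numpy.load(f)
f.close()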

Answered by Igor Podolak

Saving HDF5 files with compression can be very quick and efficient: it all depends on the compression algorithm, and whether you want it to be quick while saving, or while reading it back, or both. And, naturally, on the data itself, as was explained above. GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms that I have found to give quite good compression and to be quick on both ends. The downside of BLOSC is that it is not implemented in all implementations of HDF5, so your program may not be portable. You always need to run at least some tests to select the best configuration for your needs.

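A small sketch of that kind of test with PyTables (the complibs, compression level, and sample data are just examples; the timings and file sizes will of course vary with your data):

import os
import time
import numpy
import tables

arr = numpy.cumsum(numpy.random.randn(2000000)).astype(numpy.float32)  # placeholder continuous signal

for complib in ('zlib', 'bzip2', 'blosc'):
    filters = tables.Filters(complevel=5, complib=complib, shuffle=True)
    fname = 'test_%s.h5' % complib
    t0 = time.time()
    with tables.open_file(fname, 'w') as f:
        f.create_carray(f.root, 'data', obj=arr, filters=filters)
    print(complib, round(time.time() - t0, 3), 's', os.path.getsize(fname), 'bytes')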