Numerical compression



  • I recently stumbled on a couple of libraries for compressing data (e.g. taking a std::vector<float> and transforming it into a smaller array), namely blosc and zfp.

    I spent some time playing with zfp, which offers several modes for lossy compression, e.g. with guaranteed accuracy (i.e. the error is never larger than a given value), and I have to say that I'm pretty impressed with the results. On arrays of about 1 million elements, it manages to compress them by a factor of 10 or so while keeping the error well within what's acceptable for my needs (on the order of 0.01% or even less), all that in a time that's basically instantaneous compared to what I'm doing with those numbers before/after. So I'm pretty happy about it.

    (zfp relies on the data having some sort of spatial structure, so it's not suited to e.g. random numbers, but in my case that's ok)

    I also like zfp's ability to do random access into a compressed array. I haven't tested it, but I can see how it could help us with some applications that compute a huge array that barely fits into memory. It does seem to come with its own limitations (I don't like that it's not obvious how the parameter controlling the compression in that mode relates to what actually gets lost in the data), but the idea is tempting and I will look into it.
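
    From a quick read of the docs, the random-access side looks roughly like this (a sketch based on my reading, not something I've actually run; the header name and exact constructor may differ between zfp versions):

    ```cpp
    #include <cstddef>
    #include "zfp/array2.hpp"   // older zfp releases ship this header as "zfparray2.h"

    int main()
    {
        const std::size_t nx = 1000, ny = 1000;

        // Compressed 2D array in zfp's fixed-rate mode: 8 bits per value instead
        // of the 64 bits of a plain double, i.e. roughly 8x less memory.
        zfp::array2<double> a(nx, ny, 8.0 /* rate, in bits per value */);

        // Reads and writes look like normal element access; the library
        // (de)compresses the underlying blocks behind the scenes via a small cache.
        a(10, 20) = 3.14;
        double x = a(10, 20);
        return x > 0 ? 0 : 1;
    }
    ```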

    Mostly out of curiosity, has anyone else here ever used those libraries? Is there any pitfall or issue with those libraries that might not be apparent with a quick test? Are there other libraries that I might not be aware of and that might give good/better results?


  • Discourse touched me in a no-no place

    @remi This is an interesting question to me, despite my data problem being mainly temporal and not spatial.



  • Yeah, I'm curious too. I've seen methods like this pop up a few times (I think I stumbled across the two libraries that you mention before), but never really did any in-depth evaluation.


  • Considered Harmful

    @remi said in Numerical compression:

    it's not suited to e.g. random numbers,

    :um-pendant: All compression algorithms rely on the data having some predictability (low entropy). Random numbers are, by definition, unpredictable (maximum entropy). No compression algorithm is suited to random numbers.

    But still :technically-correct:.


    Filed under: NB4 But you forgot the compression algorithm where "1" expands to "1238590476895247687348957439858903486974389573894205798437507348573984276098734673489670239678493"



  • @dkf said in Numerical compression:

    @remi This is an interesting question to me, despite my data problem being mainly temporal and not spatial.

    I think that it's not an issue, as far as zfp is concerned. You can just pass it a 1D array (temporal) and it will work with it.

    Note that from my tests, zfp still works pretty well on any data, well structured or not. I passed it arrays that are not very structured and it still managed to compress them reasonably. But it definitely seems to work better when it has enough data to detect structures (I think?).

    For example, I tried it on the Cholesky decomposition of the covariance matrix of highly correlated data, i.e. a triangular matrix that is highly structured. I happened to have that matrix stored as a 1D array, so successive runs of N, N-1, N-2 etc. values represent the consecutive rows, meaning that you have runs of highly structured data that are abruptly interrupted when going to the next row. On a matrix of about 100x100, I can achieve a compression factor of about 3 before I start seeing too many errors, but on a matrix 1000x1000 the factor is about 6, so twice as big. Clearly with longer runs of values it's able to extract more structure and compress better. With some other matrices that have different structures, I got up to a compression factor of 10, which I find pretty impressive.
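
    For concreteness, by "stored as a 1D array" I mean the usual packed row-major layout of the triangle, i.e. something like this (illustrative helper, names made up for the example, not my actual code):

    ```cpp
    #include <cstddef>
    #include <vector>

    // Packed row-major storage of an upper-triangular NxN matrix: row 0 holds N
    // values, row 1 holds N-1, row 2 holds N-2, etc., all concatenated into one array.
    struct PackedUpperTriangular {
        std::size_t n;
        std::vector<double> data;   // n*(n+1)/2 values in total

        explicit PackedUpperTriangular(std::size_t size)
            : n(size), data(size * (size + 1) / 2) {}

        // Element (i, j) with i <= j: skip rows 0..i-1 (N + (N-1) + ... values),
        // then step to column j within row i.
        double& operator()(std::size_t i, std::size_t j)
        {
            const std::size_t row_start = i * n - i * (i - 1) / 2;
            return data[row_start + (j - i)];
        }
    };
    ```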

    On the coding side, zfp is a bit tricky to use: there are about 10 calls to make to properly compress an array (with initialisation, clean-up etc.), but once you've got it, you can just wrap all that into simple helpers, no big deal. And it has very nifty zfp::array1<double>, zfp::array2<double>, zfp::array3<double>, zfp::array4<double> classes for up to 4D arrays that allow in-place compression, and those are literally just one line to use (declare them as zfp::array instead of e.g. std::vector, and that's it!). The only drawback is that they only work with one of the compression modes, which... doesn't seem very good (because it has to use the same amount of space for all values). So I haven't really tried them.
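
    The sort of helper I mean boils down to something like this (a sketch of the fixed-accuracy path, written from memory rather than copied from my actual code, so treat the helper name and details as illustrative):

    ```cpp
    #include <cstddef>
    #include <vector>
    #include <zfp.h>

    // Illustrative helper: compress a 1D double array with a fixed absolute error
    // bound (accuracy mode). Returns the compressed bytes; empty means failure.
    std::vector<unsigned char> zfp_pack_1d(const std::vector<double>& data, double tolerance)
    {
        // Describe the uncompressed data: a 1D array of doubles.
        zfp_field* field = zfp_field_1d(const_cast<double*>(data.data()),
                                        zfp_type_double, data.size());

        // Open a compression stream and select the fixed-accuracy mode.
        zfp_stream* zfp = zfp_stream_open(nullptr);
        zfp_stream_set_accuracy(zfp, tolerance);

        // Allocate a buffer big enough for the worst case, then attach it.
        std::vector<unsigned char> buffer(zfp_stream_maximum_size(zfp, field));
        bitstream* stream = stream_open(buffer.data(), buffer.size());
        zfp_stream_set_bit_stream(zfp, stream);
        zfp_stream_rewind(zfp);

        // Compress; the return value is the number of bytes written (0 on failure).
        const std::size_t compressed_size = zfp_compress(zfp, field);

        // Clean up and shrink the buffer to the actual compressed size.
        zfp_field_free(field);
        zfp_stream_close(zfp);
        stream_close(stream);
        buffer.resize(compressed_size);
        return buffer;
    }
    ```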

    So from my tests, I like zfp a lot. Of course, it depends on the data and how you handle it, and there are several internals that may or may not matter, and I'm just proficient enough in this kind of numerical thing to know that the devil is definitely in the details!

    OTOH, I was pretty disappointed by blosc. It says something about not being a compressor but a meta-compressor that can be used with any compression algorithm, but that got into cricket territory far too quickly for me to understand whether it means I'm using it right or not. But the bottom line is, I barely get a compression ratio of 1.1 or 1.2, which isn't very impressive. Of course, it being lossless probably explains a lot (I did not try zfp in lossless mode to see if it gave similar numbers).
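
    For context, the kind of call I mean is the basic one-shot blosc 1.x API, roughly as below (an illustrative sketch, not my actual code); as far as I understand it, the shuffle filter and the typesize are the main knobs for float data:

    ```cpp
    #include <vector>
    #include <blosc.h>

    // Illustrative helper: one-shot blosc 1.x compression of a float array, with
    // the byte-shuffle filter enabled (the part that is supposed to help on
    // floating-point data).
    std::vector<char> blosc_pack(const std::vector<float>& data)
    {
        blosc_init();

        const size_t nbytes = data.size() * sizeof(float);
        std::vector<char> out(nbytes + BLOSC_MAX_OVERHEAD);

        const int csize = blosc_compress(5,               // compression level, 0..9
                                         BLOSC_SHUFFLE,   // byte-shuffle filter
                                         sizeof(float),   // typesize in bytes
                                         nbytes, data.data(), out.data(), out.size());
        // csize <= 0 means the data did not fit or an error occurred.
        out.resize(csize > 0 ? static_cast<size_t>(csize) : 0);

        blosc_destroy();
        return out;
    }
    ```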


  • Discourse touched me in a no-no place

    @remi said in Numerical compression:

    I think that it's not an issue, as far as zfp is concerned. You can just pass it a 1D array (temporal) and it will work with it.

    Hmm. It looks like the whole thing founders for me on deployability for now, but I did come across some interesting ideas while researching this, such as using a separate compression stream per byte of each value, which apparently works well (presumably because the high-order bits are usually highly correlated even if the low ones aren't). Oh well. Not yet a pressing problem for me right now; I'm not running out of space yet!
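
    The transform itself is just a transposition of the bytes, something along these lines (an illustrative sketch, not anything we've actually deployed; the function name is made up):

    ```cpp
    #include <array>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Split an array of floats into sizeof(float) byte planes, so that byte k of
    // every value ends up in stream k. Each plane is then fed to an ordinary
    // byte-oriented compressor; the high-order planes tend to compress well
    // because those bits are highly correlated across neighbouring values.
    std::array<std::vector<std::uint8_t>, sizeof(float)>
    split_byte_planes(const std::vector<float>& values)
    {
        std::array<std::vector<std::uint8_t>, sizeof(float)> planes;
        for (auto& plane : planes)
            plane.resize(values.size());

        for (std::size_t i = 0; i < values.size(); ++i) {
            std::uint8_t bytes[sizeof(float)];
            std::memcpy(bytes, &values[i], sizeof(float));   // avoid aliasing tricks
            for (std::size_t k = 0; k < sizeof(float); ++k)
                planes[k][i] = bytes[k];
        }
        return planes;
    }
    ```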

    [EDIT]: Put a note with my conclusions on this in our issue database (tagged bluesky because I don't know if we'll ever get around to it).



  • @dkf said in Numerical compression:

    using a separate compression stream per byte of each value, which apparently works well (presumably because the high-order bits are usually highly correlated even if the low ones aren't).

    That's... clever. Though that borders on the point where I wonder whether it's clever or crazy. I'd love to see a library that makes it easy, and test it.

    I'm not going to do it myself, ever, because 1) it requires a bit too much bit-fiddling for my liking (another topic where I know just enough to know it's easy to get it wrong!) and 2) since you then need to compress streams of bits, not bytes or other formats, most existing compression libraries can't be used directly (since usually they work on at least bytes!), so you probably need to start from a much lower level, and I don't know a thing about writing compression algorithms themselves.

    @dkf said in Numerical compression:

    Oh well. Not yet a pressing problem for me right now; I'm not running out of space yet!

    I did have that compression idea in the back of my mind for years, for the same reason as you. Years ago I implemented a very crude hack where I rewrote some of the values from float to unsigned short (32 to 16 bits) by using a simple rescaling (i.e. value_in_float = base + value_in_short*increment), but that obviously only worked for one specific array of data where I knew what the expected range (and required accuracy) was. I recently started working on a different formulation of the same mathematical algorithm, and ended up having to store a full (well, half since it's symmetric) matrix instead of a vector, basically, and that suddenly made this compression issue much more urgent.
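
    That hack was more or less the following (reconstructed as a sketch with made-up names; here the range is derived from the data, whereas in the real code it was known a priori):

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Crude 32 -> 16 bit quantisation: store each value as an offset from `base`
    // in units of `increment`, so value_in_float = base + value_in_short * increment.
    struct Quantised {
        float base;
        float increment;
        std::vector<std::uint16_t> values;
    };

    Quantised quantise(const std::vector<float>& data)
    {
        const auto [lo, hi] = std::minmax_element(data.begin(), data.end());

        Quantised q;
        q.base = *lo;
        q.increment = (*hi - *lo) / 65535.0f;           // map the range onto 16 bits
        if (q.increment == 0.0f) q.increment = 1.0f;    // degenerate all-equal case

        q.values.reserve(data.size());
        for (const float v : data)
            q.values.push_back(static_cast<std::uint16_t>(
                std::lround((v - q.base) / q.increment)));
        return q;
    }

    // Reconstruction, with an error of at most increment/2 per value.
    inline float dequantise(const Quantised& q, std::size_t i)
    {
        return q.base + q.values[i] * q.increment;
    }
    ```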

    Though of course the issue here is going from N to N^2, so while a smaller constant (i.e. compression ratio) helps, it's not really the perfect solution.



  • @remi said in Numerical compression:

    That's... clever. Though that borders on the point where I wonder whether it's clever or crazy. I'd love to see a library that makes it easy, and test it.
    I'm not going to do it myself, ever, because 1) it requires a bit too much bit-fiddling for my liking (another topic where I know just enough to know it's easy to get it wrong!) and 2) since you then need to compress streams of bits, not bytes or other formats, most existing compression libraries can't be used directly (since usually they work on at least bytes!), so you probably need to start from a much lower level, and I don't know a thing about writing compression algorithms themselves.

    I was looking at doing that as well at one point, but more for off-line compression. It did work to some degree for our data, and was comparable to other dedicated libraries (some of which relied on the same trick). In the end, none of the options we found produced enough compression to really make it worthwhile over just getting more storage, though.


  • Discourse touched me in a no-no place

    @remi said in Numerical compression:

    since you then need to compress streams of bits

    That's not a problem in practice. You feed in a byte stream and let the bits look after themselves. The real trick is knowing that doing the partitioning of the word beforehand is worthwhile. (I'd guess that you'd use a basic deflater rather than a gzip stream for this; you need special knowledge to reassemble the decompressed data anyway, so there's no benefit to having extra metadata in the compressed streams.)
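
    With zlib that's just a matter of asking for raw deflate via negative windowBits, roughly like this (a sketch with a made-up helper name, not code I've actually run):

    ```cpp
    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    // Compress one byte plane as a raw deflate stream (no zlib/gzip header or
    // checksum). Error handling trimmed for brevity.
    std::vector<std::uint8_t> deflate_plane(const std::vector<std::uint8_t>& plane)
    {
        z_stream strm{};
        // windowBits = -15 requests raw deflate; 8 is the default memLevel.
        deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

        std::vector<std::uint8_t> out(deflateBound(&strm, plane.size()));
        strm.next_in   = const_cast<Bytef*>(plane.data());
        strm.avail_in  = static_cast<uInt>(plane.size());
        strm.next_out  = out.data();
        strm.avail_out = static_cast<uInt>(out.size());

        deflate(&strm, Z_FINISH);   // one shot: deflateBound() sized the buffer for us
        out.resize(strm.total_out);
        deflateEnd(&strm);
        return out;
    }
    ```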



  • @dkf yeah, that's too complicated for what I'm confident I can do without royally messing it up, which is why I need a library to do that for me... plus, same as you I guess, I don't really have time to, well, spend that much time on something like this.

    (especially since, ultimately, I'm not sure any sort of generic compressor can do phenomenally better than another, once you've excluded the crappy ones. All compressors are just trying to find patterns in the data, and there are only so many patterns you can find without actually knowing what the data represents and which parts of it matter. The best compressor is one written specifically for the exact data type you're handling, in this specific application and for the specific way you use it afterwards, but that's really too much work in most cases... so a generic one is probably the best trade-off you can get!)

