Better compression algorithm for vector data? - c++

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.

You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than being duplicated for every entry in which it occurs. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
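A minimal sketch of that factoring step in C++, assuming the 4-byte fixed-point points described in the question (the Point struct and the output layout are illustrative, not an established format):
#include <cstdint>
#include <map>
#include <utility>
#include <vector>
struct Point { uint32_t x, y; };  // fixed-point XXXXYYYY pairs
// Regroup points so each (high-X, high-Y) combination is written once,
// followed by only the low 16 bits of every point in that group.
std::vector<uint16_t> factorHighBytes(const std::vector<Point>& pts) {
    std::map<std::pair<uint16_t, uint16_t>, std::vector<Point>> groups;
    for (const Point& p : pts)
        groups[{uint16_t(p.x >> 16), uint16_t(p.y >> 16)}].push_back(p);
    std::vector<uint16_t> out;
    for (const auto& [hi, members] : groups) {
        out.push_back(hi.first);                  // common XX
        out.push_back(hi.second);                 // common YY
        out.push_back(uint16_t(members.size()));  // group length
        for (const Point& p : members) {
            out.push_back(uint16_t(p.x & 0xFFFF));  // varying xx
            out.push_back(uint16_t(p.y & 0xFFFF));  // varying yy
        }
    }
    return out;  // run this buffer through zlib afterwards
}
The same idea applies to the IDs with constant high bytes.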
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.

Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/

Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes allow records to be compacted based upon their similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.

It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests combining BWT with algorithms other than zlib, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.

Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode (abs(x) << 1) + (x < 0 ? 1 : 0)).
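As a concrete illustration, here is a rough sketch of one universal code (Elias gamma) applied to zig-zag-mapped deltas; the bit-vector output and the helper names are mine, and deltas are assumed to stay well inside the 32-bit range:
#include <cstdint>
#include <vector>
// Map a signed delta to an unsigned value so small magnitudes stay small:
// 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
uint32_t zigzag(int32_t x) { return (uint32_t(x) << 1) ^ uint32_t(x >> 31); }
// Append the Elias gamma code of v (v >= 1) to a bit vector:
// N zero bits followed by the (N+1)-bit binary representation of v.
void eliasGamma(uint32_t v, std::vector<bool>& bits) {
    int n = 0;
    while (n < 31 && (1u << (n + 1)) <= v) ++n;    // n = floor(log2(v))
    for (int i = 0; i < n; ++i) bits.push_back(false);
    for (int i = n; i >= 0; --i) bits.push_back((v >> i) & 1);
}
// Gamma codes start at 1, so shift the zig-zag value up by one.
void encodeDelta(int32_t delta, std::vector<bool>& bits) {
    eliasGamma(zigzag(delta) + 1, bits);
}
Small deltas come out as only a few bits, while occasional large jumps still encode correctly, just with longer codes.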

You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
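A rough sketch of the kind of structures this describes (the field names and types are guesses for illustration, not a defined format):
#include <cstdint>
#include <utility>
#include <vector>
// Coordinates stored relative to a per-file false origin kept in the header.
struct Header { int32_t originX, originY; };
struct Node { uint32_t id; int32_t x, y; };          // relative to origin
struct Link {
    uint32_t fromNode, toNode;                       // indices into the NodeList
    std::vector<std::pair<int32_t, int32_t>> shape;  // intermediate xy values
};
struct RoadFile {
    Header header;
    std::vector<Node> nodes;   // NodeList
    std::vector<Link> links;   // LinkList
};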
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss of connectivity would likely cause more problems than a loss of shape - and that is exactly what would happen if you used truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

Related

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.s. it will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeated parts and replaces them with a reference to the same data observed earlier in the stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least of which is storing numbers as a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
Whether text or binary, if you can sort your data before sending, you increase the chance that the gzip compressor will find redundant parts, which will most likely improve the compression ratio.
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the added bits to go to the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
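As a sketch of the 2D-predictor idea (similar in spirit to PNG's Average filter, but over plain integers and purely illustrative):
#include <cstdint>
#include <vector>
// table[r][c] holds the original values; the result holds residuals that
// should be small (and compress well) when rows and columns are correlated.
std::vector<std::vector<int32_t>>
predictResiduals(const std::vector<std::vector<int32_t>>& table) {
    std::vector<std::vector<int32_t>> res = table;
    for (size_t r = 0; r < table.size(); ++r) {
        for (size_t c = 0; c < table[r].size(); ++c) {
            int32_t left = (c > 0) ? table[r][c - 1] : 0;
            int32_t up   = (r > 0) ? table[r - 1][c] : 0;
            int32_t predicted = (left + up) / 2;   // simple 2D predictor
            res[r][c] = table[r][c] - predicted;   // send this instead
        }
    }
    return res;
}
The decoder reverses the process in the same order, adding each prediction back as it goes.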
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
Use binary instead of text.
A float written as text with eight significant digits, plus a decimal separator, plus a field separator, consumes about 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
But although this makes a big difference for the uncompressed file, the difference may disappear after compression, since the compression algorithm should implicitly take care of that. Still, it is worth trying.

Highly compressing a grid of numbers

I have a square grid which contains numbers and I need to compress it a lot so it can easily be transferred over a network. For instance, I need to be able to compress a 40x40 grid into less than 512 bytes, regardless of the values of the numbers in the grid. That is my basic requirement.
Each of the grid's cells contains a number from 0-7, so each cell can fit in 3 bits.
Does anyone know of a good algorithm that can achieve what I want?
You can encode your information differently. You don't need to assign every number from 0 to 7 a code with the same number of bits. You can assign codes based on how often each number appears in the sequence.
First read the whole sequence, counting the number of appearances of each number.
Based on those counts you can assign a code to each number.
If you assign the codes using, for example, Huffman coding, the codes will be prefix codes, meaning no extra character is needed to separate numbers.
There are certain variations that you can introduce on the algorithm based on your test results to fine tune the compression ratio.
I used this technique in a project (university) and it provides, in general, good results. At least it should approximate your theoretical 3 bits per character, and it can be much better if the distribution of probabilities helps.
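A compact sketch of deriving Huffman code lengths from the symbol counts (illustrative only: zero counts are nudged up to one, and a real coder would still need to emit the actual codes):
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>
// Given counts for symbols 0..7, return the Huffman code length of each.
std::vector<int> huffmanCodeLengths(const std::vector<uint64_t>& counts) {
    if (counts.empty()) return {};
    struct Node { uint64_t weight; int left, right; };   // -1 children = leaf
    std::vector<Node> nodes;
    using Item = std::pair<uint64_t, int>;               // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (size_t s = 0; s < counts.size(); ++s) {
        nodes.push_back({counts[s] + 1, -1, -1});        // +1 avoids zero weights
        pq.push({nodes.back().weight, int(s)});
    }
    while (pq.size() > 1) {                              // merge two lightest trees
        auto [wa, a] = pq.top(); pq.pop();
        auto [wb, b] = pq.top(); pq.pop();
        nodes.push_back({wa + wb, a, b});
        pq.push({wa + wb, int(nodes.size()) - 1});
    }
    // Walk the tree: each leaf's depth is its code length.
    std::vector<int> lengths(counts.size(), 0);
    std::vector<std::pair<int, int>> stack{{int(nodes.size()) - 1, 0}};
    while (!stack.empty()) {
        auto [idx, depth] = stack.back(); stack.pop_back();
        if (nodes[idx].left < 0)
            lengths[idx] = depth == 0 ? 1 : depth;       // single-symbol edge case
        else {
            stack.push_back({nodes[idx].left, depth + 1});
            stack.push_back({nodes[idx].right, depth + 1});
        }
    }
    return lengths;
}
Frequent symbols end up with short codes, rare ones with long codes, which is where the gain over a flat 3 bits per cell comes from.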
What you want to do is perform a Burrows–Wheeler transform on your data, and then compress it. Run-length encoding will be enough in this case.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
This will likely outperform Huffman in your case.
It is true that in some cases you will need more than 512 bytes. So in your protocol, just make an exception for "perverse" grids. But in the general case you should be under 512 bytes easily.
As others have stated, the problem as stated is not possible as 600 bytes are required to represent all possible grids. 600 bytes is from 40 rows, 40 columns, 3 bits per cell, and 8 bits per byte (40 * 40 * 3 / 8). As Kerrek SB explained in the comments, you pack 8 cells into 3 bytes.
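For reference, a small sketch of that straight 3-bits-per-cell packing, which gives exactly 600 bytes for a 40x40 grid (assuming the cell values have already been validated to be 0-7):
#include <cstdint>
#include <vector>
// Pack values 0-7 (3 bits each) tightly into bytes: 40*40 cells -> 600 bytes.
std::vector<uint8_t> packCells(const std::vector<uint8_t>& cells) {
    std::vector<uint8_t> out((cells.size() * 3 + 7) / 8, 0);
    size_t bitPos = 0;
    for (uint8_t v : cells) {
        for (int b = 2; b >= 0; --b, ++bitPos)           // write 3 bits, MSB first
            if ((v >> b) & 1)
                out[bitPos / 8] |= uint8_t(1u << (7 - bitPos % 8));
    }
    return out;
}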
In your own comments, you mentioned this is game state being transferred over the network. Assuming you have a mechanism to ensure reliable transport of the data, then if there is a reasonable limit to the number of cells that can change between updates, or if you are allowed to send an update whenever a certain number of cells have changed, you can achieve a representation in 512 bytes. If you use 1 bit per cell to record whether or not it has changed, you need 200 bytes. That leaves 312 bytes to represent the new values of the changed cells, so you can represent up to 312*8/3 = 832 modified cells.
As an aside, this representation can represent up to 1064 changed cells in less than 600 bytes.

How do I compress a large number of similar doubles?

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double: s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
# forward: quantize each sample to a signed fixed-point Z-score
scaled_samples = [int(16384 * (u - m) / s) for u in samples]
# inverse: reconstruct an approximation of each original sample
recovered = [s * (z / 16384.0) + m for z in scaled_samples]
Your Z-scores retain a pleasant, easy-to-work-with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score, giving a range of roughly +/- 32,767. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have roughly +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, for example, the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.), and the header would hold a single uncompressed sample. Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
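A hedged sketch of that fixed-block layout, assuming a simple quantized-delta scheme inside each block (the names and the scale parameter are invented for illustration):
#include <cstdint>
#include <vector>
// One independently decodable block: a raw baseline plus compact deltas.
// Decoding block i never touches any other block, so a random access is:
// blockIndex = sampleIndex / kBlockSize, then decode within that block.
constexpr size_t kBlockSize = 4096;
struct Block {
    double baseline;                 // uncompressed sample (the "header")
    std::vector<int32_t> deltas;     // quantized differences from the baseline
};
double sampleAt(const std::vector<Block>& blocks, size_t index, double scale) {
    const Block& b = blocks[index / kBlockSize];
    int32_t d = b.deltas[index % kBlockSize];
    return b.baseline + d * scale;   // reverse the (assumed) quantization
}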
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
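For the shared-exponent idea, a small sketch using frexp/ldexp; it assumes the group really does share one binary exponent, which you would have to verify before relying on it:
#include <cmath>
#include <cstdint>
#include <vector>
// Store one exponent for the whole group and a scaled integer mantissa per
// value.  Scaling by 2^53 keeps the full 53-bit significand of a double.
struct ExpGroup { int exponent; std::vector<int64_t> mantissas; };
ExpGroup packSameExponent(const std::vector<double>& values) {
    ExpGroup g{};
    std::frexp(values.front(), &g.exponent);               // shared exponent
    for (double v : values) {
        int e;
        double m = std::frexp(v, &e);                      // v = m * 2^e, 0.5 <= |m| < 1
        g.mantissas.push_back(int64_t(std::ldexp(m, 53))); // assumes e == g.exponent
    }
    return g;
}
double unpack(const ExpGroup& g, size_t i) {
    return std::ldexp(double(g.mantissas[i]), g.exponent - 53);
}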

Shannon's entropy formula. Help my confusion

my understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but the previous understanding is what I relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1's followed by 100 '0's = 200 bits. The alphabet is {0,1}, and the base of the entropy is 2. The probability of symbol "0" is 0.5 and of "1" is 0.5. So the entropy is 1, i.e. 1 bit to represent each bit.
However you can run-length encode it with something like 100 / 1 / 100 / 0, where each pair is the number of bits to output followed by the bit value. It seems like I have a representation smaller than the data, especially if you increase the 100 to a much larger number.
I'm using: http://en.wikipedia.org/wiki/Information_entropy as reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers, my follow-up questions are: would you apply the entropy formula to a particular instance of a message to try to find out its information content? Would it be valid to take the message "aaab" and say the entropy is ~0.811? If yes, then what is the entropy of 1...10...0 (where the 1s and 0s are each repeated n times) according to the entropy formula? Is the answer 1?
Yes, I understand that you are creating a random variable from your input symbols and guessing at the probability mass function based on your message. What I'm trying to confirm is that the entropy formula does not take into account the position of the symbols in the message.
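As a quick check of the numbers in the edit, here is a small sketch that evaluates the formula from observed symbol frequencies; note that it only ever looks at counts, never positions, so "aaab" gives about 0.811 and any arrangement of n 1s and n 0s gives exactly 1:
#include <cmath>
#include <map>
#include <string>
// Shannon entropy (bits per symbol) of the empirical symbol distribution.
// Only the counts matter; the positions of the symbols are never used.
double entropy(const std::string& msg) {
    std::map<char, size_t> counts;
    for (char c : msg) ++counts[c];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = double(kv.second) / msg.size();
        h -= p * std::log2(p);
    }
    return h;
}
// entropy("aaab") ~= 0.811
// entropy(std::string(100, '1') + std::string(100, '0')) == 1.0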
Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake was. If you're able to compress something into a form that was smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (Formally, its entropy is exactly zero: the symbol with probability 1 contributes -1 * log(1) = 0, and the convention 0 * log 0 = 0 takes care of the symbol that never appears.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price by having to devise a scheme that knows how to compress and decompress this data; that itself takes some information.
However you can run-length encode it with something like 100 / 1 / 100 / 0, where each pair is the number of bits to output followed by the bit value. It seems like I have a representation smaller than the data, especially if you increase the 100 to a much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original data tells you something important. Namely, it says that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.
Have a look at Kolmogorov complexity
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to the alphabet {0,1}. For your example use {0...0, 1...1} (a hundred 0's and a hundred 1's).
Your encoding works in this example, but it is possible to conceive an equally valid case: 010101010101... which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!
John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, of which -log(base 2) is 200 bits, as you expect. However, the entropy of that string (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you run-length encode the string, this particular string will compress to a small length, but averaged over all 2^200 strings, the coding will not do well. With luck, it will average out to about 200 bits, but not less.
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more strings like it, then you are really saying its probability is larger than 0.5^200. That is, you are making a different assumption about the underlying probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: after converting the data to an array of XOR deltas, zlib shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0010
0000
1000
A bit of code to illustrate (C++):
compressed[0] = uncompressed[0];
for (size_t i = 1; i < n; ++i)                    // n = number of integers
    compressed[i] = uncompressed[i - 1] ^ uncompressed[i];
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0];
for (size_t i = 1; i < n; ++i)
    uncompressed[i] = uncompressed[i - 1] ^ compressed[i];
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: instead of storing the numbers themselves, store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you would otherwise use; that's a factor of 4 right there. If you need to be prepared for bigger gaps than that, designate the high bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and both run very fast with very little memory (as opposed to, say, bzip2).
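A small sketch of that delta-plus-continuation-bit idea, written here as a standard little-endian varint (7 payload bits per byte), which differs slightly from the exact 8-bits-then-8-more layout described; it also assumes non-decreasing values, as in the 1 1 2 2 2 3 5 example, so the deltas are non-negative:
#include <cstdint>
#include <vector>
// Encode each value as a delta from its predecessor, 7 bits per byte with
// the high bit meaning "more bytes follow".
std::vector<uint8_t> encodeDeltas(const std::vector<uint32_t>& values) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t v : values) {
        uint32_t delta = v - prev;                 // assumes non-decreasing input
        prev = v;
        while (delta >= 0x80) {                    // need a continuation byte
            out.push_back(uint8_t(delta) | 0x80);
            delta >>= 7;
        }
        out.push_back(uint8_t(delta));             // final byte, high bit clear
    }
    return out;
}
Signed data would need a zig-zag mapping first so that small negative deltas also stay small.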
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break each of your ints up into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like run-length encoding or zlib. This is my favourite of the methods I present.
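A quick sketch of that byte-plane split (the plane order is one arbitrary choice); the output buffer is what you would then hand to RLE or zlib:
#include <cstdint>
#include <vector>
// Transpose an array of 32-bit ints into four byte planes: all the byte-3s,
// then all the byte-2s, and so on.  Mostly-similar values produce long runs
// of identical bytes within each plane.
std::vector<uint8_t> splitBytePlanes(const std::vector<uint32_t>& ints) {
    std::vector<uint8_t> out;
    out.reserve(ints.size() * 4);
    for (int plane = 3; plane >= 0; --plane)       // high byte plane first
        for (uint32_t v : ints)
            out.push_back(uint8_t(v >> (plane * 8)));
    return out;
}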
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you may still have largish numeric differences, but if those large numeric differences usually correspond to only one or two bits differing, you may be better off with a scheme where you create a byte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences; perhaps use an 8-bit byte, something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
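A hedged sketch of encoding one integer under that byte layout; the bit assignments are my own reading of the list above, and a real implementation would need the matching decoder:
#include <bit>
#include <cstdint>
#include <vector>
// One possible reading of the layout: bit 7 set on the first byte means
// "a complete 4-byte integer follows"; otherwise bits 0-5 hold a changed
// bit number and bit 6 means "another difference byte follows".
void encodeNext(uint32_t prev, uint32_t value, std::vector<uint8_t>& out) {
    uint32_t diff = prev ^ value;
    int changed = std::popcount(diff);
    if (changed == 0 || changed > 4) {             // cheaper (or simpler) to store it whole
        out.push_back(0x80);
        for (int b = 0; b < 4; ++b) out.push_back(uint8_t(value >> (8 * b)));
        return;
    }
    while (diff) {
        int bit = std::countr_zero(diff);          // lowest changed bit position
        diff &= diff - 1;                          // clear that bit
        out.push_back(uint8_t(bit) | (diff ? 0x40 : 0x00));
    }
}
Identical consecutive values are stored whole here only to keep the sketch simple; a real scheme would want a cheaper "unchanged" code.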
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer, which would be:
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.