Highly compressing a grid of numbers - c++

I have a square grid which contains numbers and I need to compress it a lot so it can easily be transferred over a network. For instance, I need to be able to compress a 40x40 grid into less than 512 bytes, regardless of the values of the numbers in the grid. That is my basic requirement.
Each of the grid's cells contains a number from 0-7, so each cell can fit in 3 bits.
Does anyone know of a good algorithm that can achieve what I want?

You can encode your information differently. You don't need to assign all of the numbers 0 to 7 a code with the same number of bits; you can assign codes based on how often each number appears in the sequence.
First, read the whole sequence, counting the number of appearances of every number.
Based on those counts you can assign a code to each number.
If you assign the codes using, for example, Huffman coding, the result is a prefix code, meaning no extra separator is needed between numbers.
There are variations you can introduce to the algorithm, based on your test results, to fine-tune the compression ratio.
I used this technique in a university project and it gives good results in general. At worst it should approximate your theoretical 3 bits per symbol, and it can do much better if the distribution of probabilities helps.
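To make that concrete, here is a minimal sketch (my illustration, not code from the project mentioned above) of deriving Huffman code lengths for the eight symbols from their counts; turning the lengths into actual prefix codewords, e.g. canonically, is a small extra step:

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Sketch: derive Huffman code lengths for symbols 0..7 from their occurrence counts.
// More frequent symbols end up with shorter codes.
std::vector<int> huffmanCodeLengths(const std::vector<std::size_t>& counts) {
    struct Node { std::size_t weight; int symbol, left, right; };   // symbol < 0 means internal node
    std::vector<Node> nodes;
    auto heavier = [&nodes](int a, int b) { return nodes[a].weight > nodes[b].weight; };
    std::priority_queue<int, std::vector<int>, decltype(heavier)> heap(heavier);

    for (int s = 0; s < static_cast<int>(counts.size()); ++s) {
        if (counts[s] == 0) continue;
        nodes.push_back({counts[s], s, -1, -1});
        heap.push(static_cast<int>(nodes.size()) - 1);
    }
    std::vector<int> lengths(counts.size(), 0);
    if (nodes.empty()) return lengths;

    while (heap.size() > 1) {                       // repeatedly merge the two lightest trees
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        nodes.push_back({nodes[a].weight + nodes[b].weight, -1, a, b});
        heap.push(static_cast<int>(nodes.size()) - 1);
    }
    // The depth of each leaf in the final tree is that symbol's code length.
    std::vector<std::pair<int, int>> stack{{static_cast<int>(nodes.size()) - 1, 0}};
    while (!stack.empty()) {
        auto [i, depth] = stack.back();
        stack.pop_back();
        if (nodes[i].symbol >= 0)
            lengths[nodes[i].symbol] = depth > 0 ? depth : 1;       // single-symbol edge case
        else {
            stack.push_back({nodes[i].left, depth + 1});
            stack.push_back({nodes[i].right, depth + 1});
        }
    }
    return lengths;
}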

What you want to do is perform a Burrows–Wheeler transform on your data and then compress it. Run-length encoding will be enough in this case.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
This will likely outperform Huffman coding in your case.
It is true that in some cases you will need more than 512 bytes. So in your protocol just make an exception for "perverse" grids. But in the general case you should be under 512 easily.
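For reference, the run-length encoding step by itself can be as simple as the following sketch, which emits (count, value) byte pairs over the transformed data; the pair format and the 255-byte run cap are arbitrary choices here:

#include <cstddef>
#include <vector>

// Minimal RLE sketch: emit (run length, value) byte pairs, capping runs at 255.
std::vector<unsigned char> runLengthEncode(const std::vector<unsigned char>& in) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char value = in[i];
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == value && run < 255)
            ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(value);
        i += run;
    }
    return out;
}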

As others have noted, the problem as stated is not possible, since 600 bytes are required to represent all possible grids: 40 rows, 40 columns, 3 bits per cell, and 8 bits per byte gives 40 * 40 * 3 / 8 = 600. As Kerrek SB explained in the comments, you pack 8 cells into 3 bytes.
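For reference, a sketch of that straight 3-bits-per-cell packing, under which a 40x40 grid always occupies exactly 600 bytes; the function and parameter names are just illustrative:

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack cell values 0..7 (3 bits each) densely: 8 cells occupy exactly 3 bytes.
std::vector<uint8_t> packCells(const std::vector<uint8_t>& cells) {
    std::vector<uint8_t> out((cells.size() * 3 + 7) / 8, 0);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        std::size_t bit = i * 3;                        // absolute bit position of this cell
        for (int b = 0; b < 3; ++b)
            if (cells[i] & (1u << b))
                out[(bit + b) / 8] |= static_cast<uint8_t>(1u << ((bit + b) % 8));
    }
    return out;
}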
In your own comments, you mentioned this was game state being transferred over the network. Assuming you have a mechanism to assure reliable transport of the data, then if there is a reasonable limit to the number of cells that can change between updates, or if you are allowed to send your updates when a certain number of cells change, you can achieve a representation in 512 bytes. If you use 1 bit to represent whether or not a cell has changed, you would need 200 bytes. Then, you have 312 remaining bytes to represent the new values of the cells that have changed. So, you can represent up to 312*8/3 = 832 modified cells.
As an aside, in less than 600 bytes this representation can handle up to (4800 - 1600) / 3 = 1066 changed cells.

Minimum storage space for an ordered list

I am seeking advice on achieving the most efficient storage of an ordered list, that is, the minimum storage for such a list.
An ordered list of 256 unique items, where each item is a unique number from 0 to 255, would normally require 2^16 bits of data for storage: 2^8 places, each place holding a 2^8 value.
However this information ought to be storable in near 2^15 bits.
The second item, rather than being in the 2nd place of 256, can be viewed as the next of the remaining 255, the next item as the next of the remaining 254, and so on.
This is a continuation of not needing to store the detail of the last item in a sorted list because that item must be in the last place by default.
In this case you can simply see that there are 2^8-1 places, each place holding a 2^8 value, which is less than 2^16.
So how does this get down to 2^15+1 bits of storage? Or is there a proof that says otherwise? If there is a proof, I would hope it doesn't say 2^16 bits of storage are needed, as I have just shown that that is wrong!
I am hopefully just unaware of the terminology to identify work on this subject.
Can anyone advise of work on the matter?
Thank you for your time.
Glenn
Upon clarification of the question as storage of some particular permutation of 256 items (being the 8-bit numbers from 0 to 255 in particular), I have updated my answer. The prior discussion is below for posterity.
Answer
1684 bits.
Explanation
In this case, the clearest analysis comes through encoding and informational entropy. Again, we use the pigeonhole principle: in order to uniquely determine a particular permutation, we must have at least as many encodings as we have possible messages to be encoded.
An example may be helpful: consider a list of 256 numbers, each of which is an 8-bit number. The first item has 256 possible values, as does the second, as does the third, and so on. Overall, we have 256^256 possible messages, so we need at least 256^256 possible encodings. To determine the number of bits needed, we can simply take the base 2 logarithm of this: log2(256^256) = 256 * log2(256) = 256 * log2(2^8) = 256 * 8 = 2048 = 2^11, so to encode this list we only need 2^11, or 2048, bits. You may note this is the same as taking 8 bits per item and multiplying it by the number of items. Your original question overstated the storage needed, as it supposed that each item requires 2^8 bits, i.e. a 256-bit integer, which could store values from 0 to ~10^77.
With this understanding, we turn our attention to the problem at hand. There are 256 possibilities for the first item, then 255 possibilities for the second item, 254 possibilities for the third item, etc, until there is only 1 possibility for the last item. Overall, we have 256! possibilities, so we need at least 256! encodings. Again, we use the base 2 logarithm to determine how many bits we need, so we need log2(256!) bits. A nice property of logarithms is that they turn products into sums, so log2(256!) = log2(256) + log2(255) + log2(254) + ... + log2(2) + log2(1). This is analogous to using 8 bits for each of the 256 items, but here as each item has progressively less information, it requires fewer bits. Also note that log2(1) is 0, which corresponds to your observation that you don't need any information to encode the last item. Overall, when we perform this sum, we end up with 1683.996..., so we need 1684 bits to encode these ordered lists. Some variable-length encodings could get lower, but they can never go lower than log2(256!) bits on average.
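If you want to check the 1684-bit figure yourself, a throwaway computation like this does it:

#include <cmath>
#include <cstdio>

// Sum log2(k) for k = 1..256 to get log2(256!), the information content of a permutation.
int main() {
    double bits = 0.0;
    for (int k = 1; k <= 256; ++k)
        bits += std::log2(static_cast<double>(k));
    std::printf("log2(256!) = %.3f -> %d bits\n", bits, static_cast<int>(std::ceil(bits)));
    return 0;
}

It prints approximately 1683.996, which rounds up to 1684 bits.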
Coming up with an encoding that uses 1684 bits is not simple, but here's one method that's more efficient than the original full storage. We can note that the first 128 items each have between 129 and 256 possibilities, and encode each of these items with 8 bits. The next 64 items each have between 65 and 128 possibilities, so we can encode each of these items with 7 bits. Continuing on, we end up using
(128 * 8) + (64 * 7) + (32 * 6) + (16 * 5) + (8 * 4) + (4 * 3) + (2 * 2) + (1 * 1) + (1 * 0) = 1793 bits to store the list.
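Here is a sketch of that 1793-bit method (my own illustration): for each position, write the chosen value's index among the values not yet used, in just enough bits to cover the remaining possibilities. Decoding is symmetric, reading the same widths back.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode a permutation of 0..255: at position pos there are 256 - pos candidates left,
// so the index of the chosen value among them needs ceil(log2(256 - pos)) bits.
// Summed over all positions this is the 1793 bits computed above.
std::vector<bool> encodePermutation(const std::vector<uint8_t>& perm) {   // perm holds 0..255 once each
    std::vector<uint8_t> remaining;
    for (int v = 0; v < 256; ++v) remaining.push_back(static_cast<uint8_t>(v));
    std::vector<bool> bits;
    for (std::size_t pos = 0; pos < perm.size(); ++pos) {
        std::size_t idx = 0;
        while (remaining[idx] != perm[pos]) ++idx;                        // index among unused values
        remaining.erase(remaining.begin() + idx);
        int width = static_cast<int>(std::ceil(std::log2(256.0 - pos))); // 8, 8, ..., 7, ..., 1, 0
        for (int b = width - 1; b >= 0; --b)
            bits.push_back((idx >> b) & 1u);
    }
    return bits;
}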
Pre-clarification discussion
If all you're ever interested in encoding is an ordered list of 256 unique items, where each item is an 8-bit integer, then you can do it in 1 bit: 0 means you have that list, 1 means you don't, because there's only one possible list satisfying those criteria.
If you're trying to store anything in memory, you need at least as many configurations of that memory as there are different options (otherwise by the pigeonhole principle there would be at least two options you couldn't differentiate). Assuming by "ordered" you mean that they are strictly increasing or decreasing, an n-element ordered list of 8-bit integers, without repetition, has 256 choose n possible options (as there's only one possible configuration, the ordered one). Summing 256 choose n over all possible values of n (i.e. 0 to 256), gives 2^256, or 2^(2^8). Therefore, a perfect encoding scheme could use as few as 2^8 bits to store this particular kind of list, but couldn't encode any other kind of list.
EDIT: If you want to read more about this sort of stuff, read up on information theory.
EDIT: An easier way to think about this encoding is like this: We know the list is ordered, so if we know what items are in it then we know what order they're in, so we only need to know which items are in the list. There's 256 possible items (0 through 255), and if we assume the items in the list are unique then each item is either in the list, or it isn't. For each item, we use 1 bit to store whether or not it's in the list (so bit 0 records if the list contains 0, bit 1 records if the list contains 1, etc, bit 255 records if the list contains 255). Tada, we've stored all of the information about this 256 element array of bytes in only 256 = 2^8 bits.
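A sketch of that bit-per-value encoding (the function names are mine):

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// A sorted list of unique bytes is fully described by which values it contains,
// so 256 bits suffice: bit v records "value v is in the list".
std::bitset<256> encodeSortedUniqueList(const std::vector<uint8_t>& sorted) {
    std::bitset<256> present;
    for (uint8_t v : sorted) present.set(v);
    return present;
}

std::vector<uint8_t> decodeSortedUniqueList(const std::bitset<256>& present) {
    std::vector<uint8_t> out;
    for (int v = 0; v < 256; ++v)
        if (present.test(static_cast<std::size_t>(v)))
            out.push_back(static_cast<uint8_t>(v));
    return out;                                    // comes back in ascending order by construction
}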
EDIT: Let's examine an analogous situation. We have an ordered, unique list of up to 4 items, each of which is 2 bits. We'll write out all of the possible circumstances: [], [0], [1], [2], [3], [0,1], [0,2], [0,3], [1,2], [1,3], [2,3], [0,1,2], [0,1,3], [0,2,3], [1,2,3], [0,1,2,3]. These are the only possible lists where each element is two bits, the elements are unique, and the elements are in ascending order. Notice that I can't actually make it more efficient just by leaving the 3 off of [0,1,2,3], because I also need to distinguish it from [0,1,2]. The thing is, asking how much space you need to "store" something in isolation from its context is almost unanswerable. If all you want is to store enough information to recover it (i.e. you want lossless compression), and if you presume you know the properties, you can get your compression ratio pretty much as low as you want. For example, if you gave me an ordered list containing every element from 0 to 1,000,000 exactly once, then even though storing that list directly in memory requires around 2^25 bits (a million 32-bit integers), you can recover it from the known properties plus the two numbers 0 and 1,000,000, for a total of about 40 bits.

Saving binary data into a file in C++

My algorithm produces a stream of 9-bit and 17-bit values, and I need a way to store this data in a file, but I can't just store the 9-bit values as int and the 17-bit values as int32_t.
For example, if my algorithm produces 10 of the 9-bit values and 5 of the 17-bit values, the output file size needs to be 22 bytes.
Also, one of the big problems to solve is that the output file can be very big and its size is unknown in advance.
The only idea I have right now is to use a bool* array.
If you have to save dynamic bits, then you should probably save two values: the first being either the number of bits (if the bits are consecutive from 0 to x) or a bitmask saying which bits are valid; the second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits and they consist of an unknown number of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file; you need the lengths. If there are only two possible sizes, a single bit is enough: 0 means a 9-bit entity follows, 1 means a 17-bit entity follows.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9) + 5*(1+17) = 190 bits, which is about 24 bytes. The remaining 2 bits need to be padded with 0s so that you align to a byte boundary. The fact that you will go on reading the file as if there were another entity (because you said you don't know how long the file is) shouldn't be a problem, because such padding will always be less than 9 bits; upon reaching the end of the file, you can throw away the last incomplete read.
This approach does require bit-level manipulation of a byte-level stream, which means careful masking and shifting. Base64 does exactly that, only it is simpler than your case, since it consists only of fixed 6-bit entities stored in a text file.
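Here is a rough sketch of such a bit-level writer for the tagged 9/17-bit layout described above; the class and method names are just illustrative:

#include <cstdint>
#include <vector>

// Sketch of the tagged layout above: each entity is preceded by one flag bit
// (0 = 9-bit value follows, 1 = 17-bit value follows), everything packed bit by bit.
class BitWriter {
public:
    void writeBits(uint32_t value, int count) {           // most significant of 'count' bits first
        for (int b = count - 1; b >= 0; --b) {
            if (bitPos_ == 0) bytes_.push_back(0);
            if ((value >> b) & 1u)
                bytes_.back() |= static_cast<uint8_t>(0x80u >> bitPos_);
            bitPos_ = (bitPos_ + 1) % 8;
        }
    }
    void writeEntity(uint32_t value, bool is17Bit) {
        writeBits(is17Bit ? 1u : 0u, 1);                   // the flag bit
        writeBits(value, is17Bit ? 17 : 9);
    }
    const std::vector<uint8_t>& bytes() const { return bytes_; }   // padded with 0 bits at the end
private:
    std::vector<uint8_t> bytes_;
    int bitPos_ = 0;                                       // next free bit within bytes_.back()
};

Writing 10 nine-bit and 5 seventeen-bit entities through this produces the 190 bits (24 bytes after zero padding) computed above.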

seeking a better way to code and compress numbers

I have 13 numbers drawn from a set containing 13 types of item, with 4 items of each type, so 52 items in total. Numbering the items 1 through 13, there are four "1"s, four "2"s, ..., four "13"s in the set. The 13 numbers drawn from the set are random, and the whole process is repeated millions of times or more, so I need an efficient way to store the 13 numbers. I was thinking of using some sort of coding method to compress the 13 integers into bits. For example, I could count how many "1"s, "2"s, ... were drawn, code the count for each item with 2 bits, and use 1 more bit to denote whether the item was drawn at all. That is 3 bits per item, so 13 items cost 39 bits, which in practice still takes 8 bytes to store.
But that is still too much, since I am talking about millions or billions of draws and each set has to be written to a file later: at 8 bytes each, that is about 80 GB of data, and if I can cut it by half I save 40 GB. Any idea how to compress this structure more efficiently? I also thought of using 5 bytes instead, but then I need to take care of the mixed types (one int plus one char). Is there any library in C++ that can easily do the coding/compressing for me?
Thanks.
Google's Protocol Buffers can store integers in fewer bits, depending on their values. It might reduce your storage significantly. See http://code.google.com/p/protobuf/
The actual protocol is described here: https://developers.google.com/protocol-buffers/docs/encoding
As for compression, have you looked at how zlib handles your data?
With your scheme, every hand of 39 bits represented by 8 bytes of 64 bits will have 25 bits wasted, about 40%.
If you batch hands together, you can represent them without wasting those bits.
39 and 64 have no common factors, so their lowest common multiple is simply the product 39 * 64 = 2496 bits, or 312 bytes. This holds 64 hands and is about 60% of the size of your current scheme.
Try googling LZ77 and LZW compression.
Maybe a bit more sophisticated than you're looking for, but check out HDF5.

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation), but the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd bytes
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry it occurs in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
Another approach might be to write the coordinates as deltas instead of absolute numbers: relative offsets from some location chosen to be as close as possible to all the entries. The deltas will be smaller numbers, which can be stored in fewer bits.
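A sketch of that delta idea, assuming the points arrive as pairs of 32-bit fixed-point coordinates; the Point struct and the choice of a single reference point are illustrative:

#include <cstdint>
#include <vector>

struct Point { int32_t x, y; };

// Store one reference point, then only the (usually small) offsets from it.
std::vector<int32_t> deltaEncode(const std::vector<Point>& pts, Point reference) {
    std::vector<int32_t> out;
    out.reserve(pts.size() * 2);
    for (const Point& p : pts) {
        out.push_back(p.x - reference.x);
        out.push_back(p.y - reference.y);
    }
    return out;
}

The resulting small deltas can then be packed into fewer bits directly, or simply handed to zlib, which does better on many small, similar values.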
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all types other than 000 use signed displacements from the previous record. The number after the bit code is the size of the position record in bytes.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits each for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests combining BWT with algorithms other than zlib, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, universal codes are a family of compression techniques that work well when the probability distribution is skewed towards small numbers. They would have to be tweaked for signed numbers, e.g. by encoding abs(x)<<1 + (x < 0 ? 1 : 0).
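A sketch of one such universal code, Elias gamma, combined with the sign mapping suggested above (gamma codes only cover n >= 1, so 1 is added before encoding; the helper names are mine, and small deltas are assumed so the mapped value stays well below 2^32):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Elias gamma: floor(log2(n)) zero bits, then n in binary (n >= 1).
void eliasGammaAppend(uint32_t n, std::vector<bool>& bits) {
    int top = 31;
    while (((n >> top) & 1u) == 0) --top;                  // position of highest set bit
    for (int i = 0; i < top; ++i) bits.push_back(false);   // 'top' leading zeros
    for (int i = top; i >= 0; --i) bits.push_back((n >> i) & 1u);
}

// Map a signed value with the abs(x)<<1 + sign trick from the answer, then gamma-code it.
void encodeSigned(int32_t x, std::vector<bool>& bits) {
    uint32_t mapped = (static_cast<uint32_t>(std::abs(x)) << 1) + (x < 0 ? 1u : 0u);
    eliasGammaAppend(mapped + 1, bits);
}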
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0011
0000
1000
A bit of pseudocode, written here as C++ (assuming compressed and uncompressed are arrays of n unsigned integers):
compressed[0] = uncompressed[0];
for (size_t i = 1; i < n; ++i)
    compressed[i] = uncompressed[i - 1] ^ uncompressed[i];
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0];
for (size_t i = 1; i < n; ++i)
    uncompressed[i] = uncompressed[i - 1] ^ compressed[i];
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).
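A sketch of that difference-plus-escape-bit idea, which is essentially a varint encoding (the names are mine, and a non-decreasing input is assumed; signed differences would need a sign mapping first):

#include <cstdint>
#include <vector>

// Store a non-negative value in 7-bit groups, using the high bit of each byte
// to flag that another byte follows.
void appendVarint(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80u) {
        out.push_back(static_cast<uint8_t>((value & 0x7Fu) | 0x80u));   // more bytes follow
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));                         // final byte, high bit clear
}

std::vector<uint8_t> encodeDeltas(const std::vector<uint32_t>& values) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t v : values) {
        appendVarint(v - prev, out);   // assumes a non-decreasing sequence
        prev = v;
    }
    return out;
}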
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break your ints up each into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
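A sketch of this byte-plane splitting, assuming 32-bit integers (whether to emit low or high bytes first is an arbitrary choice here):

#include <cstdint>
#include <vector>

// Emit the lowest byte of every integer, then the next byte of every integer, and so on.
// Similar integers then yield long runs of identical bytes within each plane.
std::vector<uint8_t> splitIntoBytePlanes(const std::vector<uint32_t>& ints) {
    std::vector<uint8_t> out;
    out.reserve(ints.size() * 4);
    for (int plane = 0; plane < 4; ++plane)
        for (uint32_t v : ints)
            out.push_back(static_cast<uint8_t>((v >> (8 * plane)) & 0xFFu));
    return out;
}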
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If various bits differ, you may still have largish numeric differences, but if those large differences usually correspond to only one or two bits differing, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry use 0 or more bytes to indicate which bits should be flipped, storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could bring the raw number of bytes needed to represent an integer down to something close to 2 on average, with most bytes coming from a limited set (0-32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8-bit byte laid out something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
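For what it's worth, a sketch of the scheme analysed above; it inherits the same assumption that at most 3 bits change between consecutive integers, so a real format would need an escape for larger differences:

#include <cstddef>
#include <cstdint>
#include <vector>

// Append 'count' bits of 'value', most significant first.
void appendBits(uint32_t value, int count, std::vector<bool>& bits) {
    for (int b = count - 1; b >= 0; --b) bits.push_back((value >> b) & 1u);
}

// First integer verbatim (32 bits), then per integer: a 2-bit count of changed bits
// followed by a 5-bit position for each changed bit.
std::vector<bool> encodeBitChanges(const std::vector<uint32_t>& ints) {
    std::vector<bool> bits;
    if (ints.empty()) return bits;
    appendBits(ints[0], 32, bits);
    for (std::size_t i = 1; i < ints.size(); ++i) {
        uint32_t diff = ints[i - 1] ^ ints[i];
        std::vector<int> positions;
        for (int b = 0; b < 32; ++b)
            if ((diff >> b) & 1u) positions.push_back(b);
        // assumes positions.size() <= 3; an escape code would be needed otherwise
        appendBits(static_cast<uint32_t>(positions.size()), 2, bits);
        for (int p : positions) appendBits(static_cast<uint32_t>(p), 5, bits);
    }
    return bits;
}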
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.