Shannon's entropy formula. Help my confusion - compression

my understanding of the entropy formula is that it's used to compute the minimum number of bits required to represent some data. It's usually worded differently when defined, but the previous understanding is what I relied on until now.
Here's my problem. Suppose I have a sequence of 100 '1' followed by 100 '0' = 200 bits. The alphabet is {0,1}, base of entropy is 2. Probability of symbol "0" is 0.5 and "1" is 0.5. So the entropy is 1 or 1 bit to represent 1 bit.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
I'm using: http://en.wikipedia.org/wiki/Information_entropy as reference at the moment.
Where did I go wrong? Is it the probability assigned to symbols? I don't think it's wrong. Or did I get the connection between compression and entropy wrong? Anything else?
Thanks.
Edit
Following some of the answers my followup are: would you apply the entropy formula to a particular instance of a message to try to find out its information content? Would it be valid to take the message "aaab" and say the entropy is ~0.811. If yes then what's the entropy of 1...10....0 where 1s and 0s are repeated n times using the entropy formula. Is the answer 1?
Yes I understand that you are creating a random variable of your input symbols and guessing at the probability mass function based on your message. What I'm trying to confirm is the entropy formula does not take into account the position of the symbols in the message.

Or did I get the connection between compression and entropy wrong?
You're pretty close, but this last question is where the mistake was. If you're able to compress something into a form that was smaller than its original representation, it means that the original representation had at least some redundancy. Each bit in the message really wasn't conveying 1 bit of information.
Because redundant data does not contribute to the information content of a message, it also does not increase its entropy. Imagine, for example, a "random bit generator" that only returns the value "0". This conveys no information at all! (Actually, it conveys an undefined amount of information, because any binary message consisting of only one kind of symbol requires a division by zero in the entropy formula.)
By contrast, had you simulated a large number of random coin flips, it would be very hard to reduce the size of this message by much. Each bit would be contributing close to 1 bit of entropy.
When you compress data, you extract that redundancy. In exchange, you pay a one-time entropy price by having to devise a scheme that knows how to compress and decompress this data; that itself takes some information.
However you can run-length encode it with something like 100 / 1 / 100 / 0 where it's number of bits to output followed by the bit. It seems like I have a representation smaller than the data. Especially if you increase the 100 to much larger number.
To summarize, the fact that you could devise a scheme to make the encoding of the data smaller than the original data tells you something important. Namely, it says that your original data contained very little information.
Further reading
For a more thorough treatment of this, including exactly how you'd calculate the entropy for any arbitrary sequence of digits with a few examples, check out this short whitepaper.

Have a look at Kolmogorov complexity
The minimum number of bits into which a string can be compressed without losing information. This is defined with respect to a fixed, but universal decompression scheme, given by a universal Turing machine.
And in your particular case, don't restrict yourself to alphabet {0,1}. For your example use {0...0, 1...1} (hundred of 0's and hundred of 1's)

Your encoding works in this example, but it is possible to conceive an equally valid case: 010101010101... which would be encoded as 1 / 0 / 1 / 1 / ...
Entropy is measured across all possible messages that can be constructed in the given alphabet, and not just pathological examples!

John Feminella got it right, but I think there is more to say.
Shannon entropy is based on probability, and probability is always in the eye of the beholder.
You said that 1 and 0 were equally likely (0.5). If that is so, then the string of 100 1s followed by 100 0s has a probability of 0.5^200, of which -log(base 2) is 200 bits, as you expect. However, the entropy of that string (in Shannon terms) is its information content times its probability, or 200 * 0.5^200, still a really small number.
This is important because if you do run-length coding to compress the string, in the case of this string it will get a small length, but averaged over all 2^200 strings, it will not do well. With luck, it will average out to about 200, but not less.
On the other hand, if you look at your original string and say it is so striking that whoever generated it is likely to generate more like it, then you are really saying its probability is larger than 0.5^200, so you are making a different assumptions about the original probability structure of the generator of the string, namely that it has lower entropy than 200 bits.
Personally, I find this subject really interesting, especially when you look into Kolmogorov (Algorithmic) information. In that case, you define the information content of a string as the length of the smallest program that could generate it. This leads to all sorts of insights into software engineering and language design.
I hope that helps, and thanks for your question.

Related

Does log2 (n)*(x) set compression limits.

I may get all kinds of flags and penalties thrown at me for this. So please be patient. 2 questions
If the minimal number of bits to represent an arbitrary number of decimals is calculated by log2 (n)*(x)....n is range x is length, then you should be able to calculate max compression by turning the file into decimals by the>>> bin to dec.?
Is this result a law that one can not compress below the theoretical min compression limit, or is it an approximated limit?
Jon Hutton
It's actually a bit (ha) trickier. That formula assumes that the number is drawn from a uniform distribution, which is often not the case, but notably is the case for what is commonly called "random data" (though that is an inaccurate name, since data may be random but drawn from a non-uniform distribution).
The entropy H of X in bits is given by the formula:
H(X) = - sum[i](P(x[i]) log2(P(x[i])))
Where P gives the probability of every value x[i] that X may take. The bounds of i are implied and irrelevant, impossible options have a probability of zero anyway. In the uniform case, P(x[i]) is (by definition) 1/N for any possible x[i], we have H(X) = -N * (1/N log2(1/N)) = -log2(1/N) = log2(N).
The formula should in general not simply be multiplied by the length of the data, that only works if all symbols are independent and identically distributed (so for example on your file with IID uniform-random digits, it does work). Often for meaningful data, the probability distribution for a symbol depends on its context, and indeed a lot of compression techniques are aimed at exploiting this.
There is no law that says you cannot get lucky and thereby compress an individual file to fewer bits than are suggested by its entropy. You can arrange for it to be possible on purpose (but it won't necessarily happen), for example, let's say we expect that any letter is equally probable, but we decide to go against the flow and encode an A with the single bit 0, and any other letter as a 1 followed by 5 bits that indicate which letter it is. This is obviously a bad encoding given the expectation, there are only 26 letters and they're equally probable but we're using more than log2(26) ≈ 4.7 bits on average, the average would be (1 + 25 * 6)/26 ≈ 5.8. However, if by some accident we happen to actually get an A (there is a chance of 1/26th that this happens, the odds are not too bad), we compress it to a single bit, which is much better than expected. Of course one cannot rely on luck, it can only come as a surprise.
For further reference you could read about entropy (information theory) on Wikipedia.

Maximum message length for CRC codes? [duplicate]

I've seen 8-bit, 16-bit, and 32-bit CRCs.
At what point do I need to jump to a wider CRC?
My gut reaction is that it is based on the data length:
1-100 bytes: 8-bit CRC
101 - 1000 bytes: 16-bit CRC
1001 - ??? bytes: 32-bit CRC
EDIT:
Looking at the Wikipedia page about CRC and Lott's answer, here' what we have:
<64 bytes: 8-bit CRC
<16K bytes: 16-bit CRC
<512M bytes: 32-bit CRC
It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check
The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the possibility of multiple messages having the same hash value goes up higher and higher.
A 16-bit CRC, similarly, gives you one of the 65,536 available hash values. What are the odds of any two messages having one of these values?
A 32-bit CRC gives you about 4 billion available hash values.
From the wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.
The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:
The expected bit error rate of the channel.
Whether the errors tend to occur in bursts or tend to be spread out (burst is common)
The length of the data to be protected - maximum length, minimum length and distribution.
The paper Cyclic Redundancy Code Polynominal Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, publised in the proceedings of the 2004 International Conference on Dependable Systems and Networks gives a very good overview and makes several recomendations. It also provides a bibliography for further understanding.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have a one which is massively different. Given two inputs which are massively different, the possibility of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.
With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.
If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
I think the size of the CRC has more to do with how unique of a CRC you need instead of of the size of the input data. This is related to the particular usage and number of items on which you're calculating a CRC.
The CRC should be chosen specifically for the length of the messages, it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf
Here is a nice "real world" evaluation of CRC-N
http://www.backplane.com/matt/crc64.html
I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)
When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.
Purposely manipulated data, to fake a match, is usually done by adding extra-data until the CRC matches a target. However, that results in a data-size that no-longer matches. Attempting to brute-force, or cycle through random, or sequential data, of the same exact size, would leave a real narrow collision-rate.
You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.
The point you would want to think about going larger, is when you start to see many collisions which can not be "confirmed" as "originals". (When they both have the same data-size, and (when tested backwards, they have a matching CRC. Reverse/byte or reverse/bits, or bit-offsets)
In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
You can use a CRC-8 to index the whole internet, and divide everything into one of N-catagories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N-directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...
Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)
You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correction of single bit errors is limited to the number of distinct values the CRC can take, so for 8 bits, that would 256; for 16 bits, 65535; etc. 2^n; In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example what I call the 'Y5' polynomial, the 0x5935 polynomial only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.
The number of bits you can correct with forward error correction is also limited by the Hamming Distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.
With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.

32 bit CRC with some inputs set to zero. Is this less accurate than dummy data?

Sorry if I should be able to answer this simple question myself!
I am working on an embedded system with a 32bit CRC done in hardware for speed. A utility exists that I cannot modify that initially takes 3 inputs (words) and returns a CRC.
If a standard 32 bit was implemented, would generating a CRC from a 32 bit word of actual data and 2 32 bit words comprising only of zeros produce a less reliable CRC than if I just made up/set some random values for the last 2 32?
Depending on the CRC/polynomial, my limited understanding of CRC would say the more data you put in the less accurate it is. But don't zero'd data reduce accuracy when performing the shifts?
Using zeros will be no different than some other value you might pick. The input word will be just as well spread among the CRC bits either way.
I agree with Mark Adler that zeros are mathematically no worse than other numbers. However, if the utility you can't change does something bad like set the initial CRC to zero, then choose non-zero pad words. An initial CRC=0 + Data=0 + Pads=0 produces a final CRC=0. This is technically valid, but routinely getting CRC=0 is undesirable for data integrity checking. You could compensate for a problem like this with non-zero pad characters, e.g. pad = -1.

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will be only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with a shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte)
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each (a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode abs(x)<<1 + (x < 0 ? 1 : 0)).
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0010
0000
1000
a bit of pseudo code
compressed[0] = uncompressed[0]
loop
compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
loop
uncompressed[i] = uncompressed[i-1] ^ compressed[i]
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break your ints up each into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you still may have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create ahebyte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in the byte, with a sentinel (say 32) to indicate when you're done. This could result the raw number of bytes needed to represent and integer to something close to 2 on average, which most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8 bit byte something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.