I'm writing floating-point numbers to a file, but there are two different ways of writing these numbers and I'm wondering which to use.
The two choices are:
write the raw bits of the in-memory representation to the file
write the ASCII representation of the number to the file
Option 1 seems like it would be more practical to me, since I'm truncating each float to 4 bytes. And parsing each number can be skipped entirely when reading. But in practice, I've only ever seen option 2 used.
The data in question is 3D model information, where small file sizes and quick reading can be very advantageous, but again, no existing 3D model format does this that I know of, and I imagine there must be a good reason behind it.
My question is: what reasons are there for choosing the written-out (text) form of numbers instead of the bit representation? And are there situations where using the binary form would be preferred?
First of all, floats are 4 bytes on any architecture you might encounter normally, so nothing is "truncated" when you write the 4 bytes from memory to a file.
As for your main question, many regular file formats are designed for "interoperability" and ease of reading/writing. That's why text, which is an almost universally portable representation (character encoding issues notwithstanding), is used most often.
For example, it is very easy for a program to read the string "123" from a text file and know that it represents the number 123.
(But note that text itself is not a format. You might choose to represent all your data elements as ASCII/Unicode/whatever strings of characters, and put all these strings along with each other to form a text file, but you still need to specify exactly what each element means and what data can be found where. For example, a very simplistic text-based 3D triangle mesh file format might have the number of triangles in the mesh on the first line of the file, followed by N lines, each containing three triplets of real numbers: the 9 numbers required for the X, Y, Z coordinates of the three vertices of one triangle.)
Binary formats, on the other hand, usually store the data elements in the same format as they are found in computer memory. This means an integer is represented with a fixed number of bytes (1, 2, 4 or 8, usually in "two's complement" format) and a real number is represented by 4 or 8 bytes in IEEE 754 format. (Note that I'm omitting a lot of details for the sake of staying on point.)
Main advantages of a binary format are:
They are usually smaller in size. A 32-bit integer written as an ASCII string can take up to 11 bytes (e.g. -1000000000), but in binary it always takes up 4 bytes. And smaller means faster to transfer (over a network, from disk to memory, etc.) and easier to store.
Each data element is faster to read. No complicated parsing is required. If the data element happens to be in the exact format/layout that your platform/language can work with, then you just need to transfer the few bytes from disk to memory and you are done.
Even large and complex data structures can be laid out on disk in exactly the same way as they would have been in memory, and then all you need to do to "read" that format would be to get that large blob of bytes (which probably contains many many data elements) from disk into memory, in one easy and fast operation, and you are done.
But that 3rd advantage requires that you match the layout of data on disk exactly (bit for bit) with the layout of your data structures in memory. This means that, almost always, the file format will work only with your code, and it can break even if you just rearrange some things in your own code. This means that it is not at all portable or interoperable. But it is damned fast to work with!
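To make the first two advantages concrete, here is a minimal sketch in C++ (the Vertex struct and file paths are made up for illustration; error handling is omitted). The binary version dumps the in-memory bytes verbatim, while the text version must format every number on write and parse it again on read:

#include <cstdio>
#include <vector>

struct Vertex { float x, y, z; };   // hypothetical 12-byte layout

// Binary: write the raw bytes of the whole array in one call.
void write_binary(const char* path, const std::vector<Vertex>& verts)
{
    std::FILE* f = std::fopen(path, "wb");
    std::fwrite(verts.data(), sizeof(Vertex), verts.size(), f);
    std::fclose(f);
}

// Text: every number has to be formatted (and later parsed back).
void write_text(const char* path, const std::vector<Vertex>& verts)
{
    std::FILE* f = std::fopen(path, "w");
    for (const Vertex& v : verts)
        std::fprintf(f, "%f %f %f\n", v.x, v.y, v.z);
    std::fclose(f);
}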
There are disadvantages to binary formats too:
You cannot view or edit or make sense of them in simple, generic software like a text editor anymore. You can open any XML, JSON or config file in any text editor and make some sense of it quite easily, but not a JPEG file.
You will usually need more specific code to read in/write out a binary format than a text format, not to mention a specification that documents what every bit of the file means. Text files are generally more self-explanatory and obvious.
In some (many) languages (scripting and "higher-level" languages) you usually don't have access to the bytes that make up an integer or a float, neither to read them nor to write them. This means that you lose most of the speed advantage that binary files would give you in a lower-level language like C or C++.
Binary in-memory formats of primitive data types are almost always tied to the hardware (or more generally, the whole platform) that the memory is attached to. When you choose to write the same bits from memory to a file, the file format becomes hardware-dependent as well. One piece of hardware might not store floating-point real numbers exactly the same way as another, which means binary files written on one cannot be read on the other naively (care must be taken and the data carefully converted into the target format.) One major difference between hardware architectures is known as "endianness", which affects how multibyte primitives (e.g. a 4-byte integer, or an 8-byte float) are expected to be stored in memory (from highest-order byte to lowest-order, or vice versa, which are called "big endian" and "little endian" respectively.) Data written to a binary file on a big-endian architecture (e.g. PowerPC) and read verbatim on a little-endian architecture (e.g. x86) will have all the bytes in each primitive swapped from high-value to low-value, which means all (well, almost all) the values will be wrong.
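A common way around the endianness problem is to define the file's byte order explicitly and assemble/disassemble each value byte by byte, instead of dumping whatever the host happens to use. A minimal sketch (little-endian chosen arbitrarily, error handling omitted):

#include <cstdint>
#include <cstdio>

// Write a 32-bit value in a fixed (little-endian) byte order,
// regardless of the endianness of the machine running this code.
void write_u32_le(std::FILE* f, uint32_t v)
{
    unsigned char b[4] = {
        (unsigned char)( v        & 0xFF),
        (unsigned char)((v >> 8)  & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 24) & 0xFF),
    };
    std::fwrite(b, 1, 4, f);
}

uint32_t read_u32_le(std::FILE* f)
{
    unsigned char b[4];
    std::fread(b, 1, 4, f);
    return  (uint32_t)b[0]        | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}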
Since you mention 3D model data, let me give you an example of what formats are used in a typical game engine. The game engine runtime will most likely need the most speed it can have in reading the models, and 3D models are large, so usually it has a very specific, and not-at-all-portable format for its model files. But that format would most likely not be supported by any modeling software. So you need to write a converter (also called an exporter or importer) that would take a common, generally-used format (e.g. OBJ, DAE, etc.) and convert that into the engine-specific, proprietary format. But as I mentioned, reading/transferring/working-with a text-based format is easier than a binary format, so you usually would choose a text-based common format to export your models into, then run the converter on them to the optimized, binary, engine-specific runtime format.
You might prefer binary format if:
You want more compact encoding (fewer bytes - because text encoding will probably take more space).
Precision - because if you encode as text you might lose precision - but maybe there are ways to encode as text without losing precision*.
Performance is probably also another advantage of binary encoding.
Since you mention the data in question is 3D model data, compactness of encoding (and maybe also performance) and precision may be relevant for you. On the other hand, text encoding is human-readable.
That said, with binary encoding you typically have issues like endianness, and the fact that the float representation may be different on different machines, but here is a way to encode floats (or doubles) in binary in a portable way:
#include <stdint.h>

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while (fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while (fnorm < 1.0) { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL<<significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1<<(expbits-1)) - 1); // shift + bias

    // return the final answer
    return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
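As a usage sketch (not part of the function above): a call like pack754(value, 32, 8) produces the standard 32-bit single-precision layout, whose bytes you would then write in a defined order so the file stays portable. Here 'file' is assumed to be an already-open FILE*, and <stdio.h> is assumed to be included:

/* 32 total bits, 8 exponent bits -> IEEE-754 single-precision layout */
uint64_t bits = pack754(3.14159f, 32, 8);
unsigned char out[4] = {
    (unsigned char)( bits        & 0xFF),
    (unsigned char)((bits >> 8)  & 0xFF),
    (unsigned char)((bits >> 16) & 0xFF),
    (unsigned char)((bits >> 24) & 0xFF),
};
fwrite(out, 1, 4, file);   /* always little-endian on disk, regardless of host */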
*: In C, since C99 there seems to be a way to do this (the "%a" hexadecimal floating-point conversion prints a value exactly), but I still think it will take more space.
Related
I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.S. It will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeatable parts and replaces them with a link/reference to the same data that was observed earlier in your stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least of which is storing numbers as a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
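For illustration, the variable-length integer ("varint") idea that Protocol Buffers uses can be sketched like this (a simplified sketch, not the actual protobuf code): each byte carries 7 payload bits, and the top bit says whether more bytes follow, so small numbers take a single byte.

#include <cstdint>
#include <vector>

// Simplified varint encoding: 7 payload bits per byte, high bit = "more follows".
// Small values take 1 byte; a full 32-bit value takes at most 5.
void write_varint(std::vector<uint8_t>& out, uint32_t value)
{
    while (value >= 0x80) {
        out.push_back((uint8_t)((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back((uint8_t)value);
}

uint32_t read_varint(const uint8_t*& p)
{
    uint32_t value = 0;
    int shift = 0;
    while (*p & 0x80) {
        value |= (uint32_t)(*p++ & 0x7F) << shift;
        shift += 7;
    }
    value |= (uint32_t)(*p++) << shift;
    return value;
}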
Text or binary, if you can sort your data before sending, it increases the chance for the gzip compressor to find redundant parts, and will most likely increase the compression ratio.
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly, so that the padding bits added to reach the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
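As a rough sketch of the "predict and send the residual" idea, with the simplest predictor (the previous value); the residuals are what you would then pass to a variable-length or Huffman coder:

#include <cstdint>
#include <vector>

// Residuals are actual minus predicted; here the prediction is simply the
// previous value, so correlated data produces small residuals.
std::vector<int32_t> encode_residuals(const std::vector<int32_t>& values)
{
    std::vector<int32_t> residuals;
    int32_t predicted = 0;                 // prediction for the first element
    for (int32_t v : values) {
        residuals.push_back(v - predicted);
        predicted = v;                     // next prediction = last actual value
    }
    return residuals;
}

std::vector<int32_t> decode_residuals(const std::vector<int32_t>& residuals)
{
    std::vector<int32_t> values;
    int32_t predicted = 0;
    for (int32_t r : residuals) {
        int32_t v = predicted + r;         // assumes differences fit in 32 bits
        values.push_back(v);
        predicted = v;
    }
    return values;
}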
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
Use binary instead of text.
A float in its text representation needs roughly eight or nine significant digits (nine to round-trip a 32-bit float exactly); with the decimal separator and a field separator, that is easily 10 bytes or more. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
But although this makes a lot of difference for the uncompressed file, these differences might disappear after compression, since the compression algorithm should implicitly take care of that. But you may try.
As I mentioned above,
1-) Is the size of the smallest unit of data written to a file stream in binary mode always 8 bits? If put() writes to the file whatever character is passed to it, can we say that it is always 8 bits?
2-) If we add an integer to a variable of char type, does the variable's position in the character set change by exactly that amount, regardless of how the bits of a char are represented in memory on whichever platform/machine it is tried on? And what happens if we exceed the range of values the variable can hold, on systems where char is signed or unsigned? Does it always wrap around from the end back to the beginning when adding, and the reverse when subtracting?
3-) What I really want to know is whether there is a portable way to store data in a file in binary mode, and how common file formats manage to be read and written without problems.
Thanks.
1) The C++ standard is pretty clear that a "byte" (or char) is not necessarily 8 bits, for one thing. Although machines with 9- or 12-bit char types are not very common, if you want extreme portability you need to take this into account in some way, e.g. by specifying that "our implementation expects a char to be 8 bits", which can of course be checked at compile time or at runtime using the CHAR_BIT macro from <climits>, e.g.:
#if (CHAR_BIT != 8)
#error This implementation requires CHAR_BIT == 8.
#endif
or
if (CHAR_BIT != 8)
{
    cerr << "Sorry, can't run on this platform, CHAR_BIT is not 8\n";
    exit(2);
}
2) Adding an int value to a char value will convert it to an int - if you then convert it back to a char, it should be consistent, yes. Although the behaviour is technically implementation-defined when the value overflows the signed char range, which can cause strange things (e.g. traps) on some machines.
3) As long as it's clearly defined and documented, a binary format can be made to work well in a portable scenario. See "JPG", "PNG" and to some degree "BMP" as examples where binary data is "quite portable". I'm not sure how well it works to display a JPG on a DEC-10 system with a 36-bit machine word, though.
1) No, the smallest unit of allocation is a disk page, as defined by the filesystem parameters. With most modern file systems this is 4k, though on some next-gen file systems an exceptionally small file's content can be stored in the inode, so the content itself takes no extra space on the disk. FAT and NTFS cluster sizes range from 4k to 64k depending on how the disk was formatted.
1a) "smallest read/write" unit is usually an 8-bit byte, though on some oddball systems use different byte sizes (CDC cyber comes to mind with a 12-bit byte). I can't think of any modern systems that use anything other than an 8-bit byte.
2) Adding an integer to a char will result in an integer-sized result. The compiler will implicitly promote the char to an integer before the arithmetic. This can then be downcast (by truncation, usually) back to a char.
3) Yes and yes. You have to thoroughly document the file formats, including the endianness of words if you plan to be running on different CPU architectures (i.e. Intel is little-endian, Motorola is big-endian, and some supercomputers use stranger orderings still). These different architectures will read and write words and dwords differently, and you may have to account for that in your reader code.
3a) This is fairly common (though now with XML and other self-defining semistructured formats perhaps less so), and so long as the documentation is complete, there are few issues in reading or writing these files.
I'm coding a game project as a hobby and I'm currently in the part where I need to store some resource data (.BMPs for example) into a file format of my own so my game can parse all of it and load into the screen.
For reading BMPs, I read the header, and then the RGB data for each pixel, and I have an array[width][height] that stores these values.
I was told I should save this type of data in binary, but not the reason. I've read about binary and what it is (the 0-1 representation of data), but why should I use it to save .BMP data, for example?
If I'm going to read it later in the game, doesn't it just add more complexity and maybe even slow down the loading process?
And lastly, if it is better to save in binary (I'm guessing it is, seeing as how everyone seems to do so from what I researched in other game resource files) how do I read and write binary in C++?
I've seen lots of questions but with many different ways for many different types of variables, so I'm asking which is the best/more C++ish way of doing it?
You have it all backwards. A computer processor operates with data at the binary level. Everything in a computer is binary. To deal with data in human-readable form, we write functions that jump through hoops to make that binary data look like something that humans understand. So if you store your .BMP data in a file as text, you're actually making the computer do a whole lot more work to convert the .BMP data from its natural binary form into text, and then from its text form back into binary in order to display it.
The truth of the matter is that the more you can handle data in its raw binary form, the faster your code will be able to run. Fewer conversions mean faster code. But there's obviously a tradeoff: If you need to be able to look at data and understand it without pulling out a magic decoder ring, then you might want to store it in a file as text. But in doing so, we have to understand that there is conversion processing that must be done to make that human-readable text meaningful to the processor, which as I said, operates on nothing but pure binary data.
And, just in case you already knew that or sort-of-knew-it, and your question was "why should I open my .bmp file in binary mode and not in text mode", then the reason for that is that opening a file in text mode asks the platform to perform CRLF-to-LF conversions ("\r\n"-to-"\n" conversions), as necessary based on the platform, so that at the internal string-processing level, all you're dealing with is '\n' characters. If your file consists of binary data, you don't want that conversion going on, or else it will corrupt the data from the file as you read it. In this state, most of the data will be fine, and things may work fine most of the time, but occasionally you'll run across a pair of bytes of the hexadecimal form 0x0d,0x0a (decimal 13,10) that will get converted to just 0x0a (10), and you'll be missing a byte in the data you read. Therefore be sure to open binary files in binary mode!
OK, based on your most recent comment (below), here's this:
As you (now?) understand, data in a computer is stored in binary format. Yes, that means it's in 0's and 1's. However, when programming, you don't actually have to fiddle with the 0's and 1's yourself, unless you're doing bitwise logical operations for some reason. A variable of type, let's say int for example, is a collection of individual bits, each of which can be either 0 or 1. It's also a collection of bytes, and assuming that there are 8 bits in a byte, then there are generally 2, 4, or 8 bytes in an int, depending on your platform and compiler options. But you work with that int as an int, not as individual 0's and 1's. If you write that int out to a file in its purest form, the bytes (and thus the bits) get written out in an unconverted raw form. But you could also convert them to ASCII text and write them out that way. If you're displaying an int on the screen, you don't want to see the individual 0's and 1's of course, so you print it in its ASCII form, generally decoded as a decimal number. You could just as easily print that same int in its hexadecimal form, and the result would look different even though it's the same number. For example, in decimal, you might have the decimal value 65. That same value in hexadecimal is 0x41 (or, just 41 if we understand that it's in base 16). That same value is the letter 'A' if we display it in ASCII form (and consider only the low byte of the 2-, 4-, or 8-byte int, i.e. treat it as a char).
For the rest of this discussion, forget that we were talking about an int and now consider that we're discussing a char, or 1 byte (8 bits). Let's say we still have that same value, 65, or 0x41, or 'A', however you want to look at it. If you want to send that value to a file, you can send it in its raw form, or you can convert it to text form. If you send it in its raw form, it will occupy 8 bits (one byte) in the file. But if you want to write it to the file in text form, you'd convert it to ASCII, and depending on the format you want to write it in and the actual value (65 in this case), it will occupy either 1, 2, or 3 bytes. Say you want to write it in decimal ASCII with no padding characters. The value 65 will then take 2 bytes: one for the '6' and one for the '5'. If you want to print it in hexadecimal form, it will still take 2 bytes: one for the '4' and one for the '1', unless you prepend it with "0x", in which case it will take 4 bytes, one for '0', one for 'x', one for '4', and another for '1'. Or suppose your char is the value 255 (the maximum value of an unsigned char): If we write it to the file in decimal ASCII form, it will take 3 bytes. But if we write that same value in hexadecimal ASCII form, it will still take 2 bytes (or 4, if we're prepending "0x"), because the value 255 in hexadecimal is 0xFF. Compare this to writing that 8-bit byte (char) in its raw binary form: A char takes 1 byte (by definition), so it will consume only 1 byte of the file in binary form regardless of what its value is.
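Since the question also asks how to actually read and write binary in C++, here is a minimal sketch using the standard streams (the file name and data are made up; note that this writes whatever byte layout and endianness your platform uses, so it is not a portable format by itself):

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    std::vector<uint32_t> pixels = { 0xFF0000, 0x00FF00, 0x0000FF };

    // ios::binary suppresses the text-mode newline translation described above.
    std::ofstream out("pixels.dat", std::ios::binary);
    out.write(reinterpret_cast<const char*>(pixels.data()),
              pixels.size() * sizeof(uint32_t));
    out.close();

    std::vector<uint32_t> loaded(pixels.size());
    std::ifstream in("pixels.dat", std::ios::binary);
    in.read(reinterpret_cast<char*>(loaded.data()),
            loaded.size() * sizeof(uint32_t));
    // 'loaded' now holds exactly the bytes that were written out.
}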
I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double: s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
# encode: quantize each sample to a fixed-point Z-score
for u in samples:
    z = int( 16384*(u-m)/s )

# decode: recover an approximation of the original sample
for z in scaled_samples:
    u = s*(z/16384.0)+m
Your Z-scores retain a pleasant, easy-to-work-with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta-encoding scheme, for example, the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.), and the header would contain a single uncompressed sample. Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
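As a sketch of the fixed-size-block idea (the field names here are invented for illustration): each block starts with a small header that makes it self-contained, so a random access only has to locate and decode one block.

#include <cstdint>
#include <cstddef>

// Hypothetical layout: the file is a sequence of fixed-size compressed blocks.
struct BlockHeader {
    double   baseline;        // uncompressed first sample of the block
    uint32_t sample_count;    // number of samples encoded in this block
    uint32_t payload_bytes;   // size of the compressed delta payload
};

// With fixed-size blocks, finding the block that holds sample i is arithmetic.
std::size_t block_for_sample(std::size_t i, std::size_t samples_per_block)
{
    return i / samples_per_block;
}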
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: after converting to an array of XOR deltas, zlib shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0010
0000
1000
A bit of pseudocode:
compressed[0] = uncompressed[0]
loop
    compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
loop
    uncompressed[i] = uncompressed[i-1] ^ compressed[i]
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
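A concrete C++ version of the pseudocode above might look like this; the output is then handed to RLE/zlib as described:

#include <cstddef>
#include <cstdint>
#include <vector>

// XOR each integer with its predecessor; the first element is kept as-is.
std::vector<uint32_t> xor_encode(const std::vector<uint32_t>& in)
{
    std::vector<uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (i == 0) ? in[0] : (in[i - 1] ^ in[i]);
    return out;
}

// Reverse the transform: each value is the previous decoded value XOR the delta.
std::vector<uint32_t> xor_decode(const std::vector<uint32_t>& in)
{
    std::vector<uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (i == 0) ? in[0] : (out[i - 1] ^ in[i]);
    return out;
}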
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'd otherwise encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and both run very fast and with very little memory (as opposed to, say, bzip).
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break each of your ints up into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present (a sketch appears below, after the other ideas).
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you may still have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create a byte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0-32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
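A sketch of the byte-plane split from the first idea above (the inverse transform simply undoes the indexing):

#include <cstddef>
#include <cstdint>
#include <vector>

// Split each 32-bit integer into its 4 bytes and group bytes by position
// (all byte 0s, then all byte 1s, ...). When consecutive ints differ by only
// a bit or two, each plane contains long runs of identical bytes, which
// RLE or zlib compress very well.
std::vector<uint8_t> to_byte_planes(const std::vector<uint32_t>& ints)
{
    const std::size_t n = ints.size();
    std::vector<uint8_t> planes(n * 4);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t b = 0; b < 4; ++b)
            planes[b * n + i] = (uint8_t)(ints[i] >> (8 * b));
    return planes;
}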
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8-bit byte laid out something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the number of the bit to flip relative to your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
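As a sketch, the encoder side of this scheme could be written with a simple MSB-first bit writer (assuming the input is non-empty and, as stated above, at most 3 bits ever change between consecutive integers):

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal MSB-first bit writer.
struct BitWriter {
    std::vector<uint8_t> bytes;
    int used = 0;                            // bits already used in the last byte
    void put(uint32_t value, int nbits) {    // append the low 'nbits' of value
        for (int i = nbits - 1; i >= 0; --i) {
            if (used == 0) bytes.push_back(0);
            bytes.back() |= (uint8_t)(((value >> i) & 1u) << (7 - used));
            used = (used + 1) % 8;
        }
    }
};

// First integer verbatim (32 bits), then per element a 2-bit count of changed
// bits followed by a 5-bit position for each changed bit.
std::vector<uint8_t> encode(const std::vector<uint32_t>& v)
{
    BitWriter w;
    w.put(v[0], 32);
    for (std::size_t i = 1; i < v.size(); ++i) {
        uint32_t diff = v[i - 1] ^ v[i];
        std::vector<uint32_t> pos;
        for (uint32_t b = 0; b < 32; ++b)
            if (diff & (1u << b)) pos.push_back(b);
        w.put((uint32_t)pos.size(), 2);      // 0-3 changed bits
        for (uint32_t b : pos) w.put(b, 5);  // which bits changed
    }
    return w.bytes;
}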
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer, which will tend towards 12/32 of original size, or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.