How do I write files that gzip well? - compression

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.s. it will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.

The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeatable parts and replaces them with link/reference to the same data that was observed before in your stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of efficient binary format is google protocol buffers. It has lots of benefits, and not least of them is storing numbers as variable number of bytes (i.e. number 1 consumes less space than number 1000000).
Text or binary, but if you can sort your data before sending, it can increase possibility for gzip compressor to find redundant parts, and most likely to increase compression ratio.

Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the added bits to go to the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.

Use binary instead of text.
A float in its text representation with 8 digits (a float has a precision of eight decimal digits), plus decimal separator, plus field separator, consumes 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes less digits.
But although this makes a lot of difference for the uncompressed file, these differences might disappear after compression, since the compression algo should implicitly take care if that. But you may try.

Related

Can true random data be losslessly compressed by this idea?

I am aware lossless compression relies on statistical redundancy. I had this idea on compression a random binary string, though, and I would like to know if (and why) it could/couldn't work:
As the binary string is random, it is expected that the probability of a different bit than the last bit is one half. That is, if the bit string is ...01101, the probability of the next bit being 0 is one half. That being said, half of the data is expected to "change its digit flow" at "one", let's say. Let's call N consecutive binary digits a "sequence" (note: a sequence of ones relies between zeros and vice-versa).
That being said, at random, it is expected:
1/2 (50%) of sequences of one digit
1/4 (25%) of sequences of two digits
1/8 (12.5%) of sequences of three digits
1/16 (6.25%) of sequences of four digits
...
1/(2^N) of sequences of N digits
Could this be exploited in order to compress the data? Such as:
Considering an infinite random binary string, picking up a sample of 2^M sequences, we know half of them will be sequences of one, one fourth will be sequences of two and so on. What is the proper logic to apply in order to compress the random data with efficiency? And, if not possible, why not possible?
No. Not by any idea.
If all files are compressed by even just one bit each, then by simple counting, you are assured that at least two distinct files are compressed to the exact same thing. (Actually far more than that, but I only need two to make the point.) Now your decompressor will produce a single result from that compressed input. That single result can match at most one of the distinct files. It therefore cannot losslessly compress and decompress the one that it didn't match.

Does Gzip/Deflate recognize patterns

I am studying the inner-workings of Gzip, and I understand that it uses a combination of Huffman Coding and LZ77.
I also realize that a Gzip file is divided into blocks, and each block has a dictionary built for it. Then frequent occurrences of similar data are replaced by pointers pointing at locations in the dictionary.
So the phrase "horses race other horses" would have the word horses replaced by a pointer.
However what if I have an array of 32 bit integers, but it only stores numbers up to 24 bits? For arguments sake lets say these 24 bit numbers are very random and hard to compress and hard to find repetition in.
This would make the first 8 bits of every integer an easy to compress string of 0's, but each string would need a pointer and each pointer still takes up some amount of data. Even a 1 bit pointer (which I know is smaller than what's realistically possible) would still take up 12.5% of the original space.
That would seem somewhat redundant when the array could easily be reduced to a "24 bit" array, with basic pattern recognition.
So my question is:
Does Gzip contain any mechanisms to better compress a file than dictionary pointers?
How well can Gzip compress small amounts of repetitive data, followed by small amounts of hard to compress data?
Each deflate block does not have a "dictionary built for it". What is built for each deflate block is a set of Huffman codes for the literal/length symbols and the distance symbols.
The dictionary you refer to is simply the 32K bytes of uncompressed input that immediately precede the bytes currently being compressed. That's it. Each length/distance pair can refer to a string of 3 to 258 bytes in the last 32K. That is independent of deflate blocks, and such references often go back one or more blocks.
Deflate will not do well trying to compress a sequence of three random bytes, zero byte, three random bytes, zero byte ... There will be no useful repeated strings, where deflate will only be able to Huffman code the literals, with zeros being more frequent. It would code zeros as two bits, since they occur a little more than 25% of the time, and the rest of the literals to at least 8.25 bits each. For this data that would give an average of about 6.7 bits per byte or a compressed ratio of 0.85. In fact gzip gives about 0.86 on this data.
If you want to compress that sequence, simply remove the zero bytes! Then you are done, with no further compression possible at a ratio of 0.75.

Writing numerical data to file as binary vs. written out?

I'm writing floating point numbers to file, but there's two different ways of writing these numbers and I'm wondering which to use.
The two choices are:
write the raw representative bits to file
write the ascii representation of the number to file
Option 1 seems like it would be more practical to me, since I'm truncating each float to 4 bytes. And parsing each number can be skipped entirely when reading. But in practice, I've only ever seen option 2 used.
The data in question is 3D model information, where small file sizes and quick reading can be very advantageous, but again, no existing 3D model format does this that I know of, and I imagine there must be a good reason behind it.
My question is, what reasons are there for choosing to write the written out form of numbers, instead of the bit representation? And are there situations where using the binary form would be preferred?
First of all, floats are 4 bytes on any architecture you might encounter normally, so nothing is "truncated" when you write the 4 bytes from memory to a file.
As for your main question, many regular file formats are designed for "interoperability" and ease of reading/writing. That's why text, which is an almost universally portable representation (character encoding issues notwithstanding,) is used most often.
For example, it is very easy for a program to read the string "123" from a text file and know that it represents the number 123.
(But note that text itself is not a format. You might choose to represent all your data elements as ASCII/Unicode/whatever strings of characters, and put all these strings along with each other to form a text file, but you still need to specify exactly what each element means and what data can be found where. For example, a very simplistic text-based 3D triangle mesh file format might have the number of triangles in the mesh on the first line of the file, followed by three triplets of real numbers on the next N lines, each, specifying the 9 numbers required for the X,Y,Z coordinates of the three vertices of a triangle.)
On the other hand are the binary formats. These usually have in them the data elements in the same format as they are found in computer memory. This means an integer is represented with a fixed number of bytes (1, 2, 4 or 8, usually in "two's complement" format) or a real number is represented by 4 or 8 bytes in IEEE 754 format. (Note that I'm omitting a lot of details for the sake of staying on point.)
Main advantages of a binary format are:
They are usually smaller in size. A 32-bit integer written as an ASCII string can get upto 10 or 11 bytes (e.g. -1000000000) but in binary it always takes up 4 bytes. And smaller means faster-to-transfer (over network, from disk to memory, etc.) and easier to store.
Each data element is faster to read. No complicated parsing is required. If the data element happens to be in the exact format/layout that your platform/language can work with, then you just need to transfer the few bytes from disk to memory and you are done.
Even large and complex data structures can be laid out on disk in exactly the same way as they would have been in memory, and then all you need to do to "read" that format would be to get that large blob of bytes (which probably contains many many data elements) from disk into memory, in one easy and fast operation, and you are done.
But that 3rd advantage requires that you match the layout of data on disk exactly (bit for bit) with the layout of your data structures in memory. This means that, almost always, that file format will only work with your code and your code only, and not even if you change some stuff around in your own code. This means that it is not at all portable or interoperable. But it is damned fast to work with!
There are disadvantages to binary formats too:
You cannot view or edit or make sense of them in a simple, generic software like a text editor anymore. You can open any XML, JSON or config file in any text editor and make some sense of it quite easily, but not a JPEG file.
You will usually need more specific code to read in/write out a binary format, than a text format. Not to mention specification that document what every bit of the file should be. Text files are generally more self-explanatory and obvious.
In some (many) languages (scripting and "higher-level" languages) you usually don't have access to the bytes that make up an integer or a float, not to read them nor to write them. This means that you'll lose most of the speed advantages that binary files give you when you are working in a lower-level language like C or C++.
Binary in-memory formats of primitive data types are almost always tied to the hardware (or more generally, the whole platform) that the memory is attached to. When you choose to write the same bits from memory to a file, the file format becomes hardware-dependent as well. One hardware might not store floating-point real numbers exactly the same way as another, which means binary files written on one cannot be read on the other naively (care must be taken and the data carefully converted into the target format.) One major difference between hardware architectures is know as "endianness" which affects how multibyte primitives (e.g. a 4-byte integer, or an 8-byte float) are expected to be stored in memory (from highest-order byte to the lowest-order, or vice versa, which are called "big endian" and "little endian" respectively.) Data written to a binary file on a big-endian architecture (e.g. PowerPC) and read verbatim on a little-endian architecture (e.g. x86) will have all the bytes in each primitive swapped from high-value to low-value, which means all (well, almost all) the values will be wrong.
Since you mention 3D model data, let me give you an example of what formats are used in a typical game engine. The game engine runtime will most likely need the most speed it can have in reading the models, and 3D models are large, so usually it has a very specific, and not-at-all-portable format for its model files. But that format would most likely not be supported by any modeling software. So you need to write a converter (also called an exporter or importer) that would take a common, generally-used format (e.g. OBJ, DAE, etc.) and convert that into the engine-specific, proprietary format. But as I mentioned, reading/transferring/working-with a text-based format is easier than a binary format, so you usually would choose a text-based common format to export your models into, then run the converter on them to the optimized, binary, engine-specific runtime format.
You might prefer binary format if:
You want more compact encoding (fewer bytes - because text encoding will probably take more space).
Precision - because if you encode as text you might lose precision - but maybe there are ways to encode as text without losing precision*.
Performance is probably also another advantage of binary encoding.
Since you mention data in question is 3D model simulation, compactness of encoding (maybe also performance) and precision maybe relevant for you. On the other hand, text encoding is human readable.
That said, with binary encoding you typically have issues like endianness, and that float representation maybe different on different machines, but here is a way to encode floats (or doubles) in binary format in a portable way:
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
*: In C, since C99 there seems a way to do this, but still I think it will take more space.

How do I compress a large number of similar doubles?

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?
Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z score. Normal Z-score transformations use a double, but you should use a fixed-point version of that double. s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
for u in samples:
z = int( 16384*(u-m)/s )
for z in scaled_samples:
u = s*(z/16384.0)+m
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-but Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
Whatever compression scheme you pick, you can decouple that from the problem of needing to be able to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it (e.g. for a delta encoding scheme, the block would contain deltas enconded in some fashion that takes advantage of their small magnitude to make them take less space, e.g. fewer bits for exponent/mantissa, conversion to fixed-point value, Huffman encoding etc; and the header a single uncompressed sample); seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0010
0000
1000
a bit of pseudo code
compressed[0] = uncompressed[0]
loop
compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
loop
uncompressed[i] = uncompressed[i-1] ^ compressed[i]
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break your ints up each into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you still may have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create ahebyte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in the byte, with a sentinel (say 32) to indicate when you're done. This could result the raw number of bytes needed to represent and integer to something close to 2 on average, which most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8 bit byte something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.