Compression algorithms with small dictionaries

I'm looking for a compression algorithm that works with symbols smaller than a byte. I did some quick research on compression algorithms, and it has been hard to find out what symbol size they use. In any case, there are streams whose symbols are smaller than 8 bits. Is there a parameter for DEFLATE to define the size of its symbols?

plaintext symbols smaller than a byte
The original descriptions of LZ77 and LZ78 describe them in terms of a sequence of decimal digits (symbols that are approximately half the size of a byte).
If you google for "DNA compression algorithm", you can find a lot of information on algorithms specialized for compressing files that are almost entirely composed of the 4 letters A, G, C, T: a dictionary of 4 symbols, each about a quarter the size of a byte.
Perhaps one of those algorithms might work for you with relatively little tweaking.
The LZ77-style compression used in LZMA may appear to use two bytes per symbol for the first few symbols that it compresses.
But after compressing a few hundred plaintext symbols
(the letters of natural-language text, or sequences of decimal digits, or sequences of the 4 letters that represent DNA bases, etc.), the two-byte compressed "chunks" that LZMA puts out often represent a dozen or more plaintext symbols.
(I suspect the same is true for all similar algorithms, such as the LZ77 algorithm used in DEFLATE).
If your files use only a restricted alphabet of much less than all 256 possible byte values,
in principle a programmer could adapt a variant of DEFLATE (or some other algorithm) that could make use of information about that alphabet to produce compressed files a few bits smaller in size than the same files compressed with standard DEFLATE.
However, many byte-oriented text compression algorithms (LZ77, LZW, LZMA, DEFLATE, etc.) build a dictionary of common long strings, and with a sufficiently large source file they may give compression performance within a few percent of that custom-adapted variant. Often the advantage of using a standard compressed file format is worth sacrificing a few percent of potential space savings.
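As an illustration of both points, here is a minimal sketch (the CODES table and the pack_bases helper are made up for this example) that packs an A/C/G/T alphabet into 2 bits per base and compares it with plain zlib (DEFLATE) run on the one-byte-per-base text:

    import zlib

    # A/C/G/T -> 2 bits per base; CODES and pack_bases are hypothetical helpers
    CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_bases(seq):
        """Pack a string of A/C/G/T into 2 bits per base."""
        out = bytearray()
        buf, nbits = 0, 0
        for base in seq:
            buf = (buf << 2) | CODES[base]
            nbits += 2
            if nbits == 8:
                out.append(buf)
                buf, nbits = 0, 0
        if nbits:                          # flush a partial final byte
            out.append(buf << (8 - nbits))
        return bytes(out)

    seq = "ACGTTGCAAC" * 10000             # 100,000 bases of toy data
    packed = pack_bases(seq)               # always 2 bits per base: 25,000 bytes
    deflated = zlib.compress(seq.encode("ascii"), 9)
    print(len(seq), len(packed), len(deflated))

On this repetitive toy string zlib wins easily; on nearly random base data the 2-bit packing approaches the 2-bits-per-base entropy limit, and the packed form can still be fed through zlib afterwards.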
compressed symbols smaller than a byte
Many compression algorithms, including some that give the best known compression on benchmark files, output compressed information bit-by-bit (such as most of the PAQ series of compressors, and some kinds of arithmetic coders), while others output variable-length compressed information without regard for byte boundaries (such as Huffman compression).
Some ways of describing arithmetic coding talk about pieces of information, such as individual bits or pixels, that are compressed to "less than one bit of information".
EDIT:
The "counting argument" explains why it's not possible to compress all possible bytes, much less all possible bytes and a few common sequences of bytes, into codewords that are all less than 8 bits long.
Nevertheless, several compression algorithms can and often do represent some bytes or (more rarely) some sequences of bytes, each with a codeword that is less than 8 bits long, by "sacrificing" or "escaping" less-common bytes, which end up represented by other codewords that (including the "escape") are more than 8 bits long.
Such algorithms include:
the Pike text compression scheme, which uses 4-bit coding
byte-oriented Huffman
several combined algorithms that do LZ77-like parsing of the file into "symbols", where each symbol represents a sequence of bytes, and then Huffman-compress those symbols, such as DEFLATE, LZX, LZH, LZHAM, etc.
The Pike algorithm uses the 4 bits "0101" to represent 'e' (or in some contexts 'E'), the 8 bits "0000 0001" to represent the word " the" (4 bytes, including the space before it) (or in some contexts " The" or " THE"), etc.
It has a small dictionary of about 200 of the most-frequent English words,
including a sub-dictionary of 16 extremely common English words.
When compressing English text with byte-oriented Huffman coding, the sequence "e " (e space) is compressed to two codewords with a total of typically 6 bits.
Alas, when Huffman coding is involved, I can't tell you the exact size of those "small" codewords, or even tell you exactly what byte or byte sequence a small codeword represents, because it is different for every file.
Often the same codeword represents a different byte (or different byte sequence) at different locations in the same file.
The decoder decides which byte or byte sequence a codeword represents based on clues left behind by the compressor in the headers, and on the data decompressed so far.
With range coding or arithmetic coding, the "codeword" may not even be an integer number of bits.
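To see roughly how short the codewords for common bytes come out, here is an illustrative sketch (not the Pike scheme; the sample text, the huffman_code_lengths helper, and the printed numbers are just for illustration) that computes byte-oriented Huffman codeword lengths for a toy English string:

    import heapq
    from collections import Counter

    # Illustrative sketch only (not the Pike scheme): compute byte-oriented
    # Huffman codeword lengths and look at the lengths assigned to common
    # bytes such as 'e' and the space.
    def huffman_code_lengths(data):
        """Map each byte value in `data` to its Huffman codeword length in bits."""
        freq = Counter(data)
        if len(freq) == 1:
            return {next(iter(freq)): 1}
        # heap entries: [subtree frequency, tie-breaker, symbols in this subtree]
        heap = [[n, i, [sym]] for i, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in freq}
        tie = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for sym in lo[2] + hi[2]:
                lengths[sym] += 1      # every merge adds one bit to these codewords
            heapq.heappush(heap, [lo[0] + hi[0], tie, lo[2] + hi[2]])
            tie += 1
        return lengths

    text = (b"this is a small sample of english text used only to estimate "
            b"byte frequencies for this sketch ") * 50
    lengths = huffman_code_lengths(text)
    print("'e':", lengths[ord("e")], "bits   ' ':", lengths[ord(" ")], "bits")

On real English text the space and 'e' typically get codewords of around 3 bits each, which is where the roughly 6 bits for "e " mentioned above comes from.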

You may want to look into Golomb codes. A Golomb code splits each value into a quotient (coded in unary) and a remainder (coded in binary) to compress the input. It's not a dictionary-based compression, but it's worth mentioning.
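For example, here is a minimal sketch of the power-of-two (Rice) special case of a Golomb code, where each value is split into a unary-coded quotient and a k-bit remainder; the function names and the bit-string output format are just for readability:

    def rice_encode(values, k):
        """Rice code (Golomb code with M = 2**k): quotient in unary, remainder in k bits.
        Returns the code as a string of '0'/'1' characters for readability."""
        bits = []
        for n in values:
            q, r = n >> k, n & ((1 << k) - 1)
            bits.append("1" * q + "0")                        # quotient, unary-coded
            bits.append(format(r, "0{}b".format(k)) if k else "")  # remainder, k bits
        return "".join(bits)

    def rice_decode(bits, k, count):
        """Inverse of rice_encode; decodes `count` values."""
        out, i = [], 0
        for _ in range(count):
            q = 0
            while bits[i] == "1":
                q, i = q + 1, i + 1
            i += 1                                            # skip the terminating '0'
            r = int(bits[i:i + k], 2) if k else 0
            i += k
            out.append((q << k) | r)
        return out

    values = [3, 0, 7, 18, 2, 5]
    encoded = rice_encode(values, 2)
    assert rice_decode(encoded, 2, len(values)) == values
    print(encoded, len(encoded), "bits")

The parameter k should be tuned to the typical magnitude of the values; Golomb/Rice codes pay off when small values are much more common than large ones.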

Related

When using Huffman encoding for binary data, how are characters determined?

All the examples of Huffman encoding I've seen use letters (A, B, C) as the character being encoded, in which they calculate the frequencies of each to generate the Huffman tree. What happens when the data you want to encode is binary? I've seen people treat each byte as a character, but why? It seems arbitrary to use 8 bits as the cutoff for a "character", why not 16? Why not 32 for 32-bit architecture?
It is perceptive of you to realize that Huffman encoding can work with more than 256 symbols.
A few implementations of Huffman coding work with far more than 256 symbols, such as
HuffWord, which parses English text into more-or-less English words (typically blocks of text with around 32,000 unique words) and generates a Huffman tree where each leaf represents an English word, encoded with a unique Huffman code
HuffSyllable, which parses text into syllables, and generates a Huffman tree where each leaf represents (approximately) an English syllable, encoded with a unique Huffman code
DEFLATE, which first replaces repeated strings with (length, offset) symbols, has several different Huffman tables: one optimized for representing distances (offsets), and another with up to 286 symbols where each leaf represents either a literal byte, the end-of-block marker, or a specific length (part of a (length, offset) symbol).
Some of the length-limited Huffman trees used in JPEG compression encode the quantized JPEG brightness values (from -2047 to +2047 or so) with a maximum code length of 16 bits.
On a 16-bit architecture or 32-bit architecture computer, ASCII text files and UTF-8 text files and photographs are pretty much the same as on 8-bit computers, so there's no real reason to switch to a different approach.
On a 16-bit architecture or 32-bit architecture,
typically machine code is 16-bit aligned, so static Huffman with 16-bit symbols may make sense.
Static Huffman has the overhead of transmitting information about the bit length of each symbol's codeword, so that the receiver can reconstruct the codewords necessary to decompress.
The 257 or so bitlengths in the header of 8-bit static Huffman are already too much for "short string compression".
As sascha pointed out, using 16 bits for a "character" would require much more overhead (65,000 or so bitlengths), so static Huffman coding with 16-bit inputs would only make sense for long files where that overhead is less significant.
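A quick way to see that fixed overhead is to feed short and long inputs to a general-purpose DEFLATE implementation. A sketch using Python's zlib (exact byte counts depend on the zlib version, but the short input typically comes out larger than it went in, largely because of per-stream and code-description overhead):

    import zlib

    s = b"hello, short strings barely compress"
    for data in (s, s * 200):
        print(len(data), "->", len(zlib.compress(data, 9)), "bytes")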

Does Gzip/Deflate recognize patterns

I am studying the inner-workings of Gzip, and I understand that it uses a combination of Huffman Coding and LZ77.
I also realize that a Gzip file is divided into blocks, and each block has a dictionary built for it. Then frequent occurrences of similar data are replaced by pointers pointing at locations in the dictionary.
So the phrase "horses race other horses" would have the word horses replaced by a pointer.
However, what if I have an array of 32-bit integers, but it only stores numbers up to 24 bits? For argument's sake, let's say these 24-bit numbers are very random, hard to compress, and hard to find repetition in.
This would make the first 8 bits of every integer an easy-to-compress string of 0's, but each string would need a pointer, and each pointer still takes up some amount of data. Even a 1-bit pointer (which I know is smaller than what's realistically possible) would still take up 12.5% of the original space.
That would seem somewhat redundant when the array could easily be reduced to a "24 bit" array, with basic pattern recognition.
So my question is:
Does Gzip contain any mechanisms to better compress a file than dictionary pointers?
How well can Gzip compress small amounts of repetitive data, followed by small amounts of hard to compress data?
Each deflate block does not have a "dictionary built for it". What is built for each deflate block is a set of Huffman codes for the literal/length symbols and the distance symbols.
The dictionary you refer to is simply the 32K bytes of uncompressed input that immediately precede the bytes currently being compressed. That's it. Each length/distance pair can refer to a string of 3 to 258 bytes in the last 32K. That is independent of deflate blocks, and such references often go back one or more blocks.
Deflate will not do well trying to compress a sequence of three random bytes, zero byte, three random bytes, zero byte, ... There will be no useful repeated strings, so deflate will only be able to Huffman code the literals, with zeros being more frequent. It would code zeros as two bits, since they occur a little more than 25% of the time, and the rest of the literals at 8.25 or more bits each. For this data that would give an average of about 6.7 bits per byte, or a compression ratio of 0.85. In fact gzip gives about 0.86 on this data.
If you want to compress that sequence, simply remove the zero bytes! Then you are done, with no further compression possible at a ratio of 0.75.
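A sketch reproducing that experiment with Python's zlib (the random data and the exact ratio differ slightly from run to run):

    import os, zlib

    # Three random bytes followed by a zero byte, repeated, as described above.
    rand = os.urandom(3 * 250_000)                       # the "hard to compress" 24-bit parts
    raw = b"".join(rand[i:i + 3] + b"\x00" for i in range(0, len(rand), 3))

    deflated = zlib.compress(raw, 9)
    stripped = bytes(b for i, b in enumerate(raw) if i % 4 != 3)   # drop the zero bytes

    print("deflate ratio    :", len(deflated) / len(raw))          # roughly 0.85-0.86
    print("strip zeros ratio:", len(stripped) / len(raw))          # exactly 0.75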

How do I represent an LZW output in bytes?

I found an implementation of the LZW algorithm, and I was wondering how I can represent its output, which is a list of ints, as a byte array.
I had tried with one byte but in case of long inputs the dictionary has more than 256 entries and thus I cannot convert.
Then I tried to add an extra byte to indicate how many bytes are used to store the values, but in this case I have to use 2 bytes for each value, which doesn't compress enough.
How can I optimize this?
As bits, not bytes. You just need a simple routine that writes an arbitrary number of bits to a stream of bytes. It simply keeps a one-byte buffer into which you put bits until you have eight bits. Then write that byte, clear the buffer, and start over. The process is reversed on the other side.
When you get to the end, just write the last byte buffer if not empty with the remainder of the bits set to zero.
You only need to figure out how many bits are required for each symbol at the current state of the compression. That same determination can be made on the other side when pulling bits from the stream.
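Here is a minimal sketch of such a bit writer/reader pair (the class names are made up, and MSB-first within each byte is just one possible convention), shown packing 12-bit LZW codes:

    # Hypothetical BitWriter/BitReader pair for packing variable-width codes.
    class BitWriter:
        def __init__(self):
            self.out = bytearray()
            self.buf = 0          # bits accumulated so far
            self.nbits = 0

        def write(self, value, width):
            """Append the low `width` bits of `value`, most significant bit first."""
            for shift in range(width - 1, -1, -1):
                self.buf = (self.buf << 1) | ((value >> shift) & 1)
                self.nbits += 1
                if self.nbits == 8:
                    self.out.append(self.buf)
                    self.buf, self.nbits = 0, 0

        def flush(self):
            """Pad the final partial byte with zero bits and return the bytes."""
            if self.nbits:
                self.out.append(self.buf << (8 - self.nbits))
                self.buf, self.nbits = 0, 0
            return bytes(self.out)

    class BitReader:
        def __init__(self, data):
            self.data = data
            self.pos = 0          # current bit position

        def read(self, width):
            value = 0
            for _ in range(width):
                byte = self.data[self.pos >> 3]
                value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
                self.pos += 1
            return value

    # Example: five 12-bit LZW codes packed into eight bytes (60 bits + padding).
    codes = [256, 65, 300, 4095, 7]
    w = BitWriter()
    for c in codes:
        w.write(c, 12)
    packed = w.flush()
    r = BitReader(packed)
    assert [r.read(12) for _ in codes] == codes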
In his 1984 article on LZW, T.A. Welch did not actually state how to "encode codes", but described mapping "strings of input characters into fixed-length codes", continuing that "use of 12-bit codes is common". (That allows a bijective mapping between three octets and two codes.)
The BSD compress(1) command didn't follow that literally, but introduced a header whose interesting part is a specification of the maximum number of bits used to encode an LZW output code, allowing decompressors to size decompression tables appropriately, or to fail early and in a controlled way. (Except for the very first,) codes were encoded with just the integral number of bits necessary, starting with 9.
An alternative would be to use arithmetic coding, especially if using a model other than "every code is equally probable".

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.S. It will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does: it looks for redundant/repeated parts and replaces them with a link/reference to the same data that was observed earlier in your stream.
If you want to make your data format more efficient, you should remove everything that could be possibly removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use text format, consider removing unnecessary spaces or linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least of which is storing numbers in a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
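For illustration, here is a sketch of the base-128 "varint" technique that Protocol Buffers uses for its variable-length integers (this is the general idea, not the protobuf library's own API):

    def encode_varint(n):
        """Base-128 varint: 7 data bits per byte, high bit set on all but the last byte."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_varint(data, pos=0):
        """Return (value, next position)."""
        value, shift = 0, 0
        while True:
            byte = data[pos]
            value |= (byte & 0x7F) << shift
            pos += 1
            if not byte & 0x80:
                return value, pos
            shift += 7

    assert encode_varint(1) == b"\x01"            # small numbers -> 1 byte
    assert len(encode_varint(1_000_000)) == 3     # larger numbers -> more bytes
    assert decode_varint(encode_varint(1_000_000))[0] == 1_000_000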
Whether your format is text or binary, if you can sort your data before sending, it can increase the chance that the gzip compressor finds redundant parts, and will most likely increase the compression ratio.
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the added bits to go to the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
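As a toy illustration of the difference trick, here is a sketch that compresses a slowly varying 16-bit "waveform" before and after delta-encoding it; the signal and the exact numbers are made up, but the effect is typical:

    import math, struct, zlib

    # A slowly varying 16-bit signal: raw samples compress poorly,
    # differences between successive samples compress much better.
    samples = [int(1000 * math.sin(i / 50) + 2000) for i in range(20000)]

    raw = struct.pack("<{}H".format(len(samples)), *samples)
    # Wrap negative differences modulo 2**16; a decoder rebuilds the samples
    # with a running sum modulo 2**16.
    deltas = [samples[0]] + [(samples[i] - samples[i - 1]) & 0xFFFF
                             for i in range(1, len(samples))]
    delta_raw = struct.pack("<{}H".format(len(deltas)), *deltas)

    print("raw samples :", len(zlib.compress(raw, 9)), "bytes")
    print("delta coded :", len(zlib.compress(delta_raw, 9)), "bytes")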
Use binary instead of text.
A float in its text representation with 8 significant digits (roughly the precision of a 32-bit float), plus a decimal separator, plus a field separator, consumes about 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
Although this makes a lot of difference for the uncompressed file, these differences might disappear after compression, since the compression algorithm should implicitly take care of that. But you may try.
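A quick sketch of that text-versus-binary difference, using random values so that neither representation compresses much and the 4-bytes-per-float advantage survives compression:

    import random, struct, zlib

    floats = [random.uniform(-1000, 1000) for _ in range(10000)]

    as_text = ",".join("{:.8g}".format(x) for x in floats).encode("ascii")
    as_binary = struct.pack("<{}f".format(len(floats)), *floats)

    print("text  :", len(as_text), "->", len(zlib.compress(as_text, 9)), "bytes")
    print("binary:", len(as_binary), "->", len(zlib.compress(as_binary, 9)), "bytes")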

compression zero bit sequences

I tried to find some library (C++) or algorithm which could compress an array of bits with these properties:
There are sequences of zero bits and sequences of bits which carry the information (1 or 0).
The sequences are usually 8-24 bits long.
I need a lossless compression which would take advantage of those zero bits.
How did I come to such sequences:
I serialize various variables into a byte array. I do this quite often to create snapshots, so these variables usually don't change much. I want to use this fact for compression. I don't know the types of those variables, just the byte length. So I take the bytes and create diff information against the previous snapshot using XOR.
If the variable changed just a bit, there will usually be many zero bits. That's the zero bit sequence. The rest of the bits carry the information, that's the information sequence.
For every variable, there will probably be 1 zero bit sequence and 1 information sequence.
EDIT:
So far I was considering these algorithms:
RLE - the information sequences would mess up the result
Some symbol coding (Huffman etc.) - the data probably won't share many "symbols"; it's not text, and the sequences are short. The whole array will usually be around 1000 bytes long.
If the ~1000 byte sequence has a lot of zero bytes, then just use a standard byte-oriented compression algorithm, such as zlib. You will get compression.
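A minimal sketch of the whole pipeline described in the question, assuming Python's zlib: XOR the snapshots, then let the byte-oriented compressor absorb the runs of zero bytes (the all-zero previous snapshot here is purely illustrative):

    import zlib

    def xor_diff(prev, cur):
        """Byte-wise XOR of two equal-length snapshots."""
        return bytes(a ^ b for a, b in zip(prev, cur))

    prev = bytes(1000)                     # previous snapshot (all zeros, just for the example)
    cur = bytearray(prev)
    for i in (17, 250, 251, 800):          # a few variables changed by a bit or two
        cur[i] ^= 0x05

    diff = xor_diff(prev, bytes(cur))
    print(len(diff), "->", len(zlib.compress(diff, 9)), "bytes")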