I am trying to understand the deflate algorithm, and I have read up on Huffman codes as well as LZ77 compression. I was toying around with compression sizes of different strings, and I stumbled across something I could not explain. The string aaa when compressed, both through zlib and gzip, turns out to be the same size as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (36 as).
Before reading about this I would have assumed the compressor does something like storing 36*a instead of each character individually, but I could not find anywhere in the specifications where that is mentioned.
Using fixed Huffman codes yielded the same result, so I assume the space saving lies in LZ77, but that only uses distance-length pairs. How would those allow a 3-character string to expand twelvefold without increasing in size?
Interrupting the string of as with one or several bs in the middle drastically increases the size. If distance-length pairs are what's doing the job, why could they not just skip over the bs when searching backwards? Or are Huffman codes being utilized, and have I misunderstood what fixed Huffman codes imply?
The 36 "a"s are effectively run-length encoded by LZ77 by giving the first "a" as a literal, and then a match with a distance of one, and a length of 35. The length can be as much as 258 for deflate.
Look online for tutorials on LZ77, Huffman coding, and deflate. You can disassemble the resulting compressed data with infgen to get more insight into how the data is being represented.
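To make that concrete, here is a minimal Python sketch (not zlib's actual code) of the LZ77 copy rule. Because the copy proceeds one symbol at a time, a match is allowed to overlap the output it is producing, and that is what lets a distance of one act as run-length encoding:

```python
def lz77_decode(tokens):
    """Decode a token stream: each token is either a literal character
    or a (distance, length) pair. The copy is done symbol by symbol,
    so a match may overlap the data it is still producing."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])  # may read what we just wrote
        else:
            out.append(tok)
    return "".join(out)

# One literal "a" plus one match (distance=1, length=35) -> 36 "a"s:
print(lz77_decode(["a", (1, 35)]))  # 36 "a"s
```

With a distance of one, each copied symbol is the symbol written immediately before it, so the run feeds on itself; in deflate the length field can go up to 258.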
I started to write a little program that allows me to compress a single file using the LZ77 compression algorithm. It works fine. Now I'm thinking about how to store the data. In LZ77, compressed data consists of a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What could be a right way to store these triplets? I thought about: <11, 5, 8> bits, then:
2048 positions for look backward
32 max length of matched string
next character is 1 byte.
This format works quite well for text compression, but it performs poorly for my purpose (video made of binary images); it even increases the size compared to the original file. Do you have any suggestions?
What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, a length, and a literal byte. If it is a zero, it is followed by only a literal byte.
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literals could be combined into a single code, as deflate does, to remove even that one bit.
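A sketch of that flag-bit layout, using Python lists of symbols rather than real bit packing (the token format here is an assumption for illustration, not deflate's):

```python
def encode(tokens):
    """Each token is (distance, length, literal); distance is None and
    length is 0 when there was no match. A leading flag symbol decides
    which layout follows."""
    stream = []
    for distance, length, literal in tokens:
        if length:
            stream += [1, distance, length, literal]  # flag 1: full triplet
        else:
            stream += [0, literal]                    # flag 0: literal only
    return stream

def decode(stream):
    """Invert encode(): the flag at each position tells us how many
    symbols to consume next."""
    tokens, i = [], 0
    while i < len(stream):
        if stream[i] == 1:
            tokens.append((stream[i + 1], stream[i + 2], stream[i + 3]))
            i += 4
        else:
            tokens.append((None, 0, stream[i + 1]))
            i += 2
    return tokens
```

In a real encoder each element would then be written with a variable number of bits (or a Huffman code) rather than as a whole symbol, but the flag-bit saving is the same.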
I'm trying to understand how the DEFLATE algorithm works. I found this document published by UC Davis, but I don't understand the part where it talks about how Huffman trees are transmitted:
Probably the trickiest part of the DEFLATE specification to understand
is the way trees are encoded to go along with the data, when that data
is compressed with specialized trees.
The trees are transmitted by their codelengths, as previously
discussed. The codelengths are put all together into a sequence of
numbers between 0 and 15 (the Huffman trees that are created must be
kept to codelengths of no more than 15; this is the tricky part, not
the part about constraining the order of the elements).
Not all the elements have to be given codelengths; if the last
elements of an alphabet are of 0 codelengths, they can and probably
should be left out. The number of elements in each of the two
alphabets will be transmitted, so the trimmed alphabets go together
into a single sequence.
First of all, what does codelength mean exactly, and why can it be 0?
Also, I didn't quite understand the run-length compression they mention right after the last paragraph:
Once this sequence of codelengths is assembled, it is compressed with
a form of what is called run-length compression. When several elements
in a row have the same codelength (often 0) special symbols may be
used to indicate the number of elements with this codelength. Our
sequence is now a sequence of numbers between 0 and 18 (possibly with
extra bits forming integers to modify base values, as was the case
with the length and distance codes).
A Huffman tree is created for this alphabet of 0-18. Sigh. The
sequence of 0-18 codes and extra bits is prepared with the Huffman
codes replacing the 0-18 elements.
A codelength is the length of the code in bits for that symbol.
A zero codelength means that that symbol does not appear in the compressed data, so there is no code for that symbol.
Run-length encoding means, in this case, that a sequence of repeated codelengths, e.g. "7, 7, 7, 7, 7, 7", is replaced by "7, repeat the last length 5 times".
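As a simplified sketch (not zlib's implementation), deflate's code-length alphabet uses symbol 16 for "repeat the previous length 3 to 6 more times" and symbols 17 and 18 for runs of zeros (3 to 10 and 11 to 138 long, respectively):

```python
def rle_codelengths(lengths):
    # Simplified sketch of deflate's code-length RLE (RFC 1951):
    #   16 -> repeat the previous length 3..6 more times
    #   17 -> run of zeros, 3..10 long
    #   18 -> run of zeros, 11..138 long
    # Output is (symbol, repeat_count) pairs; plain lengths carry None.
    # (Real deflate chains further repeat symbols for very long runs;
    # this sketch simply re-emits the value and starts over.)
    out, i = [], 0
    while i < len(lengths):
        run = 1
        while i + run < len(lengths) and lengths[i + run] == lengths[i]:
            run += 1
        if lengths[i] == 0 and run >= 3:
            n = min(run, 138)
            out.append((18, n) if n >= 11 else (17, n))
            i += n
        elif lengths[i] != 0 and run >= 4:
            out.append((lengths[i], None))   # emit the length once...
            n = min(run - 1, 6)
            out.append((16, n))              # ...then "repeat it n times"
            i += 1 + n
        else:
            out.append((lengths[i], None))
            i += 1
    return out

print(rle_codelengths([7, 7, 7, 7, 7, 7]))  # [(7, None), (16, 5)]
```

The repeat counts are sent as extra bits after the 16/17/18 symbol, and the 0-18 symbols themselves are then Huffman coded, which is the "tree used to send the trees" part of the specification.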
I've recently implemented Huffman compression in C++. If I were to store the results as a string of 1s and 0s, it would take up a lot more space, as each 1 and 0 is a character. Alternatively, I was thinking maybe I could break the binary into sections of 8 and put characters in the text file, but that would be kind of annoying (so hopefully that can be avoided). My question here is: what is the best way to store binary in a text file in terms of character efficiency?
[To recap the comments...]
My question here is what is the best way to store binary in a text file in terms of character efficiency?
If you can store the data as-is, then do so (in other words, do not use any encoding; simply save the raw bytes).
If you need to store the data within a text file (for instance as a paragraph or as a quoted string), then you have many ways of doing so. For instance, base64 is a very common one, but there are many others.
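For the Huffman output in the question, a common approach is to pack the '0'/'1' characters into raw bytes first, and only then base64-encode the bytes if a text-safe form is required. A small Python sketch:

```python
import base64

def pack_bits(bitstring):
    """Pack a string of '0'/'1' characters into raw bytes,
    8 bits per byte (the last byte is zero-padded)."""
    padded = bitstring + "0" * (-len(bitstring) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

bits = "0110100001101001"          # 16 bits
raw = pack_bits(bits)              # 2 bytes
print(raw)                         # b'hi'
print(base64.b64encode(raw))       # b'aGk=' -- safe to embed in text
```

Note that a decoder also needs to know how many bits of the final byte are real, so real formats store the bit count (or the padding length) alongside the data; that bookkeeping is omitted here.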
Why does LZ77 DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is it about the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?
LZW tries to take advantage of repeated strings, just like the first "stage", as you call it, of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.
Mark Adler could best answer this question.
The details of how the LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length-distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
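For example, length code 265 means "base length 11, then read 1 extra bit", so it covers lengths 11 and 12. A sketch using a fragment of the real table from RFC 1951:

```python
# Fragment of deflate's length-code table (RFC 1951, section 3.2.5):
# code -> (number of extra bits, base length)
LENGTH_CODES = {
    257: (0, 3), 258: (0, 4), 259: (0, 5), 260: (0, 6),
    261: (0, 7), 262: (0, 8), 263: (0, 9), 264: (0, 10),
    265: (1, 11), 266: (1, 13), 267: (1, 15), 268: (1, 17),
    269: (2, 19), 270: (2, 23), 271: (2, 27), 272: (2, 31),
}

def decode_length(code, extra_value):
    """Add the extra bits (already read from the stream as an integer)
    to the base value of the length code."""
    extra_bits, base = LENGTH_CODES[code]
    assert 0 <= extra_value < (1 << extra_bits)
    return base + extra_value

print(decode_length(265, 1))  # -> 12
```

Distance codes work the same way, with their own base-plus-extra-bits table; the extra bits are read raw from the stream, not Huffman coded.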
...
Read the whole deal here.
Long story short: LZ77 provides duplicate elimination, and Huffman coding provides bit reduction. It's also on the wiki.
I have difficulties with how to use RLE on sequences of symbols.
For example, I can do RLE encoding on strings like
"ASSSAAAEERRRRRRRR"
which will be transformed to:
"A1S3A3E2R8".
But I'd like to perform RLE on strings like
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
which will be transformed to:
"X3Y5(1ADEFC)1(EDCADD)1(1ADEFC)3"
Is there a way to achieve this? The job becomes a bit easier because the long strings always appear in brackets. Could you give advice on how to do this in C++?
If there is a better way to store the values than using brackets, it would also be great if you could recommend one.
You should break down this problem into smaller parts. First, you should have a function that tokenizes your stream and returns each individual part. For this example input stream:
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
this function will return the following elements, one per call:
X
X
X
Y
Y
Y
Y
Y
(1ADEFC)
(EDCADD)
(1ADEFC)
(1ADEFC)
(1ADEFC)
<eof>
If you get this function correctly implemented, then the RLE algorithm that you already implemented for single characters should be easily adapted to support longer strings.
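A sketch of both steps in Python (the question asks for C++, but the structure ports directly; the regular expression assumes a group never contains a closing parenthesis):

```python
import re

def tokenize(s):
    """Split into parenthesized groups and single characters.
    The alternation tries the group pattern first, so "(1ADEFC)"
    comes back as one token rather than eight."""
    return re.findall(r"\([^)]*\)|.", s)

def rle(s):
    """Run-length encode a token stream: each token is followed by
    the number of consecutive occurrences."""
    tokens = tokenize(s)
    out, i = [], 0
    while i < len(tokens):
        run = 1
        while i + run < len(tokens) and tokens[i + run] == tokens[i]:
            run += 1
        out.append(f"{tokens[i]}{run}")
        i += run
    return "".join(out)

print(rle("XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"))
# -> "X3Y5(1ADEFC)1(EDCADD)1(1ADEFC)3"
```

The same `rle` function reproduces the single-character example from the question as well, since a lone character is just a one-character token.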
Since you mention your intention is to RLE encode the data to later use gzip compression and achieve better compression, my answer is: don't bother encoding it first. gzip compression uses DEFLATE, which is a generalization of run-length encoding that can take advantage of runs of strings of characters. You won't get better compression by applying the same algorithm twice, and in fact you may even lose a bit of compression.
If you insist on performing your own RLE, then it may be better to store the token length instead of using parentheses. That is, instead of (1ADEFC)3, use 61ADEFC3. Also note that you intend to compress pixels, which use the full range of byte values. Keep that in mind, as an algorithm written to work with strings would not be appropriate for raw data with embedded nulls and non-printable characters all around.
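A sketch of the length-prefix idea (decimal digits are used here for readability; with raw binary data the length and the repeat count would each be a single byte):

```python
def encode_token(token, count):
    """Length-prefixed form: <len><token><count> instead of
    (token)<count>. With decimal digits this only works for tokens
    shorter than 10 characters; a binary format would use one byte
    for the length and one for the count instead."""
    return f"{len(token)}{token}{count}"

print(encode_token("1ADEFC", 3))  # -> "61ADEFC3"
```

Dropping the brackets saves two characters per distinct token and, more importantly, removes the need for any character to be reserved as a delimiter, which matters once the data can contain arbitrary byte values.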