How are Huffman trees transmitted? - compression

I'm trying to understand how the DEFLATE algorithm works. I found this document published by UC Davis, but I don't understand the part that describes how Huffman trees are transmitted:
Probably the trickiest part of the DEFLATE specification to understand
is the way trees are encoded to go along with the data, when that data
is compressed with specialized trees.
The trees are transmitted by their codelengths, as previously
discussed. The codelengths are put all together into a sequence of
numbers between 0 and 15 (the Huffman trees that are created must be
kept to codelengths of no more than 15; this is the tricky part, not
the part about constraining the order of the elements).
Not all the elements have to be given codelengths; if the last
elements of an alphabet are of 0 codelengths, they can and probably
should be left out. The number of elements in each of the two
alphabets will be transmitted, so the trimmed alphabets go together
into a single sequence.
First of all, what exactly does codelength mean, and why can it be 0?
Also, I didn't quite understand run-length compression, which they mention right after the last paragraph:
Once this sequence of codelengths is assembled, it is compressed with
a form of what is called run-length compression. When several elements
in a row have the same codelength (often 0) special symbols may be
used to indicate the number of elements with this codelength. Our
sequence is now a sequence of numbers between 0 and 18 (possibly with
extra bits forming integers to modify base values, as was the case
with the length and distance codes).
A Huffman tree is created for this alphabet of 0-18. Sigh. The
sequence of 0-18 codes and extra bits is prepared with the Huffman
codes replacing the 0-18 elements.

A codelength is the length of the code in bits for that symbol.
A zero codelength means that that symbol does not appear in the compressed data, so there is no code for that symbol.
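To make this concrete: a DEFLATE decoder rebuilds the actual codes from the codelengths alone, using the canonical assignment described in RFC 1951, section 3.2.2. The following is a minimal sketch of that step, not zlib's implementation; the function name codes_from_lengths is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assign canonical Huffman codes from code lengths (RFC 1951, section 3.2.2).
// lengths[i] is the code length of symbol i; 0 means "symbol unused, no code".
// Hypothetical helper name, for illustration only.
std::vector<std::uint16_t> codes_from_lengths(const std::vector<int>& lengths) {
    constexpr int kMaxBits = 15;  // DEFLATE limits code lengths to 15 bits

    // Step 1: count how many symbols use each code length.
    std::vector<int> count(kMaxBits + 1, 0);
    for (int len : lengths) {
        assert(len >= 0 && len <= kMaxBits);
        ++count[len];
    }
    count[0] = 0;  // unused symbols do not consume any code space

    // Step 2: compute the smallest code value for each length.
    std::vector<std::uint16_t> next_code(kMaxBits + 1, 0);
    unsigned code = 0;
    for (int bits = 1; bits <= kMaxBits; ++bits) {
        code = (code + static_cast<unsigned>(count[bits - 1])) << 1;
        next_code[bits] = static_cast<std::uint16_t>(code);
    }

    // Step 3: hand out consecutive codes to symbols in symbol order.
    std::vector<std::uint16_t> codes(lengths.size(), 0);
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        if (lengths[i] != 0) {
            codes[i] = next_code[lengths[i]]++;
        }
    }
    return codes;
}
```

For the lengths (3, 3, 3, 3, 3, 2, 4, 4) used as an example in the RFC, this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111. That is why only the lengths need to be transmitted.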
Run-length encoding means, in this case, that a sequence of repeated codelengths, e.g. "7, 7, 7, 7, 7, 7", is replaced by "7, repeat the last length 5 times".
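As a rough sketch of that run-length step: in RFC 1951 (section 3.2.7), symbol 16 means "repeat the previous length 3 to 6 times", 17 means "repeat a zero length 3 to 10 times", and 18 means "repeat a zero length 11 to 138 times". The encoder below is a simple greedy one written for illustration; the names ClSymbol and rle_code_lengths are assumptions, and a real compressor may split runs differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One symbol of the 0..18 alphabet plus the value stored in its extra bits.
struct ClSymbol {
    int symbol;  // 0..15 = literal code length, 16/17/18 = repeat codes
    int extra;   // value carried in the extra bits (0 when there are none)
};

// Greedy run-length encoding of a code length sequence (illustrative sketch).
std::vector<ClSymbol> rle_code_lengths(const std::vector<int>& lengths) {
    std::vector<ClSymbol> out;
    std::size_t i = 0;
    while (i < lengths.size()) {
        const int len = lengths[i];
        assert(len >= 0 && len <= 15);

        // Measure the run of identical lengths starting at i.
        std::size_t run = 1;
        while (i + run < lengths.size() && lengths[i + run] == len) ++run;
        i += run;

        if (len == 0) {
            // 18: 11..138 zeros (7 extra bits), 17: 3..10 zeros (3 extra bits).
            while (run >= 11) {
                const std::size_t n = std::min<std::size_t>(run, 138);
                out.push_back({18, static_cast<int>(n - 11)});
                run -= n;
            }
            if (run >= 3) {
                out.push_back({17, static_cast<int>(run - 3)});
                run = 0;
            }
        } else {
            // Emit the length once, then 16 = "repeat previous 3..6 times".
            out.push_back({len, 0});
            --run;
            while (run >= 3) {
                const std::size_t n = std::min<std::size_t>(run, 6);
                out.push_back({16, static_cast<int>(n - 3)});
                run -= n;
            }
        }
        // Leftovers too short for a repeat code are written out literally.
        for (; run > 0; --run) out.push_back({len, 0});
    }
    return out;
}
```

With the input 7, 7, 7, 7, 7, 7 this produces two symbols: a literal 7, then symbol 16 with extra value 2 (meaning 3 + 2 = 5 repeats).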

Related

LZ77: storing format

I started writing a little program that compresses a single file using the LZ77 compression algorithm. It works fine. Now I'm thinking about how to store the data. In LZ77, compressed data consists of a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What would be a good way to store these triplets? I thought about <11, 5, 8> bits, that is:
2048 positions for looking backward
32 as the maximum length of a matched string
1 byte for the next character.
This format works quite well for text compression, but it performs poorly for my purpose (video made of binary images); it even increases the size compared to the original file. Do you have any suggestions?
What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, length, and literal byte. If it is a zero, it is followed by only a literal byte.
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literals could be combined into a single code, as deflate does, to remove even the one bit.
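A minimal sketch of that flag-bit layout, assuming the 11-bit distance and 5-bit length fields from the question and an MSB-first bit order (the bit order, and the names Token, BitWriter and write_token, are choices made here for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One LZ77 token: an optional match followed by a literal byte.
struct Token {
    bool has_match;
    unsigned distance;      // used only when has_match is true (11 bits)
    unsigned length;        // used only when has_match is true (5 bits)
    unsigned char literal;  // always present (8 bits)
};

// Appends bits to a byte vector, most significant bit first.
class BitWriter {
public:
    void put(std::uint32_t value, int nbits) {
        assert(nbits >= 0 && nbits <= 32);
        for (int i = nbits - 1; i >= 0; --i) {
            if (bit_count_ % 8 == 0) bytes_.push_back(0);
            if ((value >> i) & 1u) {
                bytes_.back() |= static_cast<std::uint8_t>(0x80u >> (bit_count_ % 8));
            }
            ++bit_count_;
        }
    }
    std::size_t bit_count() const { return bit_count_; }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t bit_count_ = 0;
};

// Flag bit 1: distance, length, literal. Flag bit 0: literal only.
void write_token(BitWriter& w, const Token& t) {
    if (t.has_match) {
        assert(t.distance < (1u << 11) && t.length < (1u << 5));
        w.put(1, 1);
        w.put(t.distance, 11);
        w.put(t.length, 5);
    } else {
        w.put(0, 1);
    }
    w.put(t.literal, 8);
}
```

With these widths a bare literal costs 9 bits instead of 24, while a token with a match costs 25 bits, so the format only pays off if matches are common or long enough.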

C++ test for validation UTF-8

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
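#include <gtest/gtest.h>
#include <string>
// validate_utf8 is the function under test, declared elsewhere.
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}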
How can I write incorrect UTF-8 cases in C++?
You can specify individual bytes in the string with the \x escape sequence in hexadecimal form or the \000 escape sequence in octal form.
For example:
std::string str = "\xD0";
which is an incomplete UTF-8 sequence: 0xD0 starts a two-byte sequence, but no continuation byte follows.
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF8 test cases.
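As a starting point, here is a small set of malformed byte strings written with \x escapes, together with a reference validator to check them against. Both are sketches: is_valid_utf8 is a hypothetical stand-in for the validate_utf8 under test, it follows the RFC 3629 rules as I understand them, and the sample list is far from exhaustive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in validator (hypothetical name): rejects truncated sequences, stray
// continuation bytes, overlong forms, surrogates and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }  // plain ASCII

        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t min_cp = 0;  // smallest code point allowed for this length
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3Fu);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}

// A few malformed inputs, one per kind of error.
std::vector<std::string> malformed_utf8_samples() {
    return {
        "\xD0",                  // lead byte with missing continuation byte
        "\x80",                  // continuation byte without a lead byte
        "\xC8\x41",              // lead byte followed by an ASCII byte
        "\xC0\xAF",              // overlong two-byte encoding of '/'
        "\xE0\x80\xAF",          // overlong three-byte encoding of '/'
        "\xED\xA0\x80",          // surrogate U+D800
        "\xF4\x90\x80\x80",      // U+110000, above the Unicode maximum
        "\xF8\x88\x80\x80\x80",  // five-byte form, not allowed any more
        "\xFF",                  // byte that never appears in UTF-8
    };
}

// Self-check: every malformed sample must be rejected.
void check_malformed_samples() {
    for (const std::string& s : malformed_utf8_samples()) {
        assert(!is_valid_utf8(s));
    }
}
```

In a GoogleTest case, the same samples could be looped over with EXPECT_FALSE(validate_utf8(s)), using the real function under test in place of the stand-in.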
In UTF-8, any byte whose most significant bit is 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also a one, this is the first byte of an MBS; otherwise it is one of the follow-up bytes.
In the first byte of an MBS, the number of leading one-bits gives the number of bytes in the entire sequence, e.g. 0b110xxxxx with arbitrary values for x is the start byte of a two-byte sequence.
Theoretically you could produce sequences of up to seven bytes this way, but the current standard (RFC 3629) limits them to four bytes.
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" would represent the sequence 0b11001000 0b10000101, which is a legal pattern and represents code point 0b01000 000101 (note how the leading bits representing the UTF-8 headers are cut away), corresponding to a value of 0x205 or 517. Whether that's a valid code point at all you need to look up; I just formed an arbitrary bit pattern as an example.
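The bit arithmetic for a two-byte sequence can be checked with a few lines of code. This is only a sketch for the two-byte case; decode_two_byte is a name made up here.

```cpp
#include <cassert>
#include <cstdint>

// Decode a two-byte UTF-8 sequence: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
// Hypothetical helper; it does not check for overlong encodings.
std::uint32_t decode_two_byte(unsigned char lead, unsigned char cont) {
    assert((lead & 0xE0) == 0xC0);  // lead byte must look like 110xxxxx
    assert((cont & 0xC0) == 0x80);  // follow-up byte must look like 10yyyyyy
    // Strip the header bits and join the 5 + 6 payload bits.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3Fu);
}
```

For the bytes 0xC8 0x85 this gives (0x08 << 6) | 0x05 = 0x205.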
In the same way you can represent longer valid sequences by increasing the number of most significant one-bits and adding the appropriate number of follow-up bytes (note again: the number of initial one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences by making the total number of bytes in the sequence not match (too many or too few) the number of initial one-bits.
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might now need to look up which of these code points are actually valid.
Finally, you might additionally consider composed characters (with diacritics): they can be represented as a sequence of characters (not bytes!) or as a single normalised character. If you want to go that far, you'd need to look up in the standard which combinations are legal and which normalised code points they correspond to.

How does DEFLATE optimize this so much?

I am trying to understand the deflate algorithm, and I have read up on Huffman codes as well as LZ77 compression. I was toying around with compression sizes of different strings, and I stumbled across something I could not explain. The string aaa when compressed, both through zlib and gzip, turns out to be the same size as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (36 as).
Before reading about this I would have assumed the compressor does something like storing 36*a instead of each character individually, but I could not find anywhere in the specifications where that is mentioned.
Using fixed Huffman code yielded the same result, so I assume the space-saving lies in LZ77, but that only uses distance-length pairs. How would that allow a 3 length string to expand by 12 times without increasing in size?
Interrupting the string of as with one or several bs in the middle drastically increases the size. If distance-length pairs are what's doing the job, why could it not just skip over the bs when searching backwards? Or are Huffman codes being utilized and I misunderstood what fixed Huffman codes imply?
The 36 "a"s are effectively run-length encoded by LZ77 by giving the first "a" as a literal, and then a match with a distance of one, and a length of 35. The length can be as much as 258 for deflate.
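The reason a match with distance 1 can have length 35 is that the copy is allowed to overlap the bytes it is producing. A minimal sketch of such a copy (lz77_copy is a made-up name, not zlib's API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append `length` bytes to `out`, each copied from `distance` bytes back.
// Copying byte by byte lets the source overlap the freshly written output.
void lz77_copy(std::string& out, std::size_t distance, std::size_t length) {
    assert(distance >= 1 && distance <= out.size());
    for (std::size_t i = 0; i < length; ++i) {
        out.push_back(out[out.size() - distance]);
    }
}
```

Starting from the single literal "a", calling lz77_copy(out, 1, 35) leaves out holding 36 "a"s: every copied byte immediately becomes the source for the next one.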
Look online for tutorials on LZ77, Huffman coding, and deflate. You can disassemble the resulting compressed data with infgen to get more insight into how the data is being represented.

DEFLATE method reasoning

Why does LZ77 DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?
LZW tries to take advantage of repeated strings, just like the first "stage" as you call it of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.
Mark Adler could best answer this question.
The details of how the LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length, distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
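A sketch of how a decoder might interpret a symbol from the merged literal/length alphabet, using the length table from RFC 1951 (section 3.2.5). The names SymbolKind, classify and decode_length are assumptions for illustration; reading the Huffman code and the extra bits from the stream is left out.

```cpp
#include <cassert>

// What a decoded symbol of the literal/length alphabet stands for.
enum class SymbolKind { Literal, EndOfBlock, Length };

// 0..255 are literal bytes, 256 ends the block, 257..285 are length codes.
SymbolKind classify(int symbol) {
    assert(symbol >= 0 && symbol <= 285);
    if (symbol < 256) return SymbolKind::Literal;
    if (symbol == 256) return SymbolKind::EndOfBlock;
    return SymbolKind::Length;
}

// Base match length for symbols 257..285.
constexpr int kLengthBase[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258};

// Number of extra bits that follow each length symbol.
constexpr int kLengthExtraBits[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0};

// Match length = base value + integer formed by the extra bits.
int decode_length(int symbol, int extra_value) {
    assert(classify(symbol) == SymbolKind::Length);
    const int idx = symbol - 257;
    assert(extra_value >= 0 && extra_value < (1 << kLengthExtraBits[idx]));
    return kLengthBase[idx] + extra_value;
}
```

For example, symbol 265 has base 11 and one extra bit, so it covers lengths 11 and 12; symbol 285 has no extra bits and always means 258. Distances work the same way with their own table.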
...
Read the whole deal here.
Long story short. LZ77 provides duplicate elimination. Huffman coding provides bit reduction. It's also on the wiki.

Binary file special characters

I'm implementing suffix array sorting, and the algorithm appends a sentinel character to the original string. This character must not occur in the original string.
Since the algorithm will process the bytes of binary files, is there any special byte value that I can be sure won't appear in any binary file?
If it exists, how do I represent this character in C++ code?
I'm on Linux; I'm not sure if it makes a difference.
No, there is not. Binary files can contain every combination of byte values. I wouldn't call them 'characters' though, because they are binary data, not (necessarily) representing characters. But whatever the name, they can have any value.
This is more like a question you should answer yourself. We do not know what binary data you have and what characters can be there and what cannot. If you are talking about generic binary data - there could be any combination of bits and bytes, and characters, so there is no such character.
From the other point of view, you are talking about strings. What kind of strings? ASCII strings? ASCII codes have very limited range, for example, so you can use 128, for example. Some old protocols use SOH (\1) for similar purposes. So there might be a way around if you know exactly what strings you are processing.
To the best of my knowledge, suffix array cannot be applied to arbitrary binary data (well, it can, but it won't make any sense).
A file contains only bits. Groups of bits can be interpreted as an ASCII character, a floating-point number, a photo in JPEG format, or anything else you can imagine. The interpretation depends on the coding scheme (such as ASCII or BCD) you choose. If your coding scheme doesn't fill the entire table of possible codes, you can pick an unused one for your special purposes (for example, decimal digits encoded naively in 4 bits leave 2^4 - 10 = 6 redundant codewords).
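One way to get an unused codeword for arbitrary binary input, which none of the answers above spell out, is to widen the symbol type: store each byte in an int, shifted up by one, so that 0 is free to act as the sentinel. The sketch below assumes the suffix array code can work on a vector of int rather than on char; with_sentinel is a made-up name.

```cpp
#include <cassert>
#include <vector>

// Map bytes 0..255 to symbols 1..256 and append 0 as the sentinel.
// The sentinel is then smaller than every real symbol and cannot occur
// in the data, whatever the file contains.
std::vector<int> with_sentinel(const std::vector<unsigned char>& bytes) {
    std::vector<int> symbols;
    symbols.reserve(bytes.size() + 1);
    for (unsigned char b : bytes) {
        symbols.push_back(static_cast<int>(b) + 1);
    }
    symbols.push_back(0);  // sentinel
    assert(symbols.size() == bytes.size() + 1);
    return symbols;
}
```

The cost is memory: each symbol now takes sizeof(int) bytes instead of one, so this is a trade-off rather than a free fix.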