DEFLATE method reasoning

Why does LZ77-based DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is it about the output of LZ77 that makes it more suitable for Huffman compression than for LZW or some other method entirely?

LZW tries to take advantage of repeated strings, just like the first "stage" (as you call it) of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.

Mark Adler could best answer this question.
The details of how LZ77 and Huffman coding work together need some closer examination. Once the raw data has been turned into a string of literal characters and special length-distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
A length code or distance code may actually be a code that represents a base value, followed by extra bits that form an integer to be added to that base value.
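As a sketch of how that base-plus-extra-bits idea works, here is the mapping for deflate's length symbols. The base values and extra-bit counts are the actual entries from RFC 1951, section 3.2.5; everything around them (the function name, the test) is just illustrative:

    #include <cassert>

    // Deflate length symbols 257..285 encode a base length plus a count
    // of extra bits; the actual match length is the base value plus the
    // integer formed by the extra bits read after the Huffman code.
    static const int kLenBase[29] = {
        3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
        35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258};
    static const int kLenExtra[29] = {
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
        3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0};

    // symbol is the decoded Huffman symbol (257..285); extra is the
    // integer read from the following kLenExtra[symbol - 257] bits.
    int match_length(int symbol, int extra) {
        return kLenBase[symbol - 257] + extra;
    }

    int main() {
        assert(match_length(257, 0) == 3);    // shortest match: length 3
        assert(match_length(266, 1) == 14);   // base 13 + one extra bit
        assert(match_length(285, 0) == 258);  // longest match deflate allows
    }

Distance codes work the same way, with 30 symbols covering distances 1 through 32768.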
...
Read the whole deal here.
Long story short: LZ77 provides duplicate elimination, and Huffman coding provides bit reduction. It's also on the wiki.

Is there a way to restrict string manipulation, e.g. substring?

The problem is that I'm processing some UTF-8 strings, and I would like to design a class or some other way to prevent string manipulation.
String manipulation is not desirable for strings of multibyte characters, as splitting the string at a random position (which is measured in bytes) may split a character halfway.
I have thought about using const std::string&, but the user/developer can still create a substring by calling std::string::substr.
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
Is this even possible?
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide whatever interface you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper whose external interface provides access either to code points or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points and give you the next 4 code points. Alternatively, it would skip the first 3 characters and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
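A minimal sketch of such a wrapper, operating at the code-point level only (the class name and interface here are hypothetical, not from any standard library):

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Owns its UTF-8 storage and only hands out whole code-point ranges,
    // so callers can never split a multibyte sequence.
    class Utf8String {
        std::string bytes_;

        // Advance i past one code point: skip the lead byte, then any
        // continuation bytes, which always look like 10xxxxxx.
        std::size_t next(std::size_t i) const {
            ++i;
            while (i < bytes_.size() &&
                   (static_cast<unsigned char>(bytes_[i]) & 0xC0) == 0x80)
                ++i;
            return i;
        }

    public:
        explicit Utf8String(std::string utf8) : bytes_(std::move(utf8)) {}

        // Length in code points, not bytes.
        std::size_t length() const {
            std::size_t n = 0;
            for (std::size_t i = 0; i < bytes_.size(); i = next(i)) ++n;
            return n;
        }

        // Substring measured in code points: skip pos, take count.
        Utf8String substr(std::size_t pos, std::size_t count) const {
            std::size_t begin = 0;
            for (std::size_t k = 0; k < pos; ++k) {
                if (begin >= bytes_.size()) throw std::out_of_range("substr");
                begin = next(begin);
            }
            std::size_t end = begin;
            for (std::size_t k = 0; k < count && end < bytes_.size(); ++k)
                end = next(end);
            return Utf8String(bytes_.substr(begin, end - begin));
        }

        const std::string& str() const { return bytes_; }  // read-only view
    };

Note that this still slices between code points, not characters: a combining mark can be separated from its base letter. Slicing on character (grapheme cluster) boundaries is the part that needs a real library.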
A quick aside for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20-1. UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a is encoded in a single byte and the U+0300 is encoded in two bytes. So, it's one character composed of two code points encoded in three bytes.
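You can verify those byte counts directly; this sketch prints the UTF-8 bytes of a decomposed à:

    #include <cstdio>
    #include <string>

    int main() {
        // "a" followed by U+0300 COMBINING GRAVE ACCENT, written as
        // explicit UTF-8 bytes so the source file encoding doesn't matter.
        std::string s = "a\xCC\x80";
        for (unsigned char c : s) std::printf("%02X ", c);
        std::printf("(%zu bytes, 2 code points, 1 character)\n", s.size());
    }

This prints 61 CC 80: one byte for the a, two for the combining accent.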
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this, but at least classically it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library, but in this case, I'm not aware of such a thing existing.
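To give a flavor of what "let ICU do the real work" looks like, here is a minimal sketch of the ß/ss comparison using ICU's collator. It assumes ICU is installed, and my understanding is that German collation treats the two as equal at secondary strength (the exact strength needed may vary):

    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <cstdio>
    #include <memory>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        // A locale-aware collator; the German rules know about ß.
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale::getGerman(), status));
        if (U_FAILURE(status)) return 1;

        // At secondary strength, "ß" and "ss" should compare equal;
        // at the default (tertiary) strength they differ.
        coll->setStrength(icu::Collator::SECONDARY);
        UCollationResult r = coll->compare(
            icu::UnicodeString::fromUTF8("\xC3\x9F"),   // "ß" as UTF-8
            icu::UnicodeString::fromUTF8("ss"), status);
        std::printf("%s\n", r == UCOL_EQUAL ? "equal" : "different");
        return 0;
    }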

LZ77: storing format

I started to write a little program that compresses a single file using the LZ77 compression algorithm. It works fine. Now I'm thinking about how to store the data. In LZ77, compressed data consists of a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What would be a good way to store these triplets? I thought about <11, 5, 8> bits, which gives:
2048 positions to look backward
a maximum matched-string length of 32
1 byte for the next character.
This format works quite well for text compression, but it sucks for my purpose (video made of binary images); it even increases the size compared to the original file. Do you have any suggestions?
What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, a length, and a literal byte. If it is a zero, it is followed by only a literal byte.
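A sketch of that flag-bit scheme, keeping the question's <11, 5, 8> field widths (the BitWriter here is hypothetical, just enough to show the layout):

    #include <cstdint>
    #include <vector>

    // Packs bits MSB-first into bytes.
    struct BitWriter {
        std::vector<uint8_t> out;
        uint32_t acc = 0;  // bit accumulator
        int nbits = 0;     // bits currently held in acc

        void put(uint32_t value, int width) {
            for (int i = width - 1; i >= 0; --i) {
                acc = (acc << 1) | ((value >> i) & 1);
                if (++nbits == 8) { out.push_back(uint8_t(acc)); acc = 0; nbits = 0; }
            }
        }
        void flush() { if (nbits) put(0, 8 - nbits); }  // pad the last byte
    };

    // A lone literal now costs 9 bits instead of a full 24-bit triplet.
    void emit_literal(BitWriter& w, uint8_t byte) {
        w.put(0, 1);      // flag 0: no match
        w.put(byte, 8);   // the literal itself
    }

    void emit_match(BitWriter& w, uint32_t dist, uint32_t len, uint8_t next) {
        w.put(1, 1);      // flag 1: match follows
        w.put(dist, 11);  // distance, up to 2047 back
        w.put(len, 5);    // length, up to 31
        w.put(next, 8);   // trailing literal, as in the triplet format
    }

If most of your tokens are zero-length matches, this alone cuts them from 24 bits to 9.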
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literals could be combined into a single code, as deflate does, to remove even that one bit.

How does DEFLATE optimize this so much?

I am trying to understand the deflate algorithm, and I have read up on Huffman codes as well as LZ77 compression. I was toying around with compression sizes of different strings, and I stumbled across something I could not explain. The string aaa, when compressed through both zlib and gzip, turns out to be the same size as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (36 a's).
Before reading about this I would have assumed the compressor does something like storing 36*a instead of each character individually, but I could not find anywhere in the specifications where that is mentioned.
Using a fixed Huffman code yielded the same result, so I assume the space saving lies in LZ77, but that only uses distance-length pairs. How would that allow a string of length 3 to grow twelvefold without increasing in compressed size?
Interrupting the string of a's with one or several b's in the middle drastically increases the size. If distance-length pairs are what's doing the job, why could it not just skip over the b's when searching backwards? Or are Huffman codes being used, and did I misunderstand what fixed Huffman codes imply?
The 36 "a"s are effectively run-length encoded by LZ77 by giving the first "a" as a literal, and then a match with a distance of one, and a length of 35. The length can be as much as 258 for deflate.
Look online for tutorials on LZ77, Huffman coding, and deflate. You can disassemble the resulting compressed data with infgen to get more insight into how the data is being represented.

C++ and RLE for sequences of symbols

I'm having difficulty figuring out how to use RLE on sequences of symbols.
For example, I can do RLE encoding on strings like
"ASSSAAAEERRRRRRRR"
which will be transformed to:
"A1S3A3E2R8".
But I'd like to perform RLE on strings like
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
which will be transformed to:
"X3Y5(1ADEFC)1(EDCADD)1(1ADEFC)3"
Is there a way to achieve this? The job becomes a bit easier because the long strings always appear in brackets. Could you give me advice on how to do this in C++?
If there is a better way to store the values than using brackets, it would also be great if you could recommend one.
You should break down this problem into smaller parts. First, you should have a function that tokenizes your stream and returns each individual part. For this example input stream:
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
this function will return the following elements, one per call:
X
X
X
Y
Y
Y
Y
Y
(1ADEFC)
(EDCADD)
(1ADEFC)
(1ADEFC)
(1ADEFC)
<eof>
If you get this function correctly implemented, then the RLE algorithm that you already implemented for single characters should be easily adapted to support longer strings.
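A minimal sketch of both pieces (it assumes well-formed input where every opening bracket is closed):

    #include <iostream>
    #include <string>
    #include <vector>

    // Split the stream into tokens: either a single character, or a
    // whole "(...)" group taken as one unit.
    std::vector<std::string> tokenize(const std::string& in) {
        std::vector<std::string> tokens;
        for (std::size_t i = 0; i < in.size(); ) {
            if (in[i] == '(') {
                std::size_t close = in.find(')', i);  // assumed to exist
                tokens.push_back(in.substr(i, close - i + 1));
                i = close + 1;
            } else {
                tokens.push_back(in.substr(i, 1));
                ++i;
            }
        }
        return tokens;
    }

    // Classic RLE, but counting runs of tokens instead of characters.
    std::string rle(const std::vector<std::string>& tokens) {
        std::string out;
        for (std::size_t i = 0; i < tokens.size(); ) {
            std::size_t run = 1;
            while (i + run < tokens.size() && tokens[i + run] == tokens[i])
                ++run;
            out += tokens[i] + std::to_string(run);
            i += run;
        }
        return out;
    }

    int main() {
        std::string s = "XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)";
        std::cout << rle(tokenize(s)) << '\n';
        // prints: X3Y5(1ADEFC)1(EDCADD)1(1ADEFC)3
    }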
Since you mention that your intention is to RLE-encode the data and then use gzip compression to achieve a better ratio, my answer is: don't bother encoding it first. gzip compression uses DEFLATE, which is a generalization of run-length encoding that can take advantage of runs of strings of characters. You won't get better compression by applying the same algorithm twice, and in fact you may even lose a bit of compression.
If you insist on performing your own RLE, then it may be better to store the token length instead of using parentheses. That is, instead of (1ADEFC)3, use 61ADEFC3. Also note that you intend to compress pixels, which use the full range of byte values. Keep that in mind, as an algorithm written to work with strings would not be appropriate for raw data with embedded nulls and non-printable characters all around.

Binary file special characters

I'm coding suffix array sorting, and the algorithm appends a sentinel character to the original string. This character must not occur in the original string.
Since this algorithm will process the bytes of binary files, is there any special byte value that I can be sure won't appear in any binary file?
If it exists, how do I represent this character in C++?
I'm on Linux; I'm not sure if that makes a difference.
No, there is not. Binary files can contain every combination of byte values. I wouldn't call them 'characters' though, because they are binary data, not (necessarily) representing characters. But whatever the name, they can have any value.
This is really a question you should answer yourself. We do not know what binary data you have, or which byte values can and cannot occur in it. If you are talking about generic binary data, any combination of bits and bytes could occur, so there is no such character.
From the other point of view, you are talking about strings. What kind of strings? ASCII strings? ASCII codes have a very limited range, so you could use 128, for example. Some old protocols use SOH (\1) for similar purposes. So there might be a way around it if you know exactly what strings you are processing.
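For instance, if you know the input is 7-bit ASCII, any byte with the high bit set can serve as the sentinel. A sketch that checks the assumption instead of trusting it (the function name is just illustrative):

    #include <optional>
    #include <string>

    // Append a sentinel that cannot occur in 7-bit ASCII input.
    // Returns nothing if the data turns out not to be pure ASCII.
    std::optional<std::string> with_sentinel(std::string data) {
        for (unsigned char c : data)
            if (c >= 0x80) return std::nullopt;  // assumption violated
        data.push_back('\x80');                  // 128, outside ASCII
        return data;
    }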
To the best of my knowledge, a suffix array cannot be applied to arbitrary binary data (well, it can, but it won't make much sense).
A file contains only bits. Groups of bits can be interpreted as an ASCII character, a floating-point number, a photo in JPEG format, anything you can imagine. The interpretation is based on a coding scheme (such as ASCII or BCD) that you choose. If your coding scheme doesn't fill the entire table of possible codes, you can pick an unused code for your special purposes (for example, digits could be encoded naively in 4 bits, 2^4 = 16, so you would have 6 redundant codewords).