How is a Huffman compression decoded? - compression

I'm struggling to understand how to decode, say, a text file that has been compressed using Huffman's method. Let's say I'm reading a text file, I get a list of all the characters and the frequency in which they occur, I create a Huffman tree and all the characters have a specific code assigned do them. Say,
a: 110
b: 11
c: 010
etc.
When I want to decompress this text file and print/read its contents, how do I do that? How do I know if the file reads "abc" or "bac"?
A small solution I made up was after the Huffman tree has been created, I read the file all over again and create an array to insert every character code as I read it.
Say, a while loop where I read a character until I've reached EOF.
character = a; insert 110 into array. Character = b; insert 11 into array until we are left with 11011010.
But I feel like there should be a better way.
EDIT: The codes for a,b, and c are random, not actual Huffman codes. I put in random ones as it's irrelevant for the question, I'm only interested in how it would be decoded with or without a real life example. But here's an example of Huffman code for "Hello World."
l: 11
o: 001
H: 100
e: 0101
spacebar: 0000
w: 0001
r: 101
d: 011
.: 0100

A Huffman code is a prefix code, which means that no code can be a prefix of any other code. Your example of a Huffman code is most definitely not a Huffman code. There you have 11 (c), which is a prefix of 110 (b). That cannot be the result of a correct implementation of Huffman's algorithm.
Update for question update:
You are incorrect. The codes are extremely relevant for the question. The examples you gave cannot be unambiguously decoded.
Second update of question:
It is still not clear what you're asking, but here is an answer to the question: "How do I decode a stream of bits that are a Huffman-coded sequence of symbols?"
Here is the tree for the example prefix code:
You see that if you follow any sequence of branches to a symbol, the branches you followed are the bits in that code. That is exactly how you decode the incoming stream of bits.
Start at the node at the top of the tree.
Read one bit from the stream.
Follow the branch for that value, left for 0, right for 1.
If you arrive at another node, go to step 2.
Otherwise, emit the symbol in the leaf, and go to step 1.

Related

How to save a Huffman table in a file In a way that use the least storage?

It's my first question in stack overflow. it's long but I have explained it in detail and I think it's understandable.
I'm writing huffman code by c++ and saved characters and codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table: (Made by huffman tree)
Table
Now, I want to save this table to a file in the best way.
I can't save like this: A1B001C010D001E000
When it change to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save table in normal way, every character use 8 bit for saving it's code.
While my characters have 1bit or 3bit code. (In this case.)
this way use much storage.
My idea is add a separator character and set a code for it.
If we add a separator character and make huffman tree and write codes, have a table like this.
table2
Now, we can write codes in this way.
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found so A = everything was befor sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found so B = everything was befor sep = 110;
And so on ...
This way still use a little storage for separator ( number of characters * separator size )
My question: Is there a way to save first table in a file and use less storage?
For example like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
You can store the lengths of the codes (as Mark said) as a 256 byte header at the start of your compressed data. Each byte stores the length of the code, and because you're working with bytes with 256 possible values, and the huffman tree can only be of a certain depth (number of possible values - 1) you only need 8 bits to store the codes.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, 0s for nodes and 1s for leaves. Then, after you store the nodes and the leaves, you store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
*
/ \
* A
/ \
* *
/ \ / \
E D C B
So you would store the shape of the tree as such:
000110111EDCBA
The reason why storing the huffman codes in this way is better for when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits (where n is the number of unique characters in the data you are trying to compress) while storing the code lengths costs a flat 2048 bits. Therefore, for numbers of unique characters less than 205, storing the shape of the tree is more efficient, and because the average English string of text isn't going to have all that many of the possible 256 possible ASCII characters, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.

Find all partial matches to vector of unsigned

For an AI project of mine, I need to apply to a factored state all rules that apply to its partial components. This needs to be done very frequently so I'm looking for a way to make this as fast as possible.
I'm going to describe my problem with strings, however the true problem works in the same way with vectors of unsigned integers.
I have a bunch of entries (of length N) like this which I need to store in some way:
__a_b
c_e__
___de
abcd_
fffff
__a__
My input is a single entry ciede to which I must find, as fast as possible, all stored entries which match to it. For example in this case the matches would be c_e__ and ___de. Removal and adding of entries should be supported, however I don't care how slow it is. What I would like to be as fast as possible is:
for ( const auto & entry : matchedEntries(input) )
My problem, as I said, is one where each letter is actually an unsigned integer, and the vector is of an unspecified (but known) length. I have no requirements for how entries should be stored, or what type of metadata is going to be associated with them. The naive algorithm of matching all is O(N), is it possible to do better? The number of reasonable entries I need stored is <=100k.
I'm thinking some kind of sorting might help, or some weird looking tree structure, but I can't seem to figure out a good way to approach this problem. It also looks like something word processers already need to do, so someone might be able to help.
The easiest solution is to build a trie containing your entries. When searching the trie, you start in the root and recursively follow an edge, that matches character from your input. There will be at most two of those edges in each node, one for the wildcard _ and one for the actual letter.
In the worst case you have to follow two edges from each node, which would add up to O(2^n) complexity, where n is the length of the input, while the space complexity is linear.
A different approach would be to preprocess the entries, to allow for linear search. This is basically what compiling regular expressions does. For your example, consider following regular expression, which matches your desired input:
(..a.b|c.e..|...de|abcd.|fffff|..a..)
This expression can be implemented as a nondeterministic finite state automaton, with initial state having ε-moves to a deterministic automaton for each of the single entries. This NFSA can then be turned to a deterministic FSA, using the standard powerset construction.
Although this construction can increase the number of states substantially, searching the input word can then be done in linear time, simply simulating the deterministic automaton.
Below is an example for entries ab, a_, ba, _a and __. First start with a nondeterministic automaton, which upon removing ε-moves and joining equivalent states is actually a trie for the set.
Then turn it into a deterministic machine, with states corresponding to subsets of states of the NFSA. Start in the state 0 and for each edge, other than _, create the next state as the union of the states in the original machine, that are reachable from any state in the current set.
For example, when DFSA is in state 16, that means the NFSA could be either in state 1 or 6. Upon transition on a, the NFSA could get to states 3 (from 1), 7 or 8 (from 6) - that will be your next state in the DFSA.
The standard construction would preserve the _-edges, but we can omit them, as long as the input does not contain _.
Now if you have a word ab on the input, you simulate this automaton (i.e. traverse its transition graph) and end up in state 238, from which you can easily recover the original entries.
Store the data in a tree, 1st layer represents 1st element (character or integer), and so on. This means the tree will have a constant depth of 5 (excluding the root) in your example. Don't care about wildcards ("_") at this point. Just store them like the other elements.
When searching for the matches, traverse the tree by doing a breadth first search and dynamically build up your result set. Whenever you encounter a wildcard, add another element to your result set for all other nodes of this layer that do not match. If no subnode matches, remove the entry from your result set.
You should also skip reduntant entries when building up the tree: In your example, __a_b is reduntant, because whenever it matches, __a__ also matches.
I've got an algorithm in mind which I plan to implement and benchmark, but I'll describe the approach already. It needs n_templates * template_length * n_symbols bits of storage (so for 100k templates of length 100 and 256 distinct symbols needs 2.56 Gb = 320 MB of RAM. This does not scale nicely to large number of symbols unless succinct data structure is used.
Query takes O(n_templates * template_length * n_symbols) time but should perform quite well thanks to bit-wise operations.
Let's say we have the given set of templates:
__a_b
c_e__
___de
abcd_
_ied_
bi__e
The set of symbols is abcdei, for each symbol we pre-calculate a bit mask indicating whether the template differs from the symbol at that location or not:
aaaaa bbbbb ccccc ddddd eeeee iiiii
....b ..a.. ..a.b ..a.b ..a.b ..a.b
c.e.. c.e.. ..e.. c.e.. c.... c.e..
...de ...de ...de ....e ...d. ...de
.bcd. a.cd. ab.d. abc.. abcd. abcd.
.ied. .ied. .ied. .ie.. .i.d. ..ed.
bi..e .i..e bi..e bi..e bi... b...e
Same tables expressed in binary:
aaaaa bbbbb ccccc ddddd eeeee iiiii
00001 00100 00101 00101 00101 00101
10100 10100 00100 10100 10000 10100
00011 00011 00011 00001 00010 00011
01110 10110 11010 11100 11110 11110
01110 01110 01110 01100 01010 00110
11001 01001 11001 11001 11000 10001
These are stored in columnar order, 64 templates / unsigned integer. To determine which templates match ciede we check the 1st column of c table, 2st column from i, 3rd from e and so forth:
ciede ciede
__a_b ..a.b 00101
c_e__ ..... 00000
___de ..... 00000
abcd_ abc.. 11100
_ied_ ..... 00000
bi__e b.... 10000
We find matching templates as rows of zeros, which indicates that no differences were found. We can check 64 templates at once, and the algorithm itself is very simple (python-like code):
for i_block in range(n_templates / 64):
mask = 0
for i in range(template_length):
# Accumulate difference-indicating bits
mask |= tables[i_block][word[i]][i]
if mask == 0xFFFFFFFF:
# All templates differ, we can stop early
break
for i in range(64):
if mask & (1 << i) == 0:
print('Match at template ' + (i_block * 64 + i))
As I said I haven't yet actually tried implementing this, so I have no clue how fast it is in practice.

JPEG Huffman Table

I have a question regarding the JPEG Huffman Table and using the Huffman Table to construct the symbol/binary string from a Tree. Suppose, that in an Huffman Table for 3-Bit code Length the number of codes is greater than 6, then how do we add all those codes in the Tree? If I am correct only 6 codes can be added at the 3-bit level/depth of the tree. So, how do we add the remaining codes if they won't fit in that level? Do we just ignore them?
Example
code length | Total Codes | Codes
3-Bit | 10 | 25 43 34 53 92 A2 B2 63 73 C2
In the above example if we go by order of constructing symbols/binary string for the code then up 'til A2 we can add codes in the tree at level 3-Bit, but what about B2,63,73,C2 etc? It's not possible to add them at 3-Bit level of the tree? So what do we do with them?
Well, clearly, the absolutely highest number of "things" that can be represented in 3 bits is 8 - (000, 001, 010, 011, 100, 101, 110, 111).
In Huffman encoding, bits represent "left" or "right" in a trie data-structure, to be able to "continue", you have to use SOME codes for "this continues another level", which is why not all 8 values can be encoded in 3 bits. If you have more values to encode, you need to use more bits (for some values - this is the whole point of Huffman coding, that SOME combinations are short, others are longer, and sometimes even longer than the original, but because it's based on what is the most common, it's fine, because they will be rare...)
How to construct and decode a Huffman tree is about four-five pages in your typical Algorithms book, and if you haven't got one of those, you probably want to find one - either a real paper one, or an e-book. There are LOTS of them - I'm not going to recommend one, since the ones I have are all about 15+ years old.
I should add that I think your question is missing something. Clearly, 3 bits can not possibly represent 10 values. And you can't build a [meaningful] Huffman tree on 10 values that all different - unless the idea is to split the values into pairs of {2,5}, {4,3}, {3,4}, {5,3}, {9,2}, {A,2}, {B,2}, {6,3}, {7,3}, {C,2} - which gives a fair number of repeated values - frequency of those are:
2 : 5
3 : 5
4 : 2
5 : 2
6 : 1
7 : 1
9 : 1
A : 1
B : 1
C : 1
But that's stil too many to represent anything meaningful...
Or is it the other way around, that we are supposed to use the bit values of those to decode? In which case we'd need the tree built from the original data to decode it...
In JPEG, a Huffman code can be up to 16-bits. The DHT market contains an array of 16 elements giving the number of codes for each length.
The JPEG standard explains how to use the code counts to do the Huffman translation. It is one of the few things explained in detail.
This book explains how it is done from a programmers perspective.
JPEG Book
The number of codes that exists at any code length depends upon the counts for other lengths.
I am wondering if you are really looking at the count of codes for length 4 rather than 3.
It looks like you're not following the correct procedure when creating your Huffman codes from the JPEG table. The count provided will fit in the number of bits unless the table has been corrupted. The reading out of the codes from a DHT marker is really simple. The more complicated part is how you define your lookup table from that data. A logical (but not practical) way is to create a reverse lookup table that's the maximum code length in size (16-bits = 65536 entries in the table). Then to decode your JPEG data, just pick up 16-bits of compressed data from the input stream and use it as an index in the table where you'll have the symbol and actual length of the code. I came up with a way to use a single, much smaller lookup table. I'm not going to share my specific code table method. What I will share is the basic format of the loop to create the codes from a DHT marker:
int iCurrentCode; // the current Huffman code
int iLength; // the code length in bits that you're working on
int i;
int iCount; // the number of codes defined for this length
int iSymbol; // JPEG symbol defined for each Huffman code
unsigned char *pData; // pointer to the data in the DHT marker
iCurrentCode = 0; // start with a Huffman code of 0
for (iLength = 1; iLength <= 16; iLength++)
{
iCount = *pData++; // get number of symbols for this bit length
for (i=0; i<iCount; i++) // read each of the codes for this bit length
{
iSymbol = *pData++; // get the JPEG symbol value (e.g. RRRR/SSSS value)
// It's up to you to create a lookup table from the code and its value
iCurrentCode++; // the Huffman bit pattern just increments for each code value
} // for each code defined at this bit length
iCurrentCode <<= 1; // shift the code left 1 bit to advance to the next bit length
} // for each bit length

Outputting Huffman codes to file

I have a program that reads a file and saves the frequency of each character. It then constructs a huffman tree based on each character's frequency and then outputs to a file the huffman codes for the tree.
So an input like "Hello World" would output this sequence to a file:
01010101 0010 010 010 01010 0101010 000 01010 00101 010 0001
This makes sense because the most frequent characters have the shortest codes. The issue is, this increases the file size ten-fold. I realized the reason why is because each 1 and 0 is being represented in memory as its own character, so they get each get expanded out to a byte of data.
I was thinking what I could do is convert each code (E.G. "010") to a character and save that to file - but that still would pad the code to be a byte long (Or mess it up if the code is longer than a byte).
How do I go about this? I can give code snippets if needed - I'm basically saving each code into a string so that's why the file's coming out so big (It's outputting each "bit" as a byte). If I were to convert the code to a long for example, then a code like 00010 would be represented as 2 and a code like 010 would also be represented as 2.
You basically have to do it a byte (or a word) at a time. Maintain a byte which you fill with bits, and a record of how many bits have been filled in so far. When you get to 8, write the byte and start over with an empty one.

Lempel-Ziv 76 complexity

Can someone explain to me Lempel-Ziv 76 complexity? I was under the impression that you initialize with the first letter of the string in your dictionary, and then check subsequent blocks for existence in the previous substring, growing one letter each time a substring is found. If no substring exists in the previous substring, that substring is called a block and the next letter becomes the next substring to be searched.
For example,
01011010001101110010
0|1
since 1 is not in 0, we get 0|1|0
since 0 is in 01, we get 0|1|01
since 01 is in 01, we get 0|1|011|0
since 0 is in 01011, we get 0|1|011|01
since 01 is in 01011, we get 0|1|011|010
since 010 is in 01011, we get 0|1|011|0100|0
and so on until, we get 0|1|011|0100|011011|1001|0,
where the last letter can be a repeat if necessary.
What am I doing wrong? Because I am being told that for a string 1111111, the decomposition is 1|111111. Thanks!
I agree with your decomposition:
01011010001101110010 -> 0.1.011.0100.011011.1001.0
I also believe that what you were told is correct:
1111111 -> 1.111111
However, how you arrived at your original decomposition is not quite right! Hence the confusion about decomposing 1111111. I think, according to your reasoning:
1111111 -> 1.11.1111
which I'm pretty sure is not correct.
Extensions to the existing sequence of words (blocks) is not as simple as checking to see if the extension previously appears as a substring of the previous history. It boils down to the concept of reproducibility of an extension that Lempel and Ziv describe in their paper On the Complexity of Finite Sequences (I'm assuming that's what you're working from!). An extension is formed such that it is the shortest word that is not reproducible from the previous history. The concept of reproducibility of an extension that they describe is rather complicated, but it boils down to being able to find a starting position in the previous history, from which you can sequentially copy symbols from that starting position, onto the end of the history, to form the extension.
From your original sequence, assume the symbols have positions from 1 to 20. The first symbol is always a word/block by itself:
0.
The next extension is not reproducible from the previous history:
0.1.
The next extension is:
0.1.011.
The reason why it can't be 0 or 01 is as follows: 0 is reproducible from the previous history, by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 011 is not reproducible.
The next extension is:
0.1.011.0100.
0 is reproducible by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 010 is reproducible by copying three symbols from position 1 to the end; 0100 is not reproducible.
And so on.
The decomposition of 1111111 is as follows: the first symbol is a block by itself:
1.
The next extension is:
1.111111
1 is reproducible by copying one symbol from position 1 to the end of the previous history. 11 is reproducible by copying two symbols from position 1 to the end. This is where it gets complicated - in this case, when you copy two symbols, the source of the copy actually extends past the end of the previous history! In other words, the second 1 that you copy is actually part of the extension that results from copying the first 1 onto the end. Similarly, 111, 1111, 11111, 111111 are all reproducible, due to this recursive copying process. However, since we have now reached the end of the original sequence, the last extension is deemed to be 111111.
Hopefully my explanation has made some sense.
This paper does not agree with your description of the algorithm. According to the paper, you have a new partition if it doesn't match any previous partition. You don't get to make partitions based on the entire unpartitioned preceding sequence, as you have done. So for your examples (I am using . instead of | to partition, since that's easier to read):
01011010001101110010
partitions into:
    0.1.01.10.100.011.0111.00.10
so the LZ76 weight is 9 (not 7).
1111111
partitions into:
1.11.111.1
They both provide an example of the case where the final partition is contained in a previous one. Hence the < r instead of <= r in the definition.
I don't have the original paper, so I can't check whether this paper got it wrong or not. However I doubt that they incorrectly copied the definition from the original paper.