Find all partial matches to vector of unsigned - c++

For an AI project of mine, I need to apply to a factored state all rules that apply to its partial components. This needs to be done very frequently so I'm looking for a way to make this as fast as possible.
I'm going to describe my problem with strings, however the true problem works in the same way with vectors of unsigned integers.
I have a bunch of entries (of length N) like this which I need to store in some way:
__a_b
c_e__
___de
abcd_
fffff
__a__
My input is a single entry, ciede, for which I must find, as fast as possible, all stored entries that match it. For example, in this case the matches would be c_e__ and ___de. Adding and removing entries should be supported, but I don't care how slow those operations are. What I would like to be as fast as possible is:
for ( const auto & entry : matchedEntries(input) )
My problem, as I said, is one where each letter is actually an unsigned integer, and the vector is of an unspecified (but known) length. I have no requirements for how entries should be stored, or what type of metadata is going to be associated with them. The naive algorithm of checking the input against every stored entry is linear in the number of entries; is it possible to do better? The number of entries I reasonably need stored is <= 100k.
I'm thinking some kind of sorting might help, or some weird-looking tree structure, but I can't seem to figure out a good way to approach this problem. It also looks like something word processors already need to do, so someone might be able to help.

The easiest solution is to build a trie containing your entries. When searching the trie, you start at the root and recursively follow every edge that matches the current character of your input. There will be at most two such edges in each node: one for the wildcard _ and one for the actual letter.
In the worst case you have to follow two edges from each node, which would add up to O(2^n) complexity, where n is the length of the input, while the space complexity is linear.
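A minimal sketch of that search in Python (illustrative names; strings stand in for the integer vectors, and each node follows at most the literal edge and the _ edge):

class TrieNode:
    def __init__(self):
        self.children = {}   # symbol -> TrieNode; the wildcard is keyed as '_'
        self.entries = []    # entries that end at this node

def insert(root, entry):
    node = root
    for symbol in entry:
        node = node.children.setdefault(symbol, TrieNode())
    node.entries.append(entry)

def matched_entries(root, word):
    results = []
    def walk(node, i):
        if i == len(word):
            results.extend(node.entries)
            return
        # at most two edges to follow: the literal symbol and the wildcard
        for key in (word[i], '_'):
            child = node.children.get(key)
            if child is not None:
                walk(child, i + 1)
    walk(root, 0)
    return results

root = TrieNode()
for e in ["__a_b", "c_e__", "___de", "abcd_", "fffff", "__a__"]:
    insert(root, e)
print(matched_entries(root, "ciede"))   # ['c_e__', '___de']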
A different approach would be to preprocess the entries to allow for a linear search. This is basically what compiling regular expressions does. For your example, consider the following regular expression, which matches your desired input:
(..a.b|c.e..|...de|abcd.|fffff|..a..)
This expression can be implemented as a nondeterministic finite state automaton, with initial state having ε-moves to a deterministic automaton for each of the single entries. This NFSA can then be turned to a deterministic FSA, using the standard powerset construction.
Although this construction can increase the number of states substantially, searching the input word can then be done in linear time, simply simulating the deterministic automaton.
Below is an example for entries ab, a_, ba, _a and __. First start with a nondeterministic automaton, which upon removing ε-moves and joining equivalent states is actually a trie for the set.
Then turn it into a deterministic machine, with states corresponding to subsets of states of the NFSA. Start in the state 0 and for each edge, other than _, create the next state as the union of the states in the original machine, that are reachable from any state in the current set.
For example, when DFSA is in state 16, that means the NFSA could be either in state 1 or 6. Upon transition on a, the NFSA could get to states 3 (from 1), 7 or 8 (from 6) - that will be your next state in the DFSA.
The standard construction would preserve the _-edges, but we can omit them, as long as the input does not contain _.
Now if you have a word ab on the input, you simulate this automaton (i.e. traverse its transition graph) and end up in state 238, from which you can easily recover the original entries.
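Since the automaton figures from the original answer are not reproduced here, a small sketch may help: rather than building the DFA up front, you can simulate the set of NFA states directly, which is the same powerset construction performed lazily (illustrative code; it assumes entries and input have equal length):

def nfa_matches(entries, word):
    # NFA states are (entry, position) pairs; both the literal symbol
    # and the wildcard '_' advance the position by one
    states = {(e, 0) for e in entries}      # epsilon-moves into every entry
    for symbol in word:
        states = {(e, i + 1) for (e, i) in states
                  if e[i] in (symbol, '_')}
    return sorted(e for (e, i) in states)   # survivors matched every symbol

print(nfa_matches(["ab", "a_", "ba", "_a", "__"], "ab"))
# ['__', 'a_', 'ab'], i.e. the subset of NFA states a DFA state stands for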

Store the data in a tree, 1st layer represents 1st element (character or integer), and so on. This means the tree will have a constant depth of 5 (excluding the root) in your example. Don't care about wildcards ("_") at this point. Just store them like the other elements.
When searching for the matches, traverse the tree by doing a breadth first search and dynamically build up your result set. Whenever you encounter a wildcard, add another element to your result set for all other nodes of this layer that do not match. If no subnode matches, remove the entry from your result set.
You should also skip redundant entries when building up the tree: in your example, __a_b is redundant, because whenever it matches, __a__ matches too.

I've got an algorithm in mind which I plan to implement and benchmark, but I'll describe the approach already. It needs n_templates * template_length * n_symbols bits of storage (so for 100k templates of length 100 and 256 distinct symbols it needs 2.56 gigabits = 320 MB of RAM). This does not scale nicely to a large number of symbols unless a succinct data structure is used.
Query takes O(n_templates * template_length * n_symbols) time but should perform quite well thanks to bit-wise operations.
Let's say we have the given set of templates:
__a_b
c_e__
___de
abcd_
_ied_
bi__e
The set of symbols is abcdei, for each symbol we pre-calculate a bit mask indicating whether the template differs from the symbol at that location or not:
aaaaa bbbbb ccccc ddddd eeeee iiiii
....b ..a.. ..a.b ..a.b ..a.b ..a.b
c.e.. c.e.. ..e.. c.e.. c.... c.e..
...de ...de ...de ....e ...d. ...de
.bcd. a.cd. ab.d. abc.. abcd. abcd.
.ied. .ied. .ied. .ie.. .i.d. ..ed.
bi..e .i..e bi..e bi..e bi... b...e
Same tables expressed in binary:
aaaaa bbbbb ccccc ddddd eeeee iiiii
00001 00100 00101 00101 00101 00101
10100 10100 00100 10100 10000 10100
00011 00011 00011 00001 00010 00011
01110 10110 11010 11100 11110 11110
01110 01110 01110 01100 01010 00110
11001 01001 11001 11001 11000 10001
These are stored in columnar order, 64 templates per unsigned integer. To determine which templates match ciede, we check the 1st column of the c table, the 2nd column of the i table, the 3rd of the e table, and so forth:
ciede ciede
__a_b ..a.b 00101
c_e__ ..... 00000
___de ..... 00000
abcd_ abc.. 11100
_ied_ ..... 00000
bi__e b.... 10000
We find matching templates as rows of zeros, which indicates that no differences were found. We can check 64 templates at once, and the algorithm itself is very simple (python-like code):
for i_block in range(n_templates // 64):
    mask = 0
    for i in range(template_length):
        # Accumulate difference-indicating bits
        mask |= tables[i_block][word[i]][i]
        if mask == 0xFFFFFFFFFFFFFFFF:
            # All 64 templates in this block differ, we can stop early
            break
    for i in range(64):
        if mask & (1 << i) == 0:
            print('Match at template ' + str(i_block * 64 + i))
As I said I haven't yet actually tried implementing this, so I have no clue how fast it is in practice.


How does this GolfScript code print 1000 digits of pi?

How does this code work?
;''
6666,-2%{2+.2/@*\/10.3??2*+}*
`1000<~\;
It seems to use an array @* and a cycle {/**/}, but what is 6666? What is \/?
The first three characters, ;'', are unneeded for the program to function. They simply discard all input and replace it with an empty string, in case the interpreter insists on having input.
6666, pushes an array 6666 elements long, containing the numbers 0-6665.
-2% takes every second element of the array, in reverse order. You now have an array that is 3333 elements long, and it goes [6665 6663 6661 … 5 3 1]
{foo}* is a folding block call. For every element, do the following to the combination of elements. For example, 5,{+}* would add together the numbers 0-4.
So, let's see what we're doing in this folding block call.
2+ add two to the element.
. duplicate the element.
2/ halve it. Your sub-stack looks like this: (n+2),((n+2)/2)
@ pulls the third element to the top.
This is the first operation we cannot perform yet, since our sub-stack is only two tall. We'll get back to this later.
*\/ will be skipped for now, we'll get back to it once we discuss folding more.
10.3?? Duplicate 10, then push a 3. [10 10 3]. ? is exponentiation, so we have [10 1000], then again gives us a 1 with 1000 zeroes afterwards.
2* Multiply it by two. So now we have a 2 with 1000 zeroes after.
+ Adds the rest of our math to 2e(1e3)
So, let's go back to that pesky @.
@*\/ will grab the third element (our running total) and bring it to the top, then multiply it by the next element down ((n+2)/2), and then divide the product by (n+2).
This is an expansion of the Leibniz Series.
`1000< turns the int into a string, then takes the first 1000 characters of it.
~ dumps the string into a number again.
\; deletes the rest of the stack.
To answer your specific questions;
6666 was chosen, since half is 3333 (length of array), and we want more than pi times the number of digits of accuracy we want. We could make it smaller if we wanted, but 6666 is a cute number to use.
\/ is the "inverse division" pair. Take a, take b, then calculate b/a. This is because the \ changes the order of the top two elements on the stack, and / divides them.
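Putting the pieces together, here is one reading of the whole computation transcribed into Python. The fold seeds the accumulator with the first element (6665) and, for each remaining odd n, computes acc*((n+2)/2)/(n+2) + 2*10^1000, which is a Horner-style evaluation of pi = 2 + (1/3)(2 + (2/5)(2 + (3/7)(2 + ...))) in fixed-point integer arithmetic:

E = 10 ** 1000                      # fixed-point scale: 1000 digits
nums = list(range(6665, 0, -2))     # 6666,-2%  gives  [6665 6663 ... 3 1]
acc = nums[0]                       # the fold starts from the first element
for n in nums[1:]:
    acc = acc * ((n + 2) // 2) // (n + 2) + 2 * E
print(str(acc)[:1000])              # `1000<  : keep the first 1000 digits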

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing the text from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about a quarter to half a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't check for impossible combinations of letters, because that string would be thrown out on account of the 'tm' in between 'that' and 'must'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Your searches for individual words will then proceed at a faster pace, because the "false starts" will be eliminated.
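A sketch of that idea in Python (textbook KMP; the names and the tiny dictionary are just for illustration):

def kmp_table(word):
    # table[i] = length of the longest proper prefix of word[:i+1]
    # that is also a suffix of it (the classic failure function)
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, word, table):
    # scan text once; the table lets us shift without re-reading characters
    k = 0
    for ch in text:
        while k > 0 and ch != word[k]:
            k = table[k - 1]
        if ch == word[k]:
            k += 1
        if k == len(word):
            return True
    return False

dictionary = ["this", "string", "checked"]
tables = {w: kmp_table(w) for w in dictionary}   # built once, reused
text = "hithisisastringthatmustbechecked"
print([w for w in dictionary if kmp_find(text, w, tables[w])])
# ['this', 'string', 'checked']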
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring, i.e. the one starting at the letter's first occurrence.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
Idea 2
Expand upon that idea by noting the longest word in the dictionary and getting rid of the letters in those stored strings that are farther away than that length.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
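A short sketch combining Ideas 1 and 2 (illustrative: instead of storing trimmed copies, it stores, per letter, the positions where that letter occurs, which yields the same two savings):

from collections import defaultdict

text = "hithisisastringthatmustbechecked"
dictionary = ["this", "string", "checked", "zebra"]

# Idea 1: index the text by first letter; a word starting with 'z' does
# no work at all if 'z' never occurs in the text
starts = defaultdict(list)
for i, ch in enumerate(text):
    starts[ch].append(i)

# Idea 2: each candidate comparison looks at a window no longer than the
# word itself, never at the rest of the text
found = [w for w in dictionary
         if any(text[i:i + len(w)] == w for i in starts[w[0]])]
print(found)   # ['this', 'string', 'checked']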
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|    |
|    d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y?))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives, and if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code looking at the binary tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column, and each arrow as turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
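For completeness, here is Idea 3 sketched with a nested-dict trie instead of generated switch statements ('*' plays the role of the asterisk in the diagram; names are illustrative):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['*'] = True            # the star: a complete word ends here
    return root

def first_match(text, col, root):
    # walk the trie along the text; returns the shortest word starting
    # at column col, or None if we run out of alternatives
    node = root
    for i in range(col, len(text)):
        if '*' in node:
            return text[col:i]
        if text[i] not in node:
            return None
        node = node[text[i]]
    return text[col:] if '*' in node else None

trie = build_trie(["arun", "bob", "bill", "billy", "body", "jose"])
text = "xxbillyxx"
for col in range(len(text)):
    m = first_match(text, col, trie)
    if m:
        print(col, m)               # prints: 2 bill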
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
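A toy Bloom filter in Python to make the idea concrete (a sketch only; a real one would size the bit array and the number of hashes from the desired false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, n_hashes=7):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive n_hashes bit positions from one digest (double hashing)
        d = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:], 'big') | 1
        return [(h1 + i * h2) % self.size for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for word in ["this", "string", "checked"]:   # the dictionary, added once
    bf.add(word)
print("this" in bf)    # True
print("xyzzy" in bf)   # False: "definitely not in set", no false negatives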
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum
Using the same dictionary as that answer, testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.
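A minimal character-bigram version of that idea (illustrative only; a real model would train on several full books rather than one paragraph):

import math, string
from collections import defaultdict

def train(text):
    # character-bigram counts over a-z, turned into smoothed log-probabilities
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    alphabet = string.ascii_lowercase
    logp = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)   # add-one smoothing
        for b in alphabet:
            logp[a, b] = math.log((counts[a][b] + 1) / total)
    return logp

def score(s, logp):
    # average log-probability per transition; higher = more English-like
    pairs = list(zip(s, s[1:]))
    return sum(logp[a, b] for a, b in pairs) / len(pairs)

alice = ("alicewasbeginningtogetverytiredofsittingbyhersisteronthebank"
         "andofhavingnothingtodoonceortwiceshehadpeepedintothebookher"
         "sisterwasreadingbutithadnopicturesorconversationsinitandwhat"
         "istheuseofabookthoughtalicewithoutpicturesorconversation")
model = train(alice)
print(score("hithisisastring", model))   # higher (more English-like)
print(score("qzxkvjqzxkvjqzx", model))   # lower (random letters)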

Find a repeating symmetric bit pattern in a small stream of 128 bits

How can I quickly scan groups of 128 bits that are exact repetitions of a binary pattern, such as 010101... or 0011001100...?
I have a number of 128-bit blocks, and wish to see if they match the patterns where the number of 1s is equal to the number of 0s, e.g. 010101..., 00110011..., or 0000111100001111..., but NOT 001001001...
The problem is that a pattern may not start on its boundary, so the pattern 00110011... may begin as 0110011..., and will end 1 bit shifted as well (note the 128 bits are not circular, so the start doesn't join to the end).
The 010101... case is easy: it is simply 0xAAAA... or 0x5555.... However, as the patterns get longer, the permutations get longer. Currently I use repeating shifting values such as outlined in this question Fastest way to scan for bit pattern in a stream of bits, but something quicker would be nice, as I'm spending 70% of all CPU in this routine. Other posters have solutions for general cases, but I am hoping the symmetric nature of my pattern might lead to something more optimal.
If it helps, I am only interested in patterns up to 63 bits long, and most interested in the power-of-2 patterns (0101..., 00110011..., 0000111100001111..., etc.). Patterns such as 5 ones/5 zeros are present, but these non-power-of-2 sequences are less than 0.1%, so they can be ignored if it helps the common cases go quicker.
Other constraints for a perfect solution would be a small number of assembler instructions and no wildly random memory access (i.e., large rainbow tables are not ideal).
Edit. More precise pattern details.
I am mostly interested in the patterns of 0011 and 0000,1111 and 0000,0000,1111,1111 and 16 zeros/ones and 32 zeros/ones (commas for readability only), where each pattern repeats continuously within the 128 bits. Patterns whose repeating portion is not 2, 4, 8, 16, or 32 bits long are not as interesting and can be ignored (e.g. 000111...).
The complexity in scanning is that the pattern may start at any position, not just on the 01 or 10 transition. So, for example, all of the following would match the 4-bit repeating pattern of 00001111... (commas every 4th bit for readability, ellipses mean it repeats identically):
0000,1111... or 0001,1110... or 0011,1100... or 0111,1000... or 1111,0000... or 1110,0001... or 1100,0011... or 1000,0111
Within the 128 bits the same pattern needs to repeat; two different patterns being present is not of interest. E.g. this is NOT a valid pattern: 0000,1111,0011,0011... as we have changed from 4 bits repeating to 2 bits repeating.
I have already verified that the number of 1s is 64, which is true for all power-of-2 patterns, and now need to identify how many bits make up the repeating pattern (2, 4, 8, 16, or 32) and how much the pattern is shifted. E.g. the pattern 0000,1111 is a 4-bit pattern shifted 0, while 0111,1000... is a 4-bit pattern shifted 3.
Let's start with the case where the patterns do start on their boundary. You can check the first bit and use it to determine your state. Then start looping through your block: check the first bit, increment a count, left shift, and repeat until you find that you've gotten the opposite bit. You can now use this initial length as the bitset length. Reset the count to 1, then count the next set of opposite bits. When you switch, check the length against the initial length and error out if they're not equal. Here's a quick function; it seems to work as expected for chars, and it shouldn't be too hard to expand it to deal with blocks of 32 bytes.
#include <stdio.h>

int check_block(unsigned char myblock)   /* e.g. 0x33 == 00110011 */
{
    unsigned char mask = 0x80, prod = 0x00;
    int setlen = 0, count = 0, ones = 0;
    prod = myblock & mask;
    if (prod == 0x80)
        ones = 1;
    for (int i = 0; i < 8; i++) {
        prod = myblock & mask;
        myblock = myblock << 1;
        if ((prod == 0x80 && ones) || (prod == 0x00 && !ones)) {
            count++;
        } else {
            if (setlen == 0) setlen = count;
            if (count != setlen) {
                printf("Bad block\n");
                return -1;
            }
            count = 1;
            ones = (ones == 1) ? 0 : 1;
        }
    }
    printf("Good block with %d repeating bits\n", setlen);
    return setlen;
}
Now to deal with blocks where there's an offset, I'd suggest counting the number of bits until the first 'flip'. Store this number, then run the above routine until you hit the last segment which should have length unequal to the rest of the sets. Add the initial bits to the last segment's length, and then you should be able to compare it with the size of the rest of the sets correctly.
This code is pretty small, and bit shifting through a buffer shouldn't require too much work on the CPU's part. I'd be interested to see how this solution ends up performing against your current one.
The generic solution for this kind of problem is to create a good hash function for the patterns and store each pattern in a hash map. Once you have created the hash map for the patterns, try to look it up with the input stream. I don't have code yet, but let me know if you get stuck on the code: post it and I can work on it.
I've thought about making a state machine, where every next byte (out of 16) would advance its state, and after some 16 state transitions you'd have the pattern identified. But that doesn't look very promising: the data structures and logic would be more complex.
Instead, why not precompute all those 126 patterns (from 01 to 32 zeroes + 32 ones), sort them and perform binary search? That would give you at most 7 iterations of binary search. And you don't need to store all 16 bytes of every pattern as its halves are identical. That gives you 126*16/2=1008 bytes for the array of patterns. You also need something like 2 bytes per pattern to store the length of zero (one) runs and the shift relative to whatever pattern you consider unshifted. That's a total of 126*(16/2+2)=1260 bytes of data (should be gentle on the data cache) and very simple and tiny binary search algorithm. Basically, its just an improvement over the answer that you mentioned in the question.
You might want to try switching to linear search after 4-5 iterations of binary search. That may give a small boost to the overall algorithm.
Ultimately, the winner is determined by testing/profiling. And that's what you should do, get a few implementations and compare them on the real data in the real system.
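A sketch of that precompute-and-search approach (Python for brevity; a C version would binary-search a sorted array of 16-byte values the same way). Note that the enumeration naturally yields the 126 patterns mentioned above:

from bisect import bisect_left

def make_pattern(run, shift):
    # 128-bit value: 'run' zeros then 'run' ones, repeated, rotated by shift
    bits = ([0] * run + [1] * run) * (128 // (2 * run))
    bits = bits[shift:] + bits[:shift]
    v = 0
    for b in bits:
        v = (v << 1) | b
    return v

# precompute once: 2+4+8+16+32+64 = 126 (value, run, shift) triples, sorted
table = sorted((make_pattern(r, s), r, s)
               for r in (1, 2, 4, 8, 16, 32) for s in range(2 * r))
keys = [t[0] for t in table]

def classify(block):   # block is the 128-bit value as a Python int
    i = bisect_left(keys, block)
    if i < len(keys) and keys[i] == block:
        _, run, shift = table[i]
        return run, shift
    return None

print(classify(int("01111000" * 16, 2)))   # (4, 3): 4-bit pattern, shifted 3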
The restriction that the pattern repeats itself all over the 128-bit stream makes the number of combinations limited, and the sequence will also have properties that make it easy to check:
One needs to iteratively check whether the high and low parts are the same; if they are opposites, check whether that particular length contains consecutive ones.
8-bit repeat at offset 3: 00011111 11100000 00011111 11100000
==> high and low 16 bits are the same
00011111 11100000 ==> high and low parts are inverted.
Not same, nor inverted means rejection of pattern.
At that point one needs to check if the ones form a single consecutive run: adding the run's lowest bit to the value must yield a power of two, and n==(n & -n) is the textbook single-set-bit check for that.
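A sketch of that halving check in Python (ints stand in for the 128-bit block; the same steps translate to a handful of compares, XORs and one bit trick in C):

def run_length(block, width=128):
    # returns the repeating run length (1, 2, 4, 8, 16, 32) or None
    def single_run(n):
        m = n + (n & -n)    # adding the lowest set bit collapses one
        return n != 0 and (m & (m - 1)) == 0   # run of ones into one bit
    while width >= 2:
        half = width // 2
        mask = (1 << half) - 1
        hi, lo = block >> half, block & mask
        if hi == lo:                 # halves identical: keep halving
            block, width = lo, half
        elif hi == lo ^ mask:        # halves inverted: this chunk is one period
            full = (1 << width) - 1  # valid iff the ones form a single
            if single_run(block) or single_run(block ^ full):   # (circular) run
                return half
            return None
        else:
            return None              # neither same nor inverted: reject
    return None

blk = int(("00011111" + "11100000") * 8, 2)   # 8-bit repeat at offset 3
print(run_length(blk))   # 8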

Lempel-Ziv 76 complexity

Can someone explain Lempel-Ziv 76 complexity to me? I was under the impression that you initialize with the first letter of the string in your dictionary, then check subsequent blocks for existence in the previous substring, growing one letter each time a substring is found. If no substring exists in the previous substring, that substring is called a block, and the next letter becomes the next substring to be searched.
For example,
01011010001101110010
0|1
since 1 is not in 0, we get 0|1|0
since 0 is in 01, we get 0|1|01
since 01 is in 01, we get 0|1|011|0
since 0 is in 01011, we get 0|1|011|01
since 01 is in 01011, we get 0|1|011|010
since 010 is in 01011, we get 0|1|011|0100|0
and so on until, we get 0|1|011|0100|011011|1001|0,
where the last letter can be a repeat if necessary.
What am I doing wrong? Because I am being told that for a string 1111111, the decomposition is 1|111111. Thanks!
I agree with your decomposition:
01011010001101110010 -> 0.1.011.0100.011011.1001.0
I also believe that what you were told is correct:
1111111 -> 1.111111
However, how you arrived at your original decomposition is not quite right! Hence the confusion about decomposing 1111111. I think, according to your reasoning:
1111111 -> 1.11.1111
which I'm pretty sure is not correct.
Extensions to the existing sequence of words (blocks) are not as simple as checking whether the extension previously appears as a substring of the previous history. It boils down to the concept of reproducibility of an extension that Lempel and Ziv describe in their paper On the Complexity of Finite Sequences (I'm assuming that's what you're working from!). An extension is formed such that it is the shortest word that is not reproducible from the previous history. The concept of reproducibility that they describe is rather complicated, but it boils down to being able to find a starting position in the previous history, from which you can sequentially copy symbols from that starting position onto the end of the history, to form the extension.
From your original sequence, assume the symbols have positions from 1 to 20. The first symbol is always a word/block by itself:
0.
The next extension is not reproducible from the previous history:
0.1.
The next extension is:
0.1.011.
The reason why it can't be 0 or 01 is as follows: 0 is reproducible from the previous history, by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 011 is not reproducible.
The next extension is:
0.1.011.0100.
0 is reproducible by copying one symbol from position 1 to the end; 01 is reproducible by copying two symbols from position 1 to the end; 010 is reproducible by copying three symbols from position 1 to the end; 0100 is not reproducible.
And so on.
The decomposition of 1111111 is as follows: the first symbol is a block by itself:
1.
The next extension is:
1.111111
1 is reproducible by copying one symbol from position 1 to the end of the previous history. 11 is reproducible by copying two symbols from position 1 to the end. This is where it gets complicated - in this case, when you copy two symbols, the source of the copy actually extends past the end of the previous history! In other words, the second 1 that you copy is actually part of the extension that results from copying the first 1 onto the end. Similarly, 111, 1111, 11111, 111111 are all reproducible, due to this recursive copying process. However, since we have now reached the end of the original sequence, the last extension is deemed to be 111111.
Hopefully my explanation has made some sense.
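For reference, here is that reproducibility rule sketched in Python (a brute-force illustration, not an efficient implementation; the copy source j may run past the start of the new word, which is exactly the recursive copying described above):

def lz76_words(s):
    words, i = [], 0
    while i < len(s):
        l = 1
        # grow while s[i:i+l] can be copied from an earlier start j < i
        while i + l <= len(s) and any(s[j:j + l] == s[i:i + l]
                                      for j in range(i)):
            l += 1
        words.append(s[i:min(i + l, len(s))])   # last word may stay reproducible
        i += len(words[-1])
    return words

print(lz76_words("01011010001101110010"))
# ['0', '1', '011', '0100', '011011', '1001', '0']
print(lz76_words("1111111"))
# ['1', '111111']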
This paper does not agree with your description of the algorithm. According to the paper, you have a new partition if it doesn't match any previous partition. You don't get to make partitions based on the entire unpartitioned preceding sequence, as you have done. So for your examples (I am using . instead of | to partition, since that's easier to read):
01011010001101110010
partitions into:
    0.1.01.10.100.011.0111.00.10
so the LZ76 weight is 9 (not 7).
1111111
partitions into:
1.11.111.1
They both provide an example of the case where the final partition is contained in a previous one. Hence the < r instead of <= r in the definition.
I don't have the original paper, so I can't check whether this paper got it wrong or not. However I doubt that they incorrectly copied the definition from the original paper.

Corner case in Huffman Algorithm

I am trying to solve by hand two different scenarios of compression via the Huffman algorithm. In each one of them, we start with an ordered queue that contains (symbol, frequency) tuples as its items, and we'll try to form a dictionary:
step 0:
[c:3] [b:4] [a:5]
step 1:
[a:5] [7]
[c:3] [b:4]
step 2:
[12]
[a:5] [7]
[c:3] [b:4]
If we consider the left to be 0 and the right to be 1, then we have as a dictionary:
a -> 0
b -> 11
c -> 10
Until now, everything looks right. But let's assume our initial queue was something like the following, instead:
step 0:
[c:1] [b:2] [a:4]
step 1:
[3] [a:4]
[c:1] [b:2]
step 2:
[7]
[3] [a:4]
[c:1] [b:2]
that yields the following dictionary:
a -> 1
b -> 01
c -> 00
which doesn't seem right (read as binary numbers, a's code 1 and b's code 01 are equal).
What am I doing wrong? I could just do an if to see in the root of the tree which one of the branches is actually a leaf, selecting that direction to be the 1's direction, but this doesn't seem clean to me. I guess I must be missing something?
The bit sequences are not equal. If you have a bit string like:
01100
Then it can only be decompressed as "bac". You have to store the compression result in a way that preserves leading zeroes in the sequence, so for example the above could be padded to 01100000 or 00001100 to fill a byte of output and then stored along with the length 5. Of course the issue is only with the first or last byte of output, depending on which side you choose to pad.
The point is that the sequence of bits in the dictionary should not cause ambiguity when decoding. The value of the sequence doesn't matter.
Ambiguity is resolved in Huffman coding by the condition that only the leaf in the coding tree can hold the character to be encoded. The tree above follows this rule, so there is no problem with the resulting encoding.
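A quick way to convince yourself in Python (decoding walks the codes as a prefix-free set, and the leading zero of b's code is preserved because we operate on a bit string, never on its numeric value):

code = {'a': '1', 'b': '01', 'c': '00'}
leaves = {v: k for k, v in code.items()}

def decode(bits):
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in leaves:      # reached a leaf: no code prefixes another
            out.append(leaves[cur])
            cur = ''
    return ''.join(out)

print(decode('01100'))   # 'bac', the only possible reading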