JPEG Huffman Table - c++

I have a question about the JPEG Huffman table and how to use it to construct the symbol/binary strings from a tree. Suppose that in a Huffman table the number of codes for the 3-bit code length is greater than 6; how do we add all those codes to the tree? If I am correct, only 6 codes can be added at the 3-bit level/depth of the tree. So how do we add the remaining codes if they won't fit in that level? Do we just ignore them?
Example
code length | Total Codes | Codes
3-Bit | 10 | 25 43 34 53 92 A2 B2 63 73 C2
In the above example if we go by order of constructing symbols/binary string for the code then up 'til A2 we can add codes in the tree at level 3-Bit, but what about B2,63,73,C2 etc? It's not possible to add them at 3-Bit level of the tree? So what do we do with them?

Well, clearly, the absolutely highest number of "things" that can be represented in 3 bits is 8 - (000, 001, 010, 011, 100, 101, 110, 111).
In Huffman encoding, each bit represents "left" or "right" in a trie data structure. To be able to "continue", you have to reserve SOME codes for "this continues another level", which is why not all 8 values can be encoded in 3 bits. If you have more values to encode, you need to use more bits for some values. This is the whole point of Huffman coding: SOME combinations are short and others are longer, sometimes even longer than the original, but because the code is based on what is most common, that's fine, since the long ones will be rare.
How to construct and decode a Huffman tree is about four-five pages in your typical Algorithms book, and if you haven't got one of those, you probably want to find one - either a real paper one, or an e-book. There are LOTS of them - I'm not going to recommend one, since the ones I have are all about 15+ years old.
I should add that I think your question is missing something. Clearly, 3 bits cannot possibly represent 10 values. And you can't build a [meaningful] Huffman tree on 10 values that are all different - unless the idea is to split the values into pairs of {2,5}, {4,3}, {3,4}, {5,3}, {9,2}, {A,2}, {B,2}, {6,3}, {7,3}, {C,2} - which gives a fair number of repeated values. The frequencies of those are:
2 : 5
3 : 5
4 : 2
5 : 2
6 : 1
7 : 1
9 : 1
A : 1
B : 1
C : 1
But that's still too many to represent anything meaningful...
Or is it the other way around, that we are supposed to use the bit values of those to decode? In which case we'd need the tree built from the original data to decode it...

In JPEG, a Huffman code can be up to 16 bits long. The DHT marker contains an array of 16 elements giving the number of codes for each length.
The JPEG standard explains how to use the code counts to do the Huffman translation. It is one of the few things explained in detail.
This book explains how it is done from a programmer's perspective.
JPEG Book
The number of codes that exists at any code length depends upon the counts for other lengths.
I am wondering if you are really looking at the count of codes for length 4 rather than 3.

It looks like you're not following the correct procedure when creating your Huffman codes from the JPEG table. The count provided will fit in the number of bits unless the table has been corrupted. Reading the codes out of a DHT marker is really simple. The more complicated part is how you define your lookup table from that data.
A logical (but not practical) way is to create a reverse lookup table that's sized for the maximum code length (16 bits = 65536 entries in the table). Then, to decode your JPEG data, just pick up 16 bits of compressed data from the input stream and use them as an index into the table, where you'll find the symbol and the actual length of the code.
I came up with a way to use a single, much smaller lookup table. I'm not going to share my specific code table method. What I will share is the basic format of the loop to create the codes from a DHT marker:
int iCurrentCode; // the current Huffman code
int iLength; // the code length in bits that you're working on
int i;
int iCount; // the number of codes defined for this length
int iSymbol; // JPEG symbol defined for each Huffman code
unsigned char *pCounts; // set by the caller: the 16 count bytes in the DHT marker (just after the table class/ID byte)
unsigned char *pSymbols; // the symbol values, which are stored after all 16 counts
pSymbols = pCounts + 16;
iCurrentCode = 0; // start with a Huffman code of 0
for (iLength = 1; iLength <= 16; iLength++)
{
    iCount = *pCounts++; // get number of symbols for this bit length
    for (i = 0; i < iCount; i++) // read each of the codes for this bit length
    {
        iSymbol = *pSymbols++; // get the JPEG symbol value (e.g. RRRR/SSSS value)
        // It's up to you to create a lookup table from the code and its value
        iCurrentCode++; // the Huffman bit pattern just increments for each code value
    } // for each code defined at this bit length
    iCurrentCode <<= 1; // shift the code left 1 bit to advance to the next bit length
} // for each bit length
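To make the effect of that loop visible, here is a minimal self-contained sketch that turns the 16 counts and the symbol list into printable bit strings. The names HuffCode and buildJpegCodes are invented for this illustration and do not come from any JPEG library; the caller is assumed to have already split the DHT payload into counts and symbols.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: one symbol and its code as a string of '0'/'1'.
struct HuffCode {
    std::uint8_t symbol;
    std::string bits;  // most significant bit first
};

// counts[i] is the number of codes of length i + 1.
// symbols lists the symbol values in order of increasing code length.
std::vector<HuffCode> buildJpegCodes(const std::uint8_t counts[16],
                                     const std::vector<std::uint8_t>& symbols) {
    std::vector<HuffCode> out;
    unsigned code = 0;     // current code value, as in the loop above
    std::size_t next = 0;  // index of the next unread symbol
    for (int length = 1; length <= 16; ++length) {
        for (int i = 0; i < counts[length - 1] && next < symbols.size(); ++i) {
            std::string bits;
            for (int b = length - 1; b >= 0; --b)
                bits += ((code >> b) & 1u) ? '1' : '0';
            out.push_back(HuffCode{symbols[next++], bits});
            ++code;        // next code of the same length
        }
        code <<= 1;        // move on to the next code length
    }
    return out;
}
```

With counts of 0, 1, 5, 1, 1, 1, 1, 1, 1 followed by zeros and symbols 0 through 11, this yields 00 for the first symbol, then 010, 011, 100, 101, 110 for the five 3-bit codes, then 1110, and so on. How many codes fit at a given length depends on how many shorter codes were already handed out.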


How is a Huffman compression decoded?

I'm struggling to understand how to decode, say, a text file that has been compressed using Huffman's method. Let's say I'm reading a text file: I get a list of all the characters and the frequency with which they occur, I create a Huffman tree, and all the characters have a specific code assigned to them. Say,
a: 110
b: 11
c: 010
etc.
When I want to decompress this text file and print/read its contents, how do I do that? How do I know if the file reads "abc" or "bac"?
A small solution I made up was after the Huffman tree has been created, I read the file all over again and create an array to insert every character code as I read it.
Say, a while loop where I read a character until I've reached EOF.
character = a; insert 110 into array. Character = b; insert 11 into array until we are left with 11011010.
But I feel like there should be a better way.
EDIT: The codes for a,b, and c are random, not actual Huffman codes. I put in random ones as it's irrelevant for the question, I'm only interested in how it would be decoded with or without a real life example. But here's an example of Huffman code for "Hello World."
l: 11
o: 001
H: 100
e: 0101
spacebar: 0000
w: 0001
r: 101
d: 011
.: 0100
A Huffman code is a prefix code, which means that no code can be a prefix of any other code. Your example of a Huffman code is most definitely not a Huffman code. There you have 11 (b), which is a prefix of 110 (a). That cannot be the result of a correct implementation of Huffman's algorithm.
Update for question update:
You are incorrect. The codes are extremely relevant for the question. The examples you gave cannot be unambiguously decoded.
Second update of question:
It is still not clear what you're asking, but here is an answer to the question: "How do I decode a stream of bits that are a Huffman-coded sequence of symbols?"
Here is the tree for the example prefix code:
You see that if you follow any sequence of branches to a symbol, the branches you followed are the bits in that code. That is exactly how you decode the incoming stream of bits.
1. Start at the node at the top of the tree.
2. Read one bit from the stream.
3. Follow the branch for that value, left for 0, right for 1.
4. If you arrive at another node, go to step 2.
5. Otherwise, emit the symbol in the leaf, and go to step 1.
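The steps above can be sketched in C++ roughly as follows. This is only an illustration: the Node layout and the function names insertCode and decodeBits are made up here, and the bit stream is represented as a string of '0' and '1' characters to keep the example short.

```cpp
#include <memory>
#include <string>

// Hypothetical tree node: child[0] is the "0" branch, child[1] the "1" branch.
struct Node {
    std::unique_ptr<Node> child[2];
    bool isLeaf = false;
    char symbol = '\0';
};

// Add one code (e.g. "0101" for 'e') to the tree, creating nodes as needed.
void insertCode(Node& root, const std::string& bits, char symbol) {
    Node* node = &root;
    for (char b : bits) {
        int branch = (b == '1') ? 1 : 0;
        if (!node->child[branch]) node->child[branch].reset(new Node());
        node = node->child[branch].get();
    }
    node->isLeaf = true;
    node->symbol = symbol;
}

// Walk the tree bit by bit; emit a symbol at each leaf and restart at the root.
std::string decodeBits(const Node& root, const std::string& bits) {
    std::string out;
    const Node* node = &root;
    for (char b : bits) {
        node = node->child[b == '1' ? 1 : 0].get();
        if (!node) break;  // this bit sequence is not part of the code
        if (node->isLeaf) {
            out += node->symbol;
            node = &root;
        }
    }
    return out;
}
```

Using the "Hello World." codes from the question (H: 100, e: 0101, l: 11, o: 001), the bits 10001011111001 decode to "Hello". No separators are needed because no code is a prefix of another.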

Calculating bytes from sextets in C/C++ for steganography

This should be a simple question. I'm well advanced in developing steganographic code in C, which required manipulating the least significant bit in each R, G, and B channel of a 24 bit (3 byte) pixel of an image. A pair of pixels has 6 bits (which I call a sextet for want of a better word) that can be used, and I have developed code that converts a buffer in bytes to a buffer in sextets, where each byte in the latter buffer only uses the 6 lower order bits, with the upper 2 bits being discarded when changing pixels. This all works correctly, and I can encode text in any language in an image.
In doing this the application calculates the number of sextets that can be embedded in an image. However, it is useful to know how many bytes can be processed, as both the input is originally in bytes, and the output is recovered in bytes. As 4 sextets correspond to 3 bytes, I'm using the statement:
maxNumBytes = (3 * maxNumSexts - 2 * (maxNumSexts % 4)) / 4;
which converts and rounds down to a multiple of 3, where maxNumSexts and maxNumBytes are respectively the maximum number of sextets and bytes that can be hidden in an RGB image, and these two variables have the type int32_t. This formula works but is rather cumbersome, and I was wondering if someone could find something simpler that works correctly.
Incidentally, although the code is in C, this applies exactly in C++, hence that has been included as a tag, and some C++ code may be added later.
Many thanks for any suggestions.
I want all values between 24 and 27 to evaluate to 18, and likewise values between 28 and 31 to evaluate to 21, etc.
Since you want only multiples of 3, the last operation should be the multiplication by 3. And the "steps" on the input value is by 4 increments. So you can use this formula in integer arithmetic:
maxNumBytes = 3 * (maxNumSexts / 4);
Note 1: However, the actual number of whole bytes encoded by 27 sextets is 20, because 27 sextets contain 27 * 6 = 162 bits.
Note 2: Yes, a half byte is called a "nibble", from the verb. The form "nybble" is known, but rarely used.
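The two formulas agree because, writing maxNumSexts as 4q + r with r between 0 and 3, the original expression becomes (12q + r) / 4, which truncates to 3q. A small sketch to check this numerically (the function names are invented for the illustration):

```cpp
#include <cstdint>

// The formula from the question.
std::int32_t bytesFromSextetsOriginal(std::int32_t maxNumSexts) {
    return (3 * maxNumSexts - 2 * (maxNumSexts % 4)) / 4;
}

// The simplified formula: whole groups of 4 sextets, 3 bytes each.
std::int32_t bytesFromSextets(std::int32_t maxNumSexts) {
    return 3 * (maxNumSexts / 4);
}

// Whole bytes actually contained in the bits (see Note 1), not rounded to a multiple of 3.
std::int32_t wholeBytesInSextets(std::int32_t maxNumSexts) {
    return (6 * maxNumSexts) / 8;
}
```

For non-negative inputs the first two functions return the same value, for example 18 for every input from 24 to 27 and 21 for 28 to 31.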

How to save a Huffman table in a file in a way that uses the least storage?

This is my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.
I'm writing a Huffman coder in C++ and have saved the characters and codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table: (Made by huffman tree)
Table
Now, I want to save this table to a file in the best way.
I can't save like this: A1B001C010D001E000
When it is changed to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save the table in the normal way, every character uses 8 bits to save its code,
while my characters have 1-bit or 3-bit codes (in this case).
This way uses a lot of storage.
My idea is to add a separator character and assign a code to it.
If we add a separator character, build the Huffman tree, and write out the codes, we get a table like this:
table2
Now, we can write codes in this way.
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found, so A = everything before sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found, so B = everything before sep = 110;
And so on ...
This way still uses a little storage for the separators (number of characters * separator size).
My question: is there a way to save the first table in a file using less storage?
For example like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
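As a rough sketch of why the lengths are enough: in a canonical Huffman code, the decoder sorts the symbols by (length, symbol value) and hands out consecutive code values, so both sides rebuild the same codes from the lengths alone. The function names below are hypothetical, and the codes it produces are the canonical ones, which can differ from the codes in the original table while having the same lengths.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Render the low `len` bits of `code` as a string of '0'/'1'.
std::string toBits(unsigned code, int len) {
    std::string bits;
    for (int b = len - 1; b >= 0; --b)
        bits += ((code >> b) & 1u) ? '1' : '0';
    return bits;
}

// Assign canonical codes given only each symbol's code length (0 = unused).
std::map<char, std::string> canonicalCodes(const std::map<char, int>& lengths) {
    std::vector<std::pair<int, char> > order;
    for (const auto& kv : lengths)
        if (kv.second > 0) order.push_back(std::make_pair(kv.second, kv.first));
    std::sort(order.begin(), order.end());  // by length, then by symbol

    std::map<char, std::string> codes;
    unsigned code = 0;
    int prevLen = 0;
    for (const auto& entry : order) {
        code <<= (entry.first - prevLen);   // widen when the length increases
        codes[entry.second] = toBits(code, entry.first);
        ++code;                             // next code of this length
        prevLen = entry.first;
    }
    return codes;
}
```

For the lengths in the question (A: 1, B: 3, C: 3, D: 3, E: 3) this gives A = 0, B = 100, C = 101, D = 110, E = 111.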
You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code. Because you're working with bytes, there are 256 possible values, and the Huffman tree can only reach a certain depth (number of possible values - 1 = 255), so 8 bits are enough to store each code length.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, 0s for nodes and 1s for leaves. Then, after you store the nodes and the leaves, you store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
          *
         / \
        *   A
       / \
      /   \
     *     *
    / \   / \
   E   D C   B
So you would store the shape of the tree as such:
000110111EDCBA
The reason why storing the Huffman codes in this way is better when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits (2n - 1 bits for the shape plus 8n bits for the leaf values, where n is the number of unique characters in the data you are trying to compress), while storing the code lengths costs a flat 2048 bits. Therefore, for numbers of unique characters less than 205, storing the shape of the tree is more efficient, and because the average English string of text isn't going to use all that many of the 256 possible byte values, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.
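A minimal sketch of the tree-shape format, assuming a pre-order walk that writes 0 for an internal node and 1 for a leaf. The type and function names are invented for illustration, and the shape is kept as a string of '0'/'1' characters here, whereas a real file would pack those into bits.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical tree node: a leaf has no children and carries a symbol.
struct TreeNode {
    char symbol = '\0';
    std::unique_ptr<TreeNode> left, right;
};

std::unique_ptr<TreeNode> makeLeaf(char c) {
    std::unique_ptr<TreeNode> n(new TreeNode());
    n->symbol = c;
    return n;
}

std::unique_ptr<TreeNode> makeInternal(std::unique_ptr<TreeNode> l,
                                       std::unique_ptr<TreeNode> r) {
    std::unique_ptr<TreeNode> n(new TreeNode());
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Pre-order walk: '0' for an internal node, '1' for a leaf (symbol goes to `leaves`).
void serializeTree(const TreeNode& n, std::string& shape, std::string& leaves) {
    if (!n.left && !n.right) {
        shape += '1';
        leaves += n.symbol;
        return;
    }
    shape += '0';
    serializeTree(*n.left, shape, leaves);
    serializeTree(*n.right, shape, leaves);
}

// Rebuild the tree by reading the shape and the leaf symbols in the same order.
std::unique_ptr<TreeNode> deserializeTree(const std::string& shape,
                                          const std::string& leaves,
                                          std::size_t& shapePos,
                                          std::size_t& leafPos) {
    if (shape[shapePos++] == '1') return makeLeaf(leaves[leafPos++]);
    std::unique_ptr<TreeNode> l = deserializeTree(shape, leaves, shapePos, leafPos);
    std::unique_ptr<TreeNode> r = deserializeTree(shape, leaves, shapePos, leafPos);
    return makeInternal(std::move(l), std::move(r));
}
```

For the tree drawn above, serializeTree produces the shape 000110111 and the leaves EDCBA, which is 9 + 5 * 8 = 49 bits when packed, matching 10n - 1 for n = 5.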

Find a repeating symmetric bit pattern in a small stream of 128 bits

How can I quickly scan groups of 128 bits for exactly repeating binary patterns, such as 010101... or 0011001100...?
I have a number of 128-bit blocks and wish to see if they match patterns where the number of 1s is equal to the number of 0s, e.g. 010101... or 00110011... or 0000111100001111..., but NOT 001001001...
The problem is that patterns may not start on their boundary, so the pattern 00110011... may begin as 0110011..., and will then also end shifted by 1 bit (note the 128 bits are not circular, so the start doesn't join to the end).
The 010101... Case is easy, it is simply 0xAAAA... Or 0x5555.... However as the patterns get longer, the permutations get longer. Currently I use repeating shifting values such as outlined in this question Fastest way to scan for bit pattern in a stream of bits but something quicker would be nice, as I'm spending 70% of all CPU in this routine. Other posters have solutions for general cases but I am hoping the symmetric nature of my pattern might lead to something more optimal.
If it helps, I am only interested in patterns up to 63 bits long, and most interested in the power of 2 patterns (0101... 00110011... 0000111100001111... Etc) while patterns such as 5 ones/5 zeros are present, these non power 2 sequences are less than 0.1%, so can be ignored if it helps the common cases go quicker.
Other constraints for a perfect solution would be small number of assembler instructions, no wildly random memory access (ie, large rainbow tables not ideal).
Edit. More precise pattern details.
I am mostly interested in the patterns of 0011 and 0000,1111 and 0000,0000,1111,1111 and 16 zeros/ones and 32 zeros/ones (commas for readability only), where each pattern repeats continuously within the 128 bits. Patterns whose runs are not 2, 4, 8, 16 or 32 bits long are not as interesting and can be ignored (e.g. 000111...).
The complexity for scanning is that the pattern may start at any position, not just on the 01 or 10 transition. So for example, all of the following would match the 4 bit repeating pattern of 00001111... (commas every 4th bit for readability) (ellipses means repeats identically)
0000,1111.... Or 0001,1110... Or 0011,1100... Or 0111,1000... Or 1111,0000... Or 1110,0001... Or 1100,0011... Or 1000,0111
Within the 128bits, the same pattern needs to repeat, two different patterns being present is not of interest. Eg this is NOT a valid pattern. 0000,1111,0011,0011... As we have changed from 4 bits repeating to 2 bits repeating.
I have already verified the number of 1s is 64, which is true for all power 2 patterns, and now need to identify how many bits make up the repeating pattern (2,4,8,16,32) and how much the pattern is shifted. Eg pattern 0000,1111 is a 4 bit pattern, shifted 0. While 0111,1000... Is a 4 bit pattern shifted 3.
Let's start with the case where the patterns do start on their boundary. You can check the first bit and use it to determine your state. Then start looping through your block: check the first bit, increment a count, left shift, and repeat until you find that you've gotten the opposite bit. You can now use this initial length as the bitset length. Reset the count to 1, then count the next set of opposite bits. When you switch, check the length against the initial length and error out if they're not equal. Here's a quick function - it seems to work as expected for chars, and it shouldn't be too hard to expand it to deal with 128-bit (16-byte) blocks.
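#include <stdio.h>

// Returns the run length if the 8-bit block (e.g. 0x33) alternates in equal runs, or -1.
// Note: the final run is not compared, and setlen stays 0 if the bits never flip.
int check_block(unsigned char myblock)
{
    unsigned char mask = 0x80, prod = 0x00;
    int setlen = 0, count = 0, ones = 0;
    prod = myblock & mask;
    if (prod == 0x80)
        ones = 1; // the block starts with a 1 bit
    for (int i = 0; i < 8; i++) {
        prod = myblock & mask; // test the current top bit
        myblock = myblock << 1;
        if ((prod == 0x80 && ones) || (prod == 0x00 && !ones)) {
            count++; // still inside the current run
        } else {
            if (setlen == 0) setlen = count; // the first run fixes the expected length
            if (count != setlen) {
                printf("Bad block\n");
                return -1;
            }
            count = 1;
            ones = !ones;
        }
    }
    printf("Good block with %d repeating bits\n", setlen);
    return setlen;
}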
Now to deal with blocks where there's an offset, I'd suggest counting the number of bits until the first 'flip'. Store this number, then run the above routine until you hit the last segment which should have length unequal to the rest of the sets. Add the initial bits to the last segment's length, and then you should be able to compare it with the size of the rest of the sets correctly.
This code is pretty small, and bit shifting through a buffer shouldn't require too much work on the CPU's part. I'd be interested to see how this solution ends up performing against your current one.
The generic solution for this kind of problem is to create a good hashing function for the patterns and store each pattern in a hash map. Once you have the hash map created for the patterns, look up each block from the input stream in the table. I don't have code yet, but let me know if you get stuck in the code. Please post it and I can work on it.
I've thought about making a state machine, so every next byte (out of 16) would advance its state and after some 16 state transitions you'd have the pattern identified. But that doesn't look very promising. Data structures and logic look more complex.
Instead, why not precompute all those 126 patterns (from 01 to 32 zeroes + 32 ones), sort them and perform binary search? That would give you at most 7 iterations of binary search. And you don't need to store all 16 bytes of every pattern as its halves are identical. That gives you 126*16/2=1008 bytes for the array of patterns. You also need something like 2 bytes per pattern to store the length of zero (one) runs and the shift relative to whatever pattern you consider unshifted. That's a total of 126*(16/2+2)=1260 bytes of data (should be gentle on the data cache) and a very simple and tiny binary search algorithm. Basically, it's just an improvement over the answer that you mentioned in the question.
You might want to try switching to linear search after 4-5 iterations of binary search. That may give a small boost to the overall algorithm.
Ultimately, the winner is determined by testing/profiling. And that's what you should do, get a few implementations and compare them on the real data in the real system.
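A minimal sketch of the precomputed-table idea, under some assumptions that are mine rather than the question's: bits are numbered from the most significant bit, the "unshifted" pattern for run length r starts with r zeros, and a std::map stands in for the sorted array plus binary search. The names PatternInfo, makePattern, buildPatternTable and classifyBlock are invented for the example.

```cpp
#include <cstdint>
#include <map>

// Hypothetical result record: length of each run and the shift of the pattern.
struct PatternInfo {
    int runLength;
    int shift;
};

// 64-bit word whose bit i (counted from the most significant bit)
// is ((i + shift) / runLength) % 2, e.g. runLength 4, shift 3 gives 0111 1000 ...
std::uint64_t makePattern(int runLength, int shift) {
    std::uint64_t w = 0;
    for (int i = 0; i < 64; ++i) {
        std::uint64_t bit = static_cast<std::uint64_t>(((i + shift) / runLength) % 2);
        w = (w << 1) | bit;
    }
    return w;
}

// All run lengths 1, 2, 4, ..., 32 with every shift: 2 + 4 + ... + 64 = 126 entries.
std::map<std::uint64_t, PatternInfo> buildPatternTable() {
    std::map<std::uint64_t, PatternInfo> table;
    for (int r = 1; r <= 32; r *= 2) {
        for (int s = 0; s < 2 * r; ++s) {
            PatternInfo info;
            info.runLength = r;
            info.shift = s;
            table[makePattern(r, s)] = info;
        }
    }
    return table;
}

// The period divides 64, so both halves of a matching 128-bit block are identical.
bool classifyBlock(const std::map<std::uint64_t, PatternInfo>& table,
                   std::uint64_t hi, std::uint64_t lo, PatternInfo& out) {
    if (hi != lo) return false;
    std::map<std::uint64_t, PatternInfo>::const_iterator it = table.find(hi);
    if (it == table.end()) return false;
    out = it->second;
    return true;
}
```

With this convention, a block made of 0x78 bytes (0111 1000 repeated) is classified as run length 4 with shift 3. Whether a map lookup beats a sorted array or a hash table would need to be measured on real data.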
The restriction that the pattern repeats itself across the whole 128-bit stream limits the number of combinations, and the sequence also has properties that make it easy to check:
One needs to iteratively check if high and low parts are same; if they are opposites, check if that particular length contains consecutive ones.
8-bit repeat at offset 3: 00011111 11100000 00011111 11100000
==> high and low 16 bits are the same
00011111 11100000 ==> high and low parts are inverted.
Not same, nor inverted means rejection of pattern.
At that point one needs to check if there's a sequence of ones -- add '1' to the left side and check if it's power of two: n==(n & -n) is the textbook check for that.

Huffman code with lookup table

I have read an article on the Internet and know that the natural way of decoding is to traverse from the root, but I want to do it faster with a lookup table.
After reading it, I still don't get the point.
ex:
input:"abcdaabaaabaaa"
code data
0 a
10 b
110 c
111 d
The article says that, because the codes have variable length, it determines the length by taking a string of bits of the max code length and using it as an index.
output:"010110111001000010000"
Index  Index(binary)  Code  Bits required
0      000            a     1
1      001            a     1
2      010            a     1
3      011            a     1
4      100            b     2
5      101            b     2
6      110            c     3
7      111            d     3
My questions are:
What does it mean that, due to the variable length, it determines the length by taking a bit string of max code length? How is the length determined?
How do I generate the lookup table and how do I use it? What is the algorithm behind it?
For your example, the maximum code length is 3 bits. So you take the first 3 bits from your stream (010) and use that to index the table. This gives code, 'a' and bits = 1. You consume 1 bit from your input stream, output the code, and carry on. On the second go around you will get (101), which indexes as 'b' and 2 bits, etc.
To build the table, make it as large as 1 << max_code_length, and fill in details as if you are decoding the index as a huffman code. If you look at your example all the indices which begin '0' are a, indices beginning '10' are b, and so on.
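Here is a small sketch of both steps, building the table and using it to decode. The struct and function names are invented for the illustration, the input is a string of '0'/'1' characters rather than packed bits, and the end of the input is padded with zero bits when fewer than max_code_length bits remain.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical table entry: decoded symbol and how many bits its code uses.
struct Entry {
    char symbol;
    int length;  // 0 marks an unused slot
};

// Hypothetical input record: a symbol and its code as a '0'/'1' string.
struct Code {
    char symbol;
    std::string bits;
};

// Every index that starts with a code's bits maps to that code's symbol.
std::vector<Entry> buildLookup(const std::vector<Code>& codes, int maxLen) {
    std::vector<Entry> table(std::size_t(1) << maxLen, Entry{'\0', 0});
    for (const Code& c : codes) {
        int len = static_cast<int>(c.bits.size());
        unsigned prefix = 0;
        for (char ch : c.bits) prefix = (prefix << 1) | (ch == '1' ? 1u : 0u);
        unsigned first = prefix << (maxLen - len);  // code followed by all zeros
        unsigned count = 1u << (maxLen - len);      // number of possible suffixes
        for (unsigned i = 0; i < count; ++i) table[first + i] = Entry{c.symbol, len};
    }
    return table;
}

// Peek maxLen bits, look them up, then consume only the bits the code really used.
std::string decodeWithLookup(const std::vector<Entry>& table, int maxLen,
                             const std::string& bits) {
    std::string out;
    std::size_t pos = 0;
    while (pos < bits.size()) {
        unsigned index = 0;
        for (int k = 0; k < maxLen; ++k) {
            std::size_t p = pos + k;
            unsigned bit = (p < bits.size() && bits[p] == '1') ? 1u : 0u;
            index = (index << 1) | bit;
        }
        const Entry& e = table[index];
        if (e.length == 0) break;  // no code starts with these bits
        out += e.symbol;
        pos += static_cast<std::size_t>(e.length);
    }
    return out;
}
```

With the codes a = 0, b = 10, c = 110, d = 111 and a maximum length of 3, the table has 8 entries matching the one in the question, and decoding 010110111001000010000 gives back abcdaabaaabaaa.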