Huffman storing code in bits - c++

I have built the Huffman tree, but I have no idea how to store the codes as bits, because I don't know how
to handle the variable length.
I want to create a table that stores the Huffman codes as bits so that I can print the encoded result.
I cannot use STL containers like bitset.
I have tried this:
void traverse( string code = "")const
{
if( frequency == 0 ) return;
if ( left ) {
left->traverse( code + '0' );
right->traverse( code + '1' );
}
else {//leaf node
huffmanTable[ch] = code;
}
}
Can you suggest an algorithm to handle this?
I want each '0' and each '1' to take up exactly 1 bit.
Thanks in advance.

You'll need a buffer, a variable to track the size of the buffer in bytes, and a variable to track the number of valid bits in the buffer.
To store a bit:
1. Check if adding a bit will increase the number of bytes stored. If not, skip to step 4.
2. Is there room in the buffer to store an additional byte? If so, skip to step 4.
3. Reallocate a storage buffer a few bytes larger. Copy the existing data. Increase the variable holding the size of the buffer.
4. Compute the byte position and bit position at which the next bit will be stored. Set or clear that bit as appropriate.
5. Increment the variable holding the number of bits stored.
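The steps above could be sketched roughly like this, without any STL containers. The names BitBuffer, appendBit, getBit and freeBits are made up for illustration, and both the growth step of 16 bytes and the most-significant-bit-first order are arbitrary assumptions, not something the steps prescribe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Growable bit buffer (hypothetical names, not from any library).
struct BitBuffer {
    std::uint8_t *data = nullptr;  // storage, owned by this struct
    std::size_t capacity = 0;      // size of the storage in bytes
    std::size_t bitCount = 0;      // number of valid bits stored so far
};

// Appends one bit (0 or 1). Returns false if memory could not be allocated.
bool appendBit(BitBuffer &b, int bit) {
    assert(bit == 0 || bit == 1);
    std::size_t byteIndex = b.bitCount / 8;
    if (byteIndex >= b.capacity) {                  // steps 1-2: new byte needed, no room left
        std::size_t newCapacity = b.capacity + 16;  // step 3: grow by a few bytes (assumed step)
        void *p = std::realloc(b.data, newCapacity);  // realloc copies the existing data
        if (p == nullptr) return false;
        b.data = static_cast<std::uint8_t *>(p);
        b.capacity = newCapacity;
    }
    // Step 4: compute the bit position (most significant bit first) and set or clear it.
    unsigned shift = 7u - static_cast<unsigned>(b.bitCount % 8);
    std::uint8_t mask = static_cast<std::uint8_t>(1u << shift);
    if (bit) b.data[byteIndex] |= mask;
    else b.data[byteIndex] &= static_cast<std::uint8_t>(~mask);
    ++b.bitCount;                                   // step 5
    return true;
}

// Reads back the bit stored at the given index.
int getBit(const BitBuffer &b, std::size_t index) {
    assert(index < b.bitCount);
    return static_cast<int>((b.data[index / 8] >> (7 - index % 8)) & 1);
}

// Releases the storage and resets the buffer to empty.
void freeBits(BitBuffer &b) {
    std::free(b.data);
    b = BitBuffer{};
}
```

To encode, you would call appendBit once per code bit of each symbol. Only the first bitCount bits are meaningful; the unused tail of the last byte is never initialized.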

You can use a fixed-size structure to store the table and plain bits to store the encoded input:
struct TableEntry {
uint8_t size;
uint8_t code;
};
TableEntry huffmanTable[256];
// Note: a uint8_t code can only hold codes of up to 8 bits.
void traverse(uint8_t size, uint8_t code) const {
if( frequency == 0 ) return;
if ( left ) {
left->traverse(size+1, code << 1 );
right->traverse(size+1, (code << 1) | 1 );
}
else {//leaf node
huffmanTable[ch].code = code;
huffmanTable[ch].size = size;
}
}
For encoding, you can use the algorithm posted by David.

Basically I'd use one of two different approaches here, based on the maximum key length/depth of the tree:
If you've got a fixed length and it's shorter than your available integer data types (like long int), you can use the approach shown by perreal.
If you don't know the maximum depth and think you might be running out of space, I'd use std::vector<bool> as the code value. This is a special implementation of the vector using a single bit per value (essentially David's approach).

Related

Map arbitrary set of symbols to consecutive integers

Given a set of byte-representable symbols (e.g. characters, short strings, etc), is there a way to find a mapping from that set to a set of consecutive natural numbers that includes 0? For example, suppose there is the following unordered set of characters (not necessarily in any particular character set).
'a', '(', '🍌'
Is there a way to find a "hash" function of sorts that would map each symbol (e.g. by means of its byte representation) uniquely to one of the integers 0, 1, and 2, in any order? For example, 'a'=0, '('=1, '🍌'=2 is just as valid as 'a'=2, '('=0, '🍌'=1.
Why?
Because I am developing something for a memory-constrained (think on the order of kiB) embedded target that has a lot of fixed reverse-lookup tables, so something like std::unordered_map would be out of the question. The ETL equivalent etl::unordered_map would be getting there, but there's quite a bit of size overhead, and collisions can happen, so lookup timings could differ. A sparse lookup table would work, where the byte representation of the symbol would be the index, but that would be a lot of wasted space, and there are many different tables.
There's also the chance that the "hash" function may end up costing more than the above alternatives, but my curiosity is a strong driving force. Also, although both C and C++ are tagged, this question is specific to neither of them. I just happen to be using C/C++.
The normal way to do things like this, for example when coding a font for a custom display, is to map everything to a sorted, read-only look-up table array with indices 0 to 127 or 0 to 255, where symbols corresponding to the old ASCII table are mapped to their respective index. Other things like your banana symbol could be mapped beyond index 127.
So when you use FONT [97] or FONT ['a'], you end up with the symbol corresponding to 'a'. That way you can translate from ASCII strings to your custom table, or from your source editor font to the custom table.
Using any other data type such as a hash table sounds like muddy program design to me. Embedded systems should by their nature be deterministic, so overly complex data structures don't make sense most of the time. If you for some reason unknown must have the data unordered, then you should describe the reason why in detail, or otherwise you are surely asking an "XY question".
Yes, there is such a map. Just put all of them in an array of strings (sorted, if you want to binary-search it later), and write a function that searches for the word in the array and returns its index.
static char *strings[] = {
"word1", "word2", "hello", "world", NULL, /* to end the array */
};
int word_number(const char *word)
{
for(int i = 0; strings[i] != NULL; i++) {
if (strcmp(strings[i], word) == 0)
return i;
}
return -1; /* not found */
}
The cost of this in space terms is very low: when the compiler assigns the pointers, it can optimize string storage based on common suffixes (making one string overlap another if it is a suffix of it). And if you give the compiler an already sorted array of literals, you can use bsearch(), which is O(log(n)) in the number of elements in the table:
#include <stdlib.h> /* bsearch */
#include <string.h> /* strcmp */
static const char *strings[] = { /* this time sorted */
    "Hello",
    "rella", /* this and the next may be merged into positions on the same literal.
              * This can be controlled with a compiler option. */
    "umbrella",
    "world"
};
static const size_t strings_len = sizeof strings / sizeof strings[0];
/* bsearch() passes the key first, then a pointer to an array element */
static int string_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}
int word_number(const char *word)
{
    const char *const *result = (const char *const *)bsearch(word, strings, strings_len, sizeof *strings, string_cmp);
    return result ? (int)(result - strings) : -1;
}
If you want a function that gives you a number for any string, with a distinct number for each string, it's even easier. Start with zero. For each byte in the string, multiply your number by 256 (the number of byte values) and add the byte, then return the result once you have processed every char in the string. You will get a different number for each possible string. But I think this is not what you want.
super_long_integer char2number(const unsigned char *s)
{
super_long_integer result = 0;
int c;
while ((c = *s++) != 0) {
result *= 256;
result += c;
}
return result;
}
But that integer type must be able to hold numbers in the range [0...256^(maximum length of accepted string)], which is a very large number.

How can I know if the memory address I'm reading from is empty or not in C++?

So on an embedded system I'm reading and writing some integers in to the flash memory. I can read it with this function:
void read(uint32_t *buffer, uint32_t num_words){
uint32_t startAddress = FLASH_SECTOR_7;
for(uint32_t i = 0; i < num_words; i++){
buffer[i] = *(uint32_t *)(startAddress + (i*4));
}
}
then
uint32_t buf[10];
read(buf,10);
How can I know whether buf[5] is empty or has anything in it?
Right now, for the items that are empty I get something like 165 '¥' or 255 'ÿ'.
Is there a way to find that out?
You first need to define "empty", since you are using uint32_t. A good idea is to use the value 0xFFFFFFFF (4294967295 decimal) as the empty value, but you need to be sure that this value isn't used for anything else. Then you can test with if (buf[5] == 0xFFFFFFFF).
But if you're using the whole range of uint32_t, then there is no way to detect whether it's empty.
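A minimal sketch of the sentinel check, assuming that 0xFFFFFFFF is reserved as the empty value and never stored as real data. The names kEmptyWord, isEmpty and countUsed are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: this value marks an empty slot and is never written as real data.
constexpr std::uint32_t kEmptyWord = 0xFFFFFFFFu;

// True if the word holds the reserved "empty" value.
bool isEmpty(std::uint32_t word) {
    return word == kEmptyWord;
}

// Counts the words stored before the first empty slot.
std::size_t countUsed(const std::uint32_t *buffer, std::size_t numWords) {
    assert(buffer != nullptr || numWords == 0);
    std::size_t used = 0;
    while (used < numWords && !isEmpty(buffer[used])) ++used;
    return used;
}
```

After read(buf, 10) you could then call isEmpty(buf[5]), or countUsed(buf, 10) to find how many leading entries hold data.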
Another way is to use a structure and define an "empty" bit.
struct uint31_t
{
uint32_t empty : 0x01; // If set, then uint31_t.value is empty
uint32_t value : 0x1F;
};
Then you can check whether the empty bit is set; the downside is that you lose a whole bit of the value.
If your array is an array of pointers, you can check by comparing each element to nullptr. Otherwise you cannot, unless you initialize every element to the same known value and then check whether the value is still the same.

How do I represent binary numbers in C++ (used for Huffman encoder)?

I am writing my own Huffman encoder, and so far I have created the Huffman tree by using a minHeap to pop off the two lowest-frequency nodes, make a node that links to them, and then push the new node back on (lather, rinse, repeat until only one node is left).
So now I have created the tree, but I need to use this tree to assign codes to each character. My problem is I don't know how to store the binary representation of a number in C++. I remember reading that unsigned char is the standard for a byte, but I am unsure.
I know I have to recursively traverse the tree, and whenever I hit a leaf node I must assign the corresponding character whatever code currently represents the path.
Here is what I have so far:
void traverseFullTree(huffmanNode* root, unsigned char curCode, unsigned char &codeBook){
if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
codeBook[(int)root->character] = curCode;
}else{ //root has children, recurse into them with the currentCodes updated for right and left branch
traverseFullTree(root->leftChild, **CURRENT CODE SHIFTED WITH A 0**, codeBook );
traverseFullTree(root->rightChild, **CURRENT CODE SHIFTED WITH A 1**, codeBook);
}
return 0;
}
CodeBook is my array that has a place for the codes of up to 256 characters (for each possible character in ASCII), but I am only going to actually assign codes to values that appear in the tree.
I am not sure if this is the correct way to traverse my Huffman tree, but this is what immediately seems to work (though I haven't tested it). Also, how do I call the traverse function on the root of the whole tree with no zeros or ones (the very top of the tree)?
Should I be using a string instead and appending to the string either a zero or a 1?
Since computers are binary ... ALL numbers in C/C++ are already in binary format.
int a = 10;
The variable a is binary number.
What you want to look at is bit manipulation, operators such as & | << >>.
With the Huffman encoding, you would pack the data down into an array of bytes.
It's been a long time since I've written C, so this is an "off-the-cuff" pseudo-code...
Totally untested -- but should give you the right idea.
unsigned char buffer[1000]; // This is the buffer we are writing to -- calculate the size ahead of time or build it dynamically as you go with malloc/realloc.
void set_bit(unsigned int bit_position) {
    unsigned int byte = bit_position / 8; // index of the byte in the buffer
    unsigned int bit = bit_position % 8;  // index of the bit within that byte
    // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
    buffer[byte] |= 1u << bit;
}
void clear_bit(unsigned int bit_position) {
    unsigned int byte = bit_position / 8;
    unsigned int bit = bit_position % 8;
    // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
    buffer[byte] &= ~(1u << bit);
}
// and in your loop, you'd just call these functions with the bit number.
set_bit(0);
clear_bit(1);
Since each digit of curCode is only ever zero or one, BitSet might suit your needs. It is convenient and memory-saving. See: http://www.sgi.com/tech/stl/bitset.html
Only a little change to your code:
void traverseFullTree(huffmanNode* root, unsigned char curCode, BitSet<N> &codeBook){
if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
codeBook[(int)root->character] = curCode;
}else{ //root has children, recurse into them with the currentCodes updated for right and left branch
traverseFullTree(root->leftChild, **CURRENT CODE SHIFTED WITH A 0**, codeBook );
traverseFullTree(root->rightChild, **CURRENT CODE SHIFTED WITH A 1**, codeBook);
}
return 0;
}
how to store the binary representation of a number in C++
You can simply use bitsets
#include <iostream>
#include <bitset>
int main() {
int a = 42;
std::bitset<(sizeof(int) * 8)> bs(a);
std::cout << bs.to_string() << "\n";
std::cout << bs.to_ulong() << "\n";
return (0);
}
as you can see they also provide methods for conversions to other types, and the handy [] operator.
Please don't use a string.
You can represent the codebook as two arrays of integers, one with the bit-lengths of the codes, one with the codes themselves. There is one issue with that: what if a code is longer than an integer? The solution is to just not make that happen. Having a short-ish maximum codelength (say 15) is a trick used in most practical uses of Huffman coding, for various reasons.
I recommend using canonical Huffman codes, and that slightly simplifies your tree traversal: you'd only need the lengths, so you don't have to keep track of the current code. With canonical Huffman codes, you can generate the codes easily from the lengths.
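As a rough sketch of generating canonical codes from the lengths: count how many codes there are of each length, compute the first code for each length, then hand out consecutive codes in symbol order. The function name assignCanonicalCodes and the limit kMaxLen = 15 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

const int kMaxLen = 15;  // assumed maximum code length

// Fills codes[] from lengths[]; a length of 0 means the symbol is unused.
void assignCanonicalCodes(const std::uint8_t *lengths, std::uint16_t *codes, int symbolCount) {
    // How many symbols use each code length.
    int countPerLength[kMaxLen + 1] = {0};
    for (int s = 0; s < symbolCount; ++s) {
        assert(lengths[s] <= kMaxLen);
        ++countPerLength[lengths[s]];
    }
    countPerLength[0] = 0;  // unused symbols do not consume any codes

    // Smallest code value for each length.
    std::uint16_t nextCode[kMaxLen + 1] = {0};
    unsigned code = 0;
    for (int len = 1; len <= kMaxLen; ++len) {
        code = (code + countPerLength[len - 1]) << 1;
        nextCode[len] = static_cast<std::uint16_t>(code);
    }

    // Symbols of the same length get consecutive codes in symbol order.
    for (int s = 0; s < symbolCount; ++s) {
        codes[s] = (lengths[s] != 0) ? nextCode[lengths[s]]++ : 0;
    }
}
```

For example, the lengths 3, 3, 3, 3, 3, 2, 4, 4 give the codes 010, 011, 100, 101, 110, 00, 1110, 1111. The decoder only needs the lengths to rebuild the same table.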
If you are using canonical codes, you can let the codes be wider than integers, because the high bits would be zero anyway. However, it is still a good idea to limit the lengths. Having a short maximum length (well not too short, that would limit compression, but say around 16) enables you to use the simplest table-based decoding method, a simple single-level table.
Limiting code lengths to 25 or less also slightly simplifies encoding, it lets you use a 32bit integer as a "buffer" and empty it byte by byte, without any special handling of the case where the buffer holds fewer than 8 bits but encoding the current symbol would overflow it (because that case is entirely avoided - in the worst case there would be 7 bits in the buffer and you try to encode a 25-bit symbol, which works just fine).
Something like this (not tested in any way)
uint32_t buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbolCount; i++)
{
int s = symbols[i];
buffer <<= lengths[s]; // make room for the bits
bufbits += lengths[s]; // buffer got longer
buffer |= values[s]; // put in the bits corresponding to the symbol
while (bufbits >= 8) // as long as there is at least a byte in the buffer
{
bufbits -= 8; // forget it's there
writeByte((buffer >> bufbits) & 0xFF); // and save it
}
}
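One thing the loop above leaves open is that up to 7 bits can remain in the buffer after the last symbol. A possible way to handle that, sketched with made-up names (BitPacker, putByte, putBits, flushBits) and a caller-provided output array standing in for writeByte, is to pad the final byte with zero bits:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct BitPacker {
    std::uint8_t *out;      // caller-provided output array
    std::size_t capacity;   // size of that array in bytes
    std::size_t written;    // bytes emitted so far
    std::uint32_t buffer;   // pending bits sit at the low end
    int bufbits;            // number of pending bits (below 8 between calls)
};

// Stores one finished byte in the output array.
void putByte(BitPacker &p, std::uint8_t byte) {
    assert(p.written < p.capacity);
    p.out[p.written++] = byte;
}

// Appends the low `length` bits of `value`, most significant bit first.
void putBits(BitPacker &p, std::uint32_t value, int length) {
    assert(length >= 1 && length <= 25);  // 7 pending + 25 new bits still fit in 32
    assert((value >> length) == 0);       // no stray bits above the code
    p.buffer = (p.buffer << length) | value;
    p.bufbits += length;
    while (p.bufbits >= 8) {
        p.bufbits -= 8;
        putByte(p, static_cast<std::uint8_t>((p.buffer >> p.bufbits) & 0xFF));
    }
}

// Pads the final partial byte with zero bits and emits it.
void flushBits(BitPacker &p) {
    if (p.bufbits > 0) {
        putByte(p, static_cast<std::uint8_t>((p.buffer << (8 - p.bufbits)) & 0xFF));
        p.bufbits = 0;
    }
}
```

The decoder then needs some way to know where the real data ends, for example a stored symbol count or an end-of-stream symbol, because the padding bits could otherwise be read as a code.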

How to return a byte array of unknown size from method

I have a class that parses some incoming serial data. After the parsing a method should return a byte array with some of the parsed data. The incoming data is of unknown length so my return array will always be different.
So far my method allocates an array bigger than what I need to return and fills it up with my data bytes and I keep an index so that I know how much data I put in the byte array. My problem is that I don't know how to return this from an instance method.
void HEXParser::getParsedData()
{
byte data[HEX_PARSER_MAX_DATA_SIZE];
int dataIndex = 0;
// fetch data, do stuff
// etc, etc...
data[dataIndex] = incomingByte;
_dataIndex++;
// At the very end of the method I know that all the bytes I need to return
// are stored in data, and the data size is dataIndex - 1
}
In other languages this is trivial to do, but I'm not very proficient in C++ and I'm completely stuck.
Thanks!
You are working on a microcontroller with just a little bit of RAM. You need to carefully evaluate if "unknown length" also implies unbounded length. You cannot deal with unbounded length. Your best approach for reliable operation is to use fixed buffers setup for the maximum size.
A common pattern for this type of action is to pass the buffer to the function, and return what has been used. Your function would then look much like many of the C character string functions:
const size_t HEX_PARSER_MAX_DATA_SIZE = 20;
byte data[HEX_PARSER_MAX_DATA_SIZE];
// the caller owns the buffer; the return value says how many bytes were filled in
size_t n = oHexP.getParsedData(data, HEX_PARSER_MAX_DATA_SIZE);
size_t HEXParser::getParsedData(byte* data, size_t sizeData)
{
    size_t dataIndex = 0;
    // fetch data, do stuff
    // etc, etc...
    if (dataIndex >= sizeData) {
        // buffer is full: stop before writing past the end
        return dataIndex;
    }
    data[dataIndex] = incomingByte;
    dataIndex++;
    // At the very end of the method all the bytes to return
    // are stored in data, and the data size is dataIndex
    return dataIndex;
}

What's the proper structure for storing a big array which will be frequently updated

I'm looking for a proper structure for a big array which will be frequently updated. Thanks for your help!
Here's the background:
I want to draw a continuous curve to represent a sound wave over a certain time period. For accuracy, the array length will be nearly 44100 (the CD format). And I just want to represent the last second of the wave, so the array will be updated very frequently: every 1/44100 sec, the first element will be eliminated and a new last element will be inserted into the array.
To avoid frequent "malloc/realloc/new", my current solution is a circular queue with a fixed size of 44100, but somehow I don't feel this is the most proper solution; if I want to dynamically resize the queue, it will be costly.
This kind of situation should be quite common, so I think there may be some good pattern for this issue.
Thanks guys!
I assume you always have a fixed number of items in the array, so I'd just use a ring buffer in any case. (I'm not sure whether that's what you mean by "Circular Queue", but I assume you'd use a dynamic length? If so, why? Is there no specific absolute and practical maximum?) That is, a static array with a variable entry point as its start:
const unsigned int buffer_length = 500000;
float *buffer = new float[buffer_length];
unsigned int buffer_write = 0;
// append a value...
buffer[buffer_write] = my_value;
// ...and move the write/end position:
buffer_write = (buffer_write + 1) % buffer_length;
To output/use the values, you can use the following formula for index of the first entry to read:
unsigned int start_position = (buffer_length + buffer_write - length_to_read) % buffer_length;
To iterate, you just add position after position, again using modulo to jump back to the beginning of the array.
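Put together, writing and reading could look roughly like this. The names RingBuffer, push and recent are made up for illustration, and kLength is kept tiny here; for one second of CD audio it would be 44100.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kLength = 8;  // illustration only; 44100 for one second of CD audio

struct RingBuffer {
    float samples[kLength];
    std::size_t write;  // index where the next sample will be stored
    std::size_t count;  // number of valid samples, at most kLength
};

// Stores a sample, overwriting the oldest one once the buffer is full.
void push(RingBuffer &rb, float value) {
    rb.samples[rb.write] = value;
    rb.write = (rb.write + 1) % kLength;
    if (rb.count < kLength) ++rb.count;
}

// Returns the i-th of the n most recent samples (i = 0 is the oldest of those n).
float recent(const RingBuffer &rb, std::size_t n, std::size_t i) {
    assert(n <= rb.count && i < n);
    std::size_t start = (kLength + rb.write - n) % kLength;  // same formula as above
    return rb.samples[(start + i) % kLength];
}
```

Nothing is ever moved or reallocated: each new sample costs one store and one modulo, and drawing the curve is a loop over recent(rb, n, i) for i from 0 to n - 1.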