Poor man's Huffman compression

I'm trying to better understand how a Huffman decoder works. I've got a code table, but I'm having a hard time understanding how the decoder would work because of ambiguity in the binary string.
(I'm learning this in prep for my final year at uni.)
My table:
Data, Hcode
0, 0
1, 1
2, 10
3, 11
17, 100
18, 101
19, 110
29, 111
If I have a Huffman code string like 010011, I can return many different combinations of data, so how can I discriminate?
I understand the Huffman logic in its BST representation: you follow a path to a given leaf, and that path is the code for that leaf's value (0-255, ASCII). But I still don't know how you can discriminate between returning data 0,1,0 and data 0,17.
Do I really have to enforce 2-bit codes on data 0 and 1 (00 and 01)?
I hope I've explained this as best I can XD
If you're wondering how I generated the table, you're gonna kill me, because I didn't use tree logic to generate it. Although I sorted the data (random bytes) by frequency, I generated the Hcodes by converting each element's position number into binary (hence why I called this post Poor Man's Huffman).
Many thanks for any advice.

The code table is wrong. Huffman codes are supposed to be prefix-free: no code may be the start of another code. This is necessary in order to decode them afterwards without ambiguity.
If you used a binary tree to create the codes, this would automatically ensure the "prefix-freeness". See: http://en.wikipedia.org/wiki/Huffman_coding
And now, I am going to kill you ... ;)
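For what it's worth, here is a minimal Python sketch of the tree-based construction, plus a decoder that relies on the codes being prefix-free. The frequencies are invented for illustration (the question doesn't give any), and the function names are my own:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix-free code {symbol: bitstring} from {symbol: frequency}.

    Assumes symbols are not tuples, since tuples mark internal tree nodes.
    """
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: the path is the code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

def decode(bits, codes):
    """Decode a '0'/'1' string; unambiguous only because codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return out

# Hypothetical frequencies for the eight symbols in the question
freqs = {0: 40, 1: 25, 2: 12, 3: 9, 17: 6, 18: 4, 19: 3, 29: 1}
codes = huffman_codes(freqs)
message = [0, 17, 1, 0, 29]
encoded = "".join(codes[s] for s in message)
decoded = decode(encoded, codes)
```

Because every symbol sits at a leaf, the decoder can emit a symbol as soon as the bits read so far match a code; no longer code can start with the same bits.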

Not only is the code table wrong, the lengths of the codes are also wrong. If you have two one-bit codes, you have already used up all of the code space, and can have no other codes. What you have shown is not only not a Huffman code and not a prefix code -- it is in fact not a code at all.
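The "used up all of the code space" point can be checked with the Kraft sum: add up 2^-length over all codes, and a prefix code can only exist if the total is at most 1. A small Python sketch (the helper names are made up for this example):

```python
from fractions import Fraction

def kraft_sum(codes):
    """Sum of 2^-len(code); must be <= 1 for a prefix code to exist."""
    return sum(Fraction(1, 2 ** len(c)) for c in codes)

def prefix_clashes(codes):
    """Return pairs (a, b) where code a is a proper prefix of code b."""
    return [(a, b) for a in codes for b in codes if a != b and b.startswith(a)]

# The table from the question: symbol -> code string
table = {0: "0", 1: "1", 2: "10", 3: "11",
         17: "100", 18: "101", 19: "110", 29: "111"}
codes = list(table.values())

total = kraft_sum(codes)          # 1/2 + 1/2 + 1/4 + 1/4 + 4 * 1/8
clashes = prefix_clashes(codes)   # e.g. ("1", "10"): "1" starts "10"
```

For this table the two one-bit codes alone already sum to 1, so every longer code pushes the total over the limit.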

Related

How to efficiently decompress huffman coded file

I've found a lot of questions asking this but some of the explanations were very difficult to understand and I couldn't quite grasp the concept of how to efficiently decompress the file.
I have found these related questions:
Huffman code with lookup table
How to decode huffman code quickly?
But I fail to understand the explanations. I know how to encode and decode with a Huffman tree in the usual way. Right now, my compression program can write any of the following information to file:
symbol
Huffman code (unsigned long)
Huffman code length
What I plan to do is take a text file, split it into small text files, and compress each one individually. To decompress, I would send all the small compressed files, with their respective lookup tables (I don't know how to do this part), to an Nvidia GPU and try to decompress them in parallel using some sort of lookup table.
I have 3 questions:
What information should I write to file in the header to construct the look up table?
How do I recreate this table from file?
How do I use it to decode the huffman encoded file quickly?
Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.
You are only talking about Huffman coding, which means you would get only a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".
As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and the number of bits for that code (less than or equal to nine), or, if the provided nine bits are a prefix of a longer code, a pointer to another table to resolve the rest of the code and the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine. In fact, there are 2^(9-n) entries for an n-bit code.
So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.
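To make the idea concrete, here is a simplified single-level version in Python. It is not zlib's code: it assumes every code fits in the table width, so there are no secondary tables, it works on a string of '0'/'1' characters rather than packed bits, and the names build_table and decode are invented for this sketch.

```python
def build_table(codes, width):
    """Map every width-bit value to (symbol, code_length).

    codes: {symbol: bitstring}, prefix-free, each code at most `width` bits.
    An n-bit code fills 2**(width - n) consecutive entries.
    """
    table = [None] * (1 << width)
    for sym, code in codes.items():
        n = len(code)
        first = int(code, 2) << (width - n)   # code followed by all zeros
        for i in range(first, first + (1 << (width - n))):
            table[i] = (sym, n)
    return table

def decode(bits, table, width, count):
    """Decode `count` symbols from a valid stream of '0'/'1' characters."""
    padded = bits + "0" * width   # padding so the last peek never runs short
    pos, out = 0, []
    for _ in range(count):
        # Peek `width` bits, look up the entry, consume only the code's bits
        sym, n = table[int(padded[pos:pos + width], 2)]
        out.append(sym)
        pos += n
    return out

codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
width = 3
table = build_table(codes, width)
encoded = "".join(codes[s] for s in "abacd")
decoded = decode(encoded, table, width, 5)
```

Each symbol costs one table lookup instead of one tree step per bit, which is where the speed comes from. Handling codes longer than the table width would need the secondary tables described above.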

Artificial Neural Network with large inputs & outputs

I've been following Dave Miller's ANN C++ Tutorial, and I've been having some problems getting it to function as expected.
You can view the code I'm working with here. It's an XCode project, but includes the main.cpp and data set file.
Previously, this program would only give outputs between -1 and 1, I presume due to the use of the tanh function. I've manipulated the data so that I can feed in my much larger values and still get valid outputs. I did this simply by multiplying the input values by 0.0001 and multiplying the output values by 10000.
The training data I'm using is the included CSV file. The last column is the expected output, the rest are inputs. Am I using the wrong mathematical function for these data?
Would you say that this is actually learning? This whole thing has stressed me out so much; I understand the theory behind ANNs but just can't implement one from scratch myself.
The net recent average error definitely gets smaller and smaller, which to me would say it is learning.
I'm sorry if I haven't explained myself very well. I'm very new to ANNs and this whole thing is very confusing to me. My university lecturers are useless when it comes to the practical side; they only teach us the theory.
I've been playing around with the eta and alpha values, along with the number of hidden layers.
You explained yourself quite well. If the net recent average error is getting lower and lower, it probably means that the network is actually learning, but here is my suggestion for how to be completely sure.
Take your CSV file and split it into two files: one should hold about 10% of the data and the other all the rest.
Start with an untrained network, run the 10% file through the net, and for each line save the difference between the actual output and the expected result.
Then train the network with only the 90% file, and finally run the 10% file through the net again and compare the new differences with the ones from the first run.
You should find that the new results are much closer to the expected values than the first time, and this would be solid evidence that your network is learning.
Does this make sense? If not, please share some code or send me a link to the exercise you are running, and I will try to explain it in code.
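In the meantime, the bookkeeping for that check could look roughly like this in Python. This is only a sketch: `predict` is a hypothetical stand-in for whatever feed-forward call your network exposes, the shuffle before splitting is my own addition, and the toy rows below are not your CSV data.

```python
import random

def split_holdout(rows, fraction=0.1, seed=0):
    """Shuffle a copy of rows and return (holdout, training)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]

def mean_abs_error(predict, rows):
    """Average |predicted - expected|; the last column of each row is the target."""
    return sum(abs(predict(r[:-1]) - r[-1]) for r in rows) / len(rows)

# Toy data: one input column, target = 2 * input
rows = [[float(x), 2.0 * x] for x in range(20)]
holdout, training = split_holdout(rows)

# Stand-ins for the network before and after training (assumptions)
untrained = lambda inputs: 0.0
trained = lambda inputs: 2.0 * inputs[0]

error_before = mean_abs_error(untrained, holdout)
error_after = mean_abs_error(trained, holdout)
```

The important part is that the hold-out rows are never used for training, so a drop from error_before to error_after reflects what the network has learned rather than what it has memorised.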

Storing table of codes in a compressed file after Huffman compression and building tree for decompression from this table

I was writing a Huffman compression program in C++, but I ran into a problem with the structure of the compressed file. The new file needs to store some structure that lets me decode it. I decided to write a table of codes at the beginning of the file and then build a tree from this table to decode the rest of the content, but I do not know the best way to store the table (I mean I do not know what structure the table should have; I know how to write things in binary mode) or how to build the tree from this table. Sorry for my English. Thank you in advance.
You do not need to transmit the probabilities or the tree. All the decoder needs is the number of bits assigned to each symbol, and a canonical way to assign the bit values to each symbol that is agreed to by both the encoder and decoder. See Canonical Huffman Code.
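As a rough illustration of the canonical assignment, here is a Python sketch (the function name and the tie-breaking by symbol order are my choices; any rule works as long as encoder and decoder agree on it). With this scheme the header only needs one code length per symbol.

```python
def canonical_codes(lengths):
    """Assign canonical codes from {symbol: code_length_in_bits}.

    Symbols are sorted by (length, symbol). Each code is the previous code
    plus one, shifted left whenever the length grows. Length 0 means the
    symbol is unused and gets no code.
    """
    used = [(sym, n) for sym, n in lengths.items() if n > 0]
    codes = {}
    code = 0
    prev_len = 0
    for sym, n in sorted(used, key=lambda item: (item[1], item[0])):
        code <<= (n - prev_len)
        codes[sym] = format(code, "0{}b".format(n))
        code += 1
        prev_len = n
    return codes

# Lengths as they might be read back from a file header (made-up example)
lengths = {"a": 1, "b": 2, "c": 3, "d": 3, "e": 0}
codes = canonical_codes(lengths)
```

The encoder runs the normal Huffman algorithm only to obtain the lengths, then both sides call the same function to get identical codes.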
You could try writing a header in the compressed file with the sequence of characters ordered by their probability of appearing in the text, or writing the letters followed by their probabilities. With that, you use the same process to build the tree for compressing and decompressing. As for how to build the tree itself, I suppose you'll have to do a little research and come back if you have problems.

Not getting expected results from complex view

This is a somewhat involved question as the data I am working with is a little large.
I have the following document structure: https://gist.github.com/gaigepr/5b28a7c67ced0cd71e4e
and the following map function: https://gist.github.com/gaigepr/a721bcc8ef6f681f3807
A little description: this function goes through the example document to collect a list of all combinations of characters from 1 to 5 and supplies them with a 1 or 0 to indicate a win or a loss for that particular combo of characters. This is accomplished by getting the powerset of the team and ignoring the empty set when emitting the array key and integer to indicate a win or loss.
The problem I am having is with reducing the data. My goal is to get the win rate of a particular group of characters in the game this data is from. The view takes a key formatted as such: [1] and should output the win rate and games played by that pair of characters.
So my reduce function should be something like this:
However, when I do this, I do not actually get all the games played by that pair in the reduction. In my test database, I have 96 games played by the above pair [1, 18], but when I run map and reduce on the view with that key, I get that there were only 2 games played and null for the win rate.
A note: This seems to only happen inconsistently. With my data, when I query with the key [1, 18] I get accurate results.
I am a little bit at a loss for what to do to debug this and would appreciate some help. I am happy to add more details, gists, even pictures of the futon output if that would be helpful.
I do not have a lot of reason for this yet, or confirmation, but it seems that the data passed to the reduce function is not formatted how I expect it to be but I am not sure why that is.

Simple Curve Fitting Implementation in C++ (SVD Least Squares Fit or similar)

I have been scouring the internet for quite some time now, trying to find a simple, intuitive, and fast way to approximate a 2nd degree polynomial using 5 data points.
I am using VC++ 2008.
I have come across many libraries, such as cminipack, cmpfit, lmfit, etc... but none of them seem very intuitive and I have had a hard time implementing the code.
Ultimately I have a set of discrete values put in a 1D array, and I am trying to find the 'virtual max point' by curve fitting the data and then finding the max point of that data at a non-integer value (where an integer value would be the highest accuracy just looking at the array).
Anyway, if someone has done something similar to this, and can point me to the package they used, and maybe a simple implementation of the package, that would be great!
I am happy to provide some test data and graphs to show you what kind of stuff I'm working with, but I feel my request is pretty straightforward. Thank you so much.
EDIT: Here is the code I wrote which works!
http://pastebin.com/tUvKmGPn
change size to change how many inputs are used
0 0
1 1
2 4
4 16
7 49
a: 1 b: 0 c: 0
Press any key to continue . . .
Thanks for the help!
Assuming that you want to fit a standard parabola of the form
y = ax^2 + bx + c
to your 5 data points, then all you will need is to solve a 3 x 3 matrix equation. Take a look at this example http://www.personal.psu.edu/jhm/f90/lectures/lsq2.html - it works through the same problem you seem to be describing (only using more data points). If you have a basic grasp of calculus and are able to invert a 3x3 matrix (or something nicer numerically - which I am guessing you do given you refer specifically to SVD in your question title) then this example will clarify what you need to do.
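As a rough sketch of that 3 x 3 solve, here it is in Python rather than C++, to keep it short. It uses the normal equations with plain Gaussian elimination, not SVD, and the function names are invented for this example. The vertex formula x = -b / (2a) then gives the non-integer position of the maximum.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Power sums S[k] = sum(x^k) and moment sums T[k] = sum(x^k * y)
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c)
    M = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    a, b, c = (M[i][3] / M[i][i] for i in range(3))
    return a, b, c

def vertex_x(a, b):
    """x position of the parabola's extremum (a maximum when a < 0)."""
    return -b / (2 * a)

# The five points from the question's edit; the fit should come out
# close to a = 1, b = 0, c = 0
a, b, c = fit_parabola([0, 1, 2, 4, 7], [0, 1, 4, 16, 49])

# A made-up peak at x = 2.3 sampled at integer positions
peak_xs = [0, 1, 2, 3, 4]
peak_ys = [5 - (x - 2.3) ** 2 for x in peak_xs]
pa, pb, pc = fit_parabola(peak_xs, peak_ys)
peak = vertex_x(pa, pb)
```

The fit needs at least three distinct x values, otherwise the system is singular and the division fails. For badly scaled data the normal equations can lose precision, which is where the "something nicer numerically" such as QR or SVD comes in.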
Look at this Wikipedia page on Polynomial Regression