Measure compression of Huffman Algorithm

I'm revamping my programming skills and have implemented the Huffman algorithm. For now, I'm only considering [a-z], with no special characters. The probability values for a-z are taken from Wikipedia.
When I run it, I get roughly 2x compression for random paragraphs.
For this calculation, though, I assume the original letters require 8 bits each (ASCII).
But if I think about it, to represent 26 items I only need 5 bits. If I calculate on that basis, the compression factor drops to almost 1.1.
So my question is: how is the compression factor determined in real-world applications?
Second question: if I write an encoder/decoder that uses 5 bits to represent a-z (say a=0, b=1, etc.), is this also considered a valid "compression" algorithm?

You have essentially the right answer, which is that you can't expect a lot of compression if all that you're working with is the letter frequencies of the English language.
The correct way to calculate the gain resulting from knowledge of the letter frequencies is to compare the entropy of a 26-symbol alphabet with equal probabilities to the entropy of the letters in English.
(I wish stackoverflow allowed TeX equations like math.stackexchange.com does. Then I could write decent equations here. Oh well.)
The key formula is -p log(p), where p is the probability of that symbol and the log is in base 2 to get the answer in bits. You calculate this for each symbol and then sum over all symbols.
Then in an ideal arithmetic coding scheme, an equiprobable set of 26 symbols would be coded in 4.70 bits per symbol. For the distribution in English (using the probabilities from the Wikipedia article), we get 4.18 bits per symbol, a reduction of only about 11%.
So that's all the frequency bias by itself can buy you. (It buys you a lot more in Scrabble scores, but I digress.)
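The entropy calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `entropy_bits` is made up for this example, and the letter percentages are the commonly quoted approximate values for English, which I am assuming are close to the Wikipedia table the question used. The results should land near the 4.70 and 4.18 bits per symbol quoted above.

```python
import math

def entropy_bits(weights):
    """Shannon entropy in bits per symbol: sum of -p*log2(p) over all symbols.

    `weights` may be unnormalised (e.g. percentages); they are normalised here.
    """
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

# Assumption: approximate English letter frequencies in percent, a..z.
ENGLISH_PERCENT = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]

h_uniform = entropy_bits([1] * 26)          # 26 equally likely letters
h_english = entropy_bits(ENGLISH_PERCENT)   # letters weighted as in English
reduction = 1 - h_english / h_uniform       # fractional gain from the bias

print(f"uniform: {h_uniform:.2f} bits, English: {h_english:.2f} bits, "
      f"reduction: {reduction:.0%}")
```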
We can also look at the same thing in the approximate space of Huffman coding, where each code is an integral number of bits. In this case you would not assume five bits per letter (with six codes wasted). Applying Huffman coding to 26 symbols of equal probability gives six codes that are four bits in length and 20 codes that are five bits in length. This results in 4.77 bits per letter on average. Huffman coding using the letter frequencies occurring in English gives an average of 4.21 bits per letter. That is a reduction of 12%, which is about the same as the entropy calculation.
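The code lengths for the equal-probability case can be checked with a small Huffman sketch. This is a minimal illustration, not the asker's implementation; `huffman_code_lengths` is a hypothetical helper that only tracks code lengths rather than building the actual bit patterns.

```python
import heapq
from collections import Counter

def huffman_code_lengths(weights):
    """Return the Huffman code length of each symbol (same order as `weights`)."""
    # Heap entries: (subtree weight, unique tie-breaker, symbol indices in subtree)
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    tie = len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths

lengths = huffman_code_lengths([1] * 26)   # 26 equally likely symbols
length_histogram = Counter(lengths)        # how many codes of each length
average_bits = sum(lengths) / len(lengths)

print(dict(length_histogram), round(average_bits, 2))
```

Passing real letter weights instead of `[1] * 26` (and weighting the average by probability) would give the English figure in the same way.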
There are many ways that real compressors do much better than this. First, they code what is actually in the file, using the frequencies of what's there instead of what they are across the English language. This makes it language independent, optimizes for the actual contents, and doesn't even code symbols that are not present. Second, you can break up the input into pieces and make a new code for each. If the pieces are big enough, then the overhead of transmitting a new code is small, and the gain is usually larger to optimize on a smaller chunk. Third, you can look for higher order effects. Instead of the frequency of single letters, you can take into account the previous letter and look at the probability of the next letter given its predecessor. Now you have 26^2 probabilities (for just letters) to track. These can also be generated dynamically for the actual data at hand, but now you need more data to get a gain, more memory, and more time. You can go to third order, fourth order, etc. for even greater compression performance at the cost of memory and time.
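To illustrate the "higher order" point, here is a rough sketch that compares the order-0 entropy of a text with its order-1 conditional entropy (the cost of the next symbol given its predecessor). The function names are invented for this example, and measuring on the data itself ignores the cost of transmitting the model, which the answer warns about.

```python
import math
from collections import Counter

def order0_entropy(text):
    """Bits per symbol when each symbol is coded independently."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_entropy(text):
    """Bits per symbol when the previous symbol is known: H(next | previous)."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    n_pairs = len(text) - 1
    h = 0.0
    for (prev, _nxt), c in pair_counts.items():
        p_pair = c / n_pairs                   # P(previous, next)
        p_next_given_prev = c / prev_counts[prev]
        h -= p_pair * math.log2(p_next_given_prev)
    return h

sample = "ab" * 50   # toy input: perfectly predictable from the previous symbol
print(order0_entropy(sample), order1_entropy(sample))
```

On this toy input the single-symbol statistics look like a fair coin, while the order-1 model sees no uncertainty at all. Real text falls somewhere in between.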
There are other schemes to pre-process the data by, for example, doing run-length encoding, looking for matching strings, applying block transforms, tokenizing XML, delta-coding audio or images, etc., etc. to further expose redundancies for an entropy coder to then take advantage of. I alluded to arithmetic coding, which can be used instead of Huffman to code very probable symbols in less than a bit and all symbols to fractional bit accuracy for better performance in the entropy step.
Back to your question of what constitutes compression: you can begin with any starting point you like, e.g. one eight-bit byte per letter, make assertions about your input, e.g. all lower case letters (accepting that if the assertion is false, the scheme fails), and then assess the compression effectiveness, so long as you use all of the same assumptions when comparing two different compression schemes. You must be careful that anything that is data dependent is also counted as part of the compressed data. E.g. a custom Huffman code derived from a block of data must be sent with that block of data.

If you ran an unrestricted Huffman-coding compression on the same text you'd get the same result, so I think it's reasonable to say that you're getting 2x compression over an ASCII encoding of the same text. I would be more inclined to say that your program is getting the expected compression, but currently has the limitation that it can't handle arbitrary input, and that other, simpler compression schemes also get compression over ASCII if that limitation is in place.
Why not extend your algorithm to handle arbitrary byte values? That way it's easier to make a true head-to-head comparison.

It's not 5 bits for 26 characters; it's log(26) / log(2) = 4.7 bits. This is the maximum entropy, but you need to know the specific entropy of the language. For the German language it's 4.0629. Once you know that, you can use the formula R = Hmax - H to get the redundancy. Look here: http://de.wikipedia.org/wiki/Entropie_(Informationstheorie)
http://en.wikipedia.org/wiki/Information_theory#Entropy

Related

Choosing Symbols in an efficient way, in Arithmetic Code Algorithm, C++

I'm trying to implement Arithmetic Code Algorithm to compress binary images (JPG images transformed to binary base using opencv). The problem is that I've to save in the compressed file, the encoded string and the symbols which I used to generate this encoded string and also their frequencies, so I can be able to decode it. The symbols take a lot of space even if I'm transforming them to ascii and if I tried to take less number of characters for each symbol the size of the encoded string becomes bigger. So I wonder if there's an efficient way to save symbols in the compressed file with minimum possible size. And I want to know the most efficient way to choose the symbols from the original file.
Thanks in advance :)

325,592,005 bytes is 310 megabytes. You managed to compress this image into 2.8+6.1=8.9 megabytes, so you decreased the size by 97%. It's a good result and I wouldn't worry here. Besides, 6.1 megabytes of 64-bit symbols means that you have around 800K of them. That is much less than the maximum possible number of symbols, i.e. 2^64. It is again a good result.
As to your question about using multiple compression algorithms: firstly, you have to know that in the case of lossless compression the optimal number of bits per symbol is equal to the entropy, and arithmetic encoding is close to optimal (see this, this or this). It means that there is not much sense in using more than one algorithm one after another if one of them is arithmetic encoding.
As to arithmetic coding vs Huffman codes: the latter is actually a special case of the former. And as far as I know, arithmetic encoding is always at least as good as Huffman codes.
It is also worth adding one thing. If you can consider lossy compression, there is actually no limit on the compression rate. In other words, you can compress data as much as you want as long as the quality loss is still acceptable to you. However, even in this case using multiple compression algorithms is not required. One is enough.

Is there an algorithm for "perfect" compression?

Let me clarify: I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material; I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to its absolute maximum compressed state, as determined by its Shannon entropy.
I believe I have heard some things about Huffman coding being in some sense optimal, so I believe that this compression scheme might be based off that, but here is my issue:
Consider the bit-strings: a = "101010101010", b = "110100011010".
Using plain Shannon entropy, these bit strings should have the exact same entropy when we consider the bit strings as simply symbols of 0's and 1's, but this approach is flawed, because we can intuitively see that bitstring a has less entropy than bitstring b because it is simply a pattern of repeated 10's. With this in mind, we could get a better idea of the actual entropy of the source by calculating the Shannon entropy for the composite symbols 00, 10, 01, and 11.
This is just my understanding, and I could be totally off base, but from what I understand, for an ergodic source of length n to be truly random, all n-length groups of symbols must be equally likely.
I suppose to be more specific than the question in the title, I have three main questions:
1) Does Huffman encoding using single bits as symbols compress a bitstring like a optimally, even with an obvious pattern that occurs when we analyze the string at the level of 2-bit symbols?
2) If not, could one optimally compress a source by cycling through different "levels" (sorry if I'm butchering the terminology here) of Huffman coding until the best compression rate is found?
3) Could going through different "rounds" of Huffman coding further increase the compression rate in some instances? (e.g. first go through Huffman coding with symbols that are 5 bits long, then go through Huffman coding with symbols that are 4 bits long: huff_4bits(huff_5bits(bitstring)) )
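The observation about a and b can be checked numerically. The sketch below (with a made-up helper name, `block_entropy`) computes the empirical Shannon entropy of each string when it is cut into non-overlapping blocks of a given size. With single bits, both strings come out at 1 bit per symbol; with 2-bit blocks, a drops to 0 while b stays near 1.92 bits per block. Strings this short only illustrate the idea, they are not a meaningful estimate of a source's entropy.

```python
import math
from collections import Counter

def block_entropy(bits, block_size):
    """Empirical Shannon entropy, in bits per block, of non-overlapping blocks."""
    blocks = [bits[i:i + block_size] for i in range(0, len(bits), block_size)]
    counts = Counter(blocks)
    n = len(blocks)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

a = "101010101010"
b = "110100011010"

for name, s in (("a", a), ("b", b)):
    print(name, block_entropy(s, 1), round(block_entropy(s, 2), 3))
```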

As stated by Mark, the general answer is "no", due to Kolmogorov complexity. Let me expand a bit on that.
Compression is basically two steps:
1) Model
2) Entropy
The role of the model is to "guess" the next bytes or fields to come.
The model can take any form, and there is no limit to its effectiveness.
A trivial example is a random number generator function: from an external perspective, its output looks like noise, and therefore cannot be compressed. But if you know the generation function, an infinitely long sequence can be compressed into a small piece of code, the generator function.
That's why there is "no limit", and Kolmogorov complexity states just that: you can never guarantee that there is not a better way to "model" the data.
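The random number generator example can be made concrete with a small experiment. In this sketch, `SEED`, `LENGTH` and `generate` are hypothetical names chosen for illustration, and zlib stands in for a general-purpose compressor that does not know the model. The generic compressor gains essentially nothing, while someone who knows the generator only needs the seed and the length.

```python
import random
import zlib

SEED = 12345        # assumption: the whole "model" is this seed plus the code below
LENGTH = 100_000    # number of bytes to generate

def generate(seed, length):
    """Deterministically produce `length` pseudo-random bytes from `seed`."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))

data = generate(SEED, LENGTH)

# A compressor without the model sees noise and cannot shrink it.
generic_size = len(zlib.compress(data, 9))

# With the model, the seed and length are enough to rebuild the data exactly.
regenerated = generate(SEED, LENGTH)

print(LENGTH, generic_size, regenerated == data)
```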
The second part is computable : Entropy is where you find the "Shannon Limit".
Given a set of symbols (typically, the output symbols from the model), which are part of an alphabet, you can compute the optimal cost, and find a way to reach the proven ultimate compression limit, which is the Shannon limit.
Huffman is optimal with regard to the Shannon limit if you accept the limitation that each symbol must be encoded using an integer number of bits. This is a close but imperfect approximation. Better compression can be achieved by using fractional bits, which is what arithmetic coders offer, as does the more recent ANS-based Finite State Entropy coder. Both get much closer to the Shannon limit.
The Shannon limit only applies if you treat a set of symbols "individually". As soon as you try to "combine them", or find any correlations between the symbols, you are "modeling". And this is the territory of Kolmogorov Complexity, which is not computable.

No. It can be proven that there is not even an algorithm to determine how well a perfect compressor will do. See Kolmogorov Complexity.
Huffman coding (or arithmetic coding) by itself does not get close to the best compression. Other techniques need to be used to take advantage of higher order redundancies in the data.

What's the best entropy encoding scheme to compress symbols with a known probability distribution?

I'm looking to encode user_ids in a long list of call records. The parts of these records that take up the most space are the symbols for the caller and receiver. I will create a map that assigns the most active callers shorter symbols; this will help keep the overall size of the files (and therefore the I/O time) down.
I know in advance how many times each symbol will be used---in other words I know the relative probability distribution. Furthermore, it is not important that the codes that are produced be "prefix free" such as Huffman codes. So what's the best encoding scheme, i.e., the one that will deliver the most compression and for which a quick implementation exists?
An answer should not only point to a compression scheme, it should also point to an implementation of that encoding scheme.

For general-purpose lossless encoding with a known probability distribution, aside from Huffman coding, the other "textbook" answer is arithmetic coding.
In practice, there are a variety of implementations. See these general-purpose coders. Each has different properties. Without further information, we can't give you a more precise answer.

@conradlee: re "In what cases is arithmetic coding better than Huffman coding?" In terms of compression, nearly always. If you have a symbol, S, with a probability, Ps, then the ideal number of bits to code it with, bs, is -log(Ps)/log(2). For example, if Ps is 1/3 then bs is ~1.585 bits. With Huffman you have to round up or down to a whole number of bits (so the compression ratio will decrease). Arithmetic coding can store it with a fractional number of bits.
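As a small worked example of that rounding cost, consider three symbols that each have probability 1/3. A Huffman code for them has to use code lengths of 1, 2 and 2 bits, whereas the ideal cost is about 1.585 bits for every symbol. The variable names below are just for illustration.

```python
import math

p = 1 / 3                      # probability of each of the three symbols
ideal_bits = -math.log2(p)     # ideal cost per symbol, -log(Ps)/log(2)

# Huffman must use whole bits; for three equal symbols the lengths are 1, 2, 2.
huffman_lengths = [1, 2, 2]
huffman_average = sum(p * n for n in huffman_lengths)

# An ideal arithmetic coder approaches the entropy, here equal to ideal_bits.
entropy = sum(p * ideal_bits for _ in range(3))

print(round(ideal_bits, 3), round(huffman_average, 3), round(entropy, 3))
```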

What is the maximum theoretically possible compression rate?

This is a theoretical question, so expect that many details here are not computable in practice or even in theory.
Let's say I have a string s that I want to compress. The result should be a self-extracting binary (can be x86 assembler, but it can also be some other hypothetical Turing-complete low level language) which outputs s.
Now, we can easily iterate through all possible such binaries and programs, ordered by size. Let B_s be the sub-list of these binaries that output s (of course B_s is uncomputable).
As every set of positive integers must have a minimum, there must be a smallest program b_min_s in B_s.
For what languages (i.e. set of strings) do we know something about the size of b_min_s? Maybe only an estimation. (I can construct some trivial examples where I can always even calculate B_s and also b_min_s, but I am interested in more interesting languages.)

This is Kolmogorov complexity, and you are correct that it's not computable. If it were, you could create a paradoxical program of length n that printed a string with Kolmogorov complexity m > n.
Clearly, you can bound b_min_s for given inputs. However, as far as I know most of the efforts to do so have been existence proofs. For instance, there is an ongoing competition to compress English Wikipedia.

Claude Shannon estimated the information density of the English language to be somewhere between 0.6 and 1.3 bits per character in his 1951 paper Prediction and Entropy of Printed English (PDF, 1.6 MB. Bell Sys. Tech. J (3) p. 50-64).

The maximal (average) compression rate possible is 1:1.
The number of possible inputs is equal to the number of outputs.
It has to be, in order to be able to map the output back to the input.
To be able to store the output you need a container of the same size as the minimal container for the input, giving a 1:1 compression rate.
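One way to read this counting argument: there are 2^n bit strings of length exactly n, but only 2^n - 1 strings that are strictly shorter, so a lossless scheme cannot map every n-bit input to a shorter output. A tiny check, with hypothetical function names:

```python
def strings_of_length(n):
    """Number of distinct bit strings of length exactly n."""
    return 2 ** n

def strings_shorter_than(n):
    """Number of distinct bit strings of length 0 .. n-1."""
    return sum(2 ** k for k in range(n))

for n in range(1, 11):
    # There is always one fewer shorter string than there are inputs.
    print(n, strings_of_length(n), strings_shorter_than(n))
```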

Basically, you need enough information to rebuild your original information. I guess the other answers are more helpful for your theoretical discussion, but just keep this in mind.

How do I improve breaking substitution ciphers programmatically?

I have written (am writing) a program to analyze encrypted text and attempt to break it using frequency analysis.
The encrypted text takes the form of each letter being substituted for some other letter, i.e. a->m, b->z, c->t, etc. All spaces and non-alphabetic characters are removed, and upper-case letters are made lowercase.
An example would be :
Original input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde
Here it has only got I, A and Y correct.
Currently my program cracks it by analysing the frequency of each individual character, and mapping it to the character that appears in the same frequency rank in a non encrypted text.
I am looking for methods and ways to improve the accuracy of my program as at the moment I don't get too many characters right. For example when attempting to crack X amount of characters from Pride and Prejudice, I get:
1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.
I am using Romeo and Juliet as a base to get the frequency data.
It has been suggested to me to look at and use the frequency of character pairs, but I am unsure how to use this because unless I am using very large encrypted texts I can imagine a similar approach to how I am doing single characters would be even more inaccurate and cause more errors than successes. I am hoping also to make my encryption cracker more accurate for shorter 'inputs'.
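For reference, the frequency-rank approach described in the question can be sketched roughly as follows. This is not the asker's code; the function names and the toy strings are invented, and the toy data is chosen so that all letter counts are distinct (real text has many near-ties, which is exactly where this method goes wrong).

```python
from collections import Counter

def rank_letters(text):
    """Letters of `text`, ordered from most to least frequent."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return [letter for letter, _ in counts.most_common()]

def guess_key(ciphertext, reference_text):
    """Map each cipher letter to the reference letter with the same frequency rank."""
    return dict(zip(rank_letters(ciphertext), rank_letters(reference_text)))

def decrypt(ciphertext, key):
    # Cipher letters without a guess are shown as '?'.
    return "".join(key.get(c, "?") for c in ciphertext)

# Hypothetical toy data with distinct counts: e > t > a and x > y > z.
reference = "eeeettta"
cipher = "xxxxyyyz"

key = guess_key(cipher, reference)
plaintext_guess = decrypt(cipher, key)
print(key, plaintext_guess)
```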

I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:
1) Frequency mapping is not enough to solve a puzzle like this. Many frequencies are very close to each other, and if you aren't using the same text for the frequency source and the plaintext, you are almost guaranteed to have a few letters off no matter how long the text is. Different materials will have different usage patterns.
2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.
3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.
Edit:
I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is you, but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon whereas YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text; a lot of Shakespeare is not the same as half Shakespeare and half journal articles.
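A bigraph table of the kind described here could be built along these lines. This is a minimal sketch: `bigram_table`, `rank_candidates` and the tiny corpus are all made up for illustration, and a real table would need a large, varied corpus as noted above.

```python
from collections import Counter, defaultdict

def bigram_table(text):
    """counts[a][b] = how often letter a is immediately followed by letter b."""
    letters = [c for c in text.lower() if c.isalpha()]  # spaces are dropped
    counts = defaultdict(Counter)
    for prev, nxt in zip(letters, letters[1:]):
        counts[prev][nxt] += 1
    return counts

def rank_candidates(counts, prev, candidates):
    """Order candidate letters by estimated P(candidate | prev), best first."""
    total = sum(counts[prev].values())
    def probability(c):
        return counts[prev][c] / total if total else 0.0
    return sorted(candidates, key=probability, reverse=True)

# Hypothetical miniature corpus, just enough to show the mechanics.
corpus = "you know your young yonder yoke"
table = bigram_table(corpus)

# Which of the remaining letters N, C, O best follows a solved Y?
print(rank_candidates(table, "y", ["n", "c", "o"]))
```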

Looking at character pairs makes a lot of sense to me.
Every single letter of the alphabet can be used in valid text, but there are many pairs that are either extremely unlikely or will never happen.
For example, there is no way to get qq using valid English words, as every q must be followed by a u. If you have the same letters repeated in the encrypted text, you can automatically exclude the possibility that they represent q.
The fact that you are removing spaces from the input limits the utility somewhat since combinations that would never exist in a single word e.g. ht can now occur if the h ends one word and the t begins another one. Still, I suspect that these additional data points will enable you to resolve much shorter strings of text.
Also, I would suggest that Romeo and Juliet is only a good basis for statistical data if you intend to analyze writings of the period. There have been some substantial changes to spelling and word usage that may skew the statistics.
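The doubled-letter rule is easy to apply mechanically. The sketch below (with an invented helper name) collects every cipher letter that appears twice in a row in the question's sample ciphertext; under the reasoning above, those letters can be ruled out as q. With spaces stripped, a doubled pair could in principle straddle a word boundary, so treat this as a strong hint rather than a proof.

```python
def doubled_letters(ciphertext):
    """Set of cipher letters that occur twice in a row somewhere in the text."""
    return {a for a, b in zip(ciphertext, ciphertext[1:]) if a == b}

# Sample ciphertext taken from the question.
ciphertext = "ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl"

not_q = doubled_letters(ciphertext)
print(sorted(not_q))
```

Here the two letters found correspond to the doubled s in "message" and the doubled t in "letters" in the original input.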

First of all, Romeo and Juliet probably isn't a very good basis to use. Second, yes digraphs are helpful (and so are trigraphs). For a substitution cipher like you're looking at, a good place to start would be the Military Cryptanalysis books by William Friedman.

Well, I have solved some simple substitution ciphers in my time, so I can speak freely.
Removing the spaces from the input string makes it nearly impossible to solve.
While it is true that most English sentences have 'e' in higher frequency, that is not all there is to the process.
The part that makes the activity fun is the cycle of forming a hypothesis, testing it, and accepting or rejecting it, which makes the whole thing an iterative process.
Many sentences contain the words 'of' and 'the'. Looking at your sentence and assuming that one of the two-letter words is 'of' implies further substitutions that allow you to make inferences about other words. In short, you need a dictionary of high-frequency words to allow you to make further inferences.
As there could be a large amount of backtracking involved, it may be wise to consider a Prolog or Erlang implementation as a basis for developing the C++ one.
Best of luck to you.
Kindly share your results when done.

Single-letter words are a big hint (generally only "A" and "I", rarely "O"; casual language allows "K"). There is also a finite set of two- and three-letter words. This is no help if spaces have been stripped.
Pairs are much more diagnostic than you would think. For instance: some letters never appear doubled in English (though this is not absolute if the spaces have been stripped or if foreign vocabulary is allowed), and others are commonly doubled; also some heterogeneous pairs are very frequent.
As a general rule, no one analysis will provide certainty. You need to assign each cipher letter a set of possible translations with associated probabilities, and combine several tests until the probabilities become very significant.
You may be able to determine when you've gotten close by checking the Shannon Entropy.

Not a complete answer, but maybe a helpful pointer: you can use a dictionary to determine how good your plaintext candidate is. On a UNIX system with aspell installed, you can extract an English word list with the command
aspell -l en dump master
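One possible way to turn a word list into a score when the spaces have been stripped is to measure how much of the candidate can be covered by dictionary words. The sketch below is only an illustration: `dictionary_coverage` is a made-up function, and the four-word set stands in for a real word list (for example, one word per line from the command above). Very short words such as "a" inflate the score, so a real scorer would probably weight longer words more heavily.

```python
def dictionary_coverage(candidate, words):
    """Fraction of characters in `candidate` covered by non-overlapping
    dictionary words, using the best possible segmentation."""
    n = len(candidate)
    if n == 0:
        return 0.0
    max_len = max((len(w) for w in words), default=0)
    best = [0] * (n + 1)   # best[i] = most characters covered in candidate[:i]
    for i in range(1, n + 1):
        best[i] = best[i - 1]          # option: leave character i-1 uncovered
        for length in range(1, min(max_len, i) + 1):
            if candidate[i - length:i] in words:
                best[i] = max(best[i], best[i - length] + length)
    return best[n] / n

words = {"this", "is", "a", "sample"}   # stand-in for a real dictionary

print(dictionary_coverage("thisisasample", words))   # correct plaintext
print(dictionary_coverage("omieieaeanuht", words))   # poor guess
```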

You might try looking at pairs rather than individual letters. For instance, a t is often followed by an h in English, as is an s. Markov modeling would be useful here.

Frequency Analysis
Frequency analysis is a great place to start. However, Romeo and Juliet is not a very good choice to take character frequencies from to decipher Pride and Prejudice text. I would suggest using frequencies from this page because it uses 7 different texts that are closer in age to Pride and Prejudice. It also lists probabilities for digraphs and trigraphs. However, digraphs and trigraphs may not be as useful when spaces are removed from the text because this introduces the noise of digraphs and trigraphs created by words being mashed together.
Another resource for character frequencies is this site. It claims to use 'a good mix of different literary genres.'
Frequency analysis generally becomes more probabilistically correct with increased length of the encrypted text as you've seen. Frequency analysis also only helps to suggest the right direction in which to go. For instance, the encrypted character with the highest frequency may be the e, but it could also very well be the a which also has a high frequency. One common method is to start with some of the highest frequency letters in the given language, try matching these letters with different letters of high frequency in the text, and look to see if they form common words like the, that, is, as, and, and so on. Then you go from there.
A Good Introductory Book
If you are looking for a good layman's introduction to cryptography, you might try The Code Book by Simon Singh. It's very readable and interesting. The book looks at the development of codes and codebreaking throughout history. He covers substitution ciphers fairly early on and describes some common methods for breaking them. Also, he had a Cipher Challenge in the book (which has already been completed) that consisted of various codes to try to break, including some substitution ciphers. You might try reading through how the Swedish team broke these ciphers at this site. However, I might suggest reading at least through the substitution cipher part of the book before reading these solutions.
By the way I'm not affiliated in any way with the publication of this book. I just really enjoyed it.

Regarding digraphs, digrams and word approximations, John Pierce (who coined the name "transistor" and did early work on PCM) wrote an excellent book, Introduction to Information Theory, that contains an extended analysis of calculating their characteristics, why you would want to and how to locate them. I found it helpful when writing frequency analysis decryption code myself.
Also, you will probably want to write an ergodic source to feed your system, rather than relying on a single source (e.g., a novel).

Interesting question; I asked a similar one :)
One thing I'm trying to do is:
scan for the longer words that have repeating letters in them,
then find a known word whose letter pattern matches that of the longer word from the cipher.
The reason is simply that the longer the word, the more letters are deciphered at once, and longer words are easier to decode, for the same reason that a longer text is easier to decode: there are more chances to see patterns emerge :)
