How does preprocessing the data with the move-to-front algorithm help Huffman coding? - compression

I need to compress a text using Huffman coding. Before doing that, I need to preprocess the input text using the move-to-front (MTF) algorithm. The output of the MTF algorithm is an array of indices giving each character's position in the recency list, with lower indices for frequently used characters. How does this help Huffman encoding? Can anyone give me an example of how these two can be merged?

It depends on your data. If the same letters are used often locally, MTF can reduce the entropy of the data, and Huffman then can take advantage of that reduction to compress the data.
On typical text, MTF won't help a lot (might even hurt), but MTF helps quite a bit after a Burrows-Wheeler transform, which tends to group the same characters in a reversible way. You can also do run-length encoding after the MTF.
To merge them, simply do the MTF and then Huffman-code the resulting integer indices.
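A concrete sketch of that pipeline in C++ (only the MTF step is shown; the Huffman routine is assumed to be whatever implementation you already have):

    #include <cstdint>
    #include <numeric>
    #include <string>
    #include <vector>

    // Move-to-front: emit each byte's current position in a recency list,
    // then move that byte to the front of the list. Recently repeated bytes
    // therefore produce small indices (mostly 0s and 1s on repetitive data).
    std::vector<uint8_t> mtf_encode(const std::string& input) {
        std::vector<uint8_t> alphabet(256);
        std::iota(alphabet.begin(), alphabet.end(), uint8_t{0});  // 0, 1, ..., 255

        std::vector<uint8_t> indices;
        indices.reserve(input.size());

        for (unsigned char c : input) {
            size_t pos = 0;
            while (alphabet[pos] != c) ++pos;              // current position of c
            indices.push_back(static_cast<uint8_t>(pos));

            alphabet.erase(alphabet.begin() + pos);        // move c to the front
            alphabet.insert(alphabet.begin(), c);
        }
        return indices;
    }

    // The Huffman coder then builds its frequency table over these index
    // values (0..255) instead of over the raw characters, e.g.
    //   std::vector<uint8_t> symbols = mtf_encode(text);
    //   huffman_encode(symbols);   // hypothetical: your existing Huffman routine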

Related

Compressing an array of two values

What is the cheapest way to compress an array into a byte[] array, given that it contains only two possible values?
The array length has no limit.
The best idea I have seen so far is to store the number of times each value is repeated, so that, for example, the array "11111001" is compressed to "521".
I wonder if there is a better way.
Thanks.
First off, convert to a bit array. Right off the bat it will take 1/8th the space.
Then you'd need to examine your data to see if there is any apparent redundancy. If not, you're done. If there is redundancy, you'd need to figure out a way to model it, and then compress it. Run-length coding as you propose is useful only if long runs of zeros and/or ones are common in the data.
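A sketch of that first step, assuming the two values are (or are mapped to) 0 and 1:

    #include <cstdint>
    #include <vector>

    // Pack an array whose elements are only 0 or 1 into a bit array:
    // eight input elements per output byte.
    std::vector<uint8_t> pack_bits(const std::vector<int>& values) {
        std::vector<uint8_t> packed((values.size() + 7) / 8, 0);
        for (size_t i = 0; i < values.size(); ++i) {
            if (values[i] != 0) {
                packed[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
            }
        }
        return packed;
    }

    // Reading a bit back; the original element count must be stored alongside
    // the packed bytes, since it is not recoverable from them alone.
    int get_bit(const std::vector<uint8_t>& packed, size_t i) {
        return (packed[i / 8] >> (i % 8)) & 1;
    }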

Finding all repeating patterns in a file using C++

I'm looking for a way to find all repeating sequences of at least 3 characters in an input file and then print out the most frequent ones. It seems to require a lot of string processing and intense searching through the input file, especially because there is no upper bound on the maximum size of the patterns to look for.
Is there any efficient algorithm to do it with the least possible processing and messiness?
Should I use string.h, or would I be better off with char arrays?
Any tips/helpful snippets etc. on how to start?
Thanks.
I would suggest that you create a suffix tree from the file. This has linear complexity with respect to the size of your file and will solve the problem. You can modify the algorithm just a little bit to store, alongside each string, how many times it occurs. Here is a great post explaining how to create a suffix tree.
Finding the most frequent one is quite easy, if you realize that the most frequent sequence is 4 characters long. It can be done in O(n) time, where n is the size of the input file.
You can build a std::map<string,int>, iterate character by character taking sequences of 4 characters at a time, and increment the value at the corresponding key in the map.
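A sketch of that counting approach (the input file name is a placeholder; strictly speaking std::map makes the counting O(n log n), while a std::unordered_map gives the expected O(n) mentioned above):

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("input.txt");   // placeholder file name
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

        // Count every 4-character window in the file.
        std::map<std::string, int> counts;
        for (size_t i = 0; i + 4 <= data.size(); ++i) {
            ++counts[data.substr(i, 4)];
        }

        // Report the most frequent window.
        auto best = std::max_element(counts.begin(), counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        if (best != counts.end()) {
            std::cout << '"' << best->first << "\" occurs "
                      << best->second << " times\n";
        }
    }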

Possible to create a compression algorithm that uses an enormous (100GB?) pseudo-random look-up file?

Would it be possible/practical to create a compression algorithm that splits a file into chunks and then compares those chunks against an enormous (100GB?, 200GB?) pseudo-random file?
The resulting "compressed" file would contain an ordered list of offsets and lengths. Everyone using the algorithm would need the same enormous file in order to compress/decompress files.
Would this work? I assume someone else has thought of this before and tried it but it's a tough one to Google.
It's a common trick, used by many compression "claimers", who regularly announce "revolutionary" compression ratios, up to ridiculous levels.
The trick depends, obviously, on what's in the reference dictionary.
If such a dictionary is just "random", as suggested, then it is useless. Simple math will show that the offset will cost, on average, as much as the data it references. As a rough illustration: a 100 GB dictionary has on the order of 2^37 byte positions, so an offset into it costs about 37 bits; but in truly random data, a chunk much longer than 4 or 5 bytes is unlikely to appear at any of those positions, so a match saves at most roughly the 37 bits the offset itself costs.
But if the dictionary happens to contain large parts of the input file, or the entire file, then it will be "magically" compressed to a reference, or a series of references.
Such tricks are called "hiding the entropy". Matt Mahoney wrote a simple program (barf) to demonstrate this technique, up to the point of reducing anything to 1 byte.
The solution to this trickery is that a comparison exercise should always include the compressed data, the decompression program, and any external dictionary it uses. When all these elements are counted in the equation, it's no longer possible to "hide" entropy anywhere. And the cheat gets revealed.
Cyan is correct. Even more: You wouldn't need to have such a file. You can deterministically produce the same pseudo-random sequence without ever storing it. By looking at it that way you see that your random lookup file has no value.
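To illustrate that point: any byte of such a "dictionary" can be recomputed on demand from nothing but a seed, so the huge file carries no information beyond the seed. A sketch (the particular mixing function and seed are arbitrary choices for illustration):

    #include <cstdint>

    // Byte i of a pseudo-random "dictionary" defined entirely by a seed
    // (a splitmix64-style mixer; the exact generator is immaterial here).
    // Nothing is ever stored: the same seed always reproduces the same bytes,
    // which is exactly why such a dictionary adds no information of its own.
    uint8_t dictionary_byte(uint64_t i, uint64_t seed = 12345) {
        uint64_t z = seed + i * 0x9E3779B97F4A7C15ull;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return static_cast<uint8_t>(z ^ (z >> 31));
    }

In other words, sharing the enormous file is equivalent to sharing a few bytes of seed.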

UTF8 String: simple idea to make index() lookup O(1)

Background
I am using the UTF8-CPP class. The vast majority of my strings are using the ASCII character set (0-127). The problem with UTF8-based strings is that the index function (i.e. retrieving a character at a specific position) is slow.
Idea
A simple technique is to use a flag as a property which basically says whether the string is pure ASCII or not (isAscii). This flag would be updated whenever the string is modified.
This solution seems too simple, and there may be things I am overlooking. But, if this solution is viable, does it not provide the best of both worlds (i.e. Unicode when needed and performance for the vast majority of cases), and would it not guarantee O(1) for index lookups?
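To make the idea concrete, a minimal sketch of such a wrapper (a hypothetical class, not the UTF8-CPP interface; the slow path is just a plain scan and assumes valid UTF-8):

    #include <cstdint>
    #include <string>

    // Hypothetical wrapper illustrating the flag idea: remember whether the
    // bytes are pure ASCII and only fall back to a UTF-8 scan when they are not.
    class Utf8String {
    public:
        void assign(std::string bytes) {
            data_ = std::move(bytes);
            isAscii_ = true;
            for (unsigned char c : data_)
                if (c > 0x7F) { isAscii_ = false; break; }
        }

        // Code point at code-point position i.
        uint32_t at(size_t i) const {
            if (isAscii_)
                return static_cast<unsigned char>(data_[i]);   // O(1) fast path
            return decodeSlow(i);                              // O(n) scan
        }

    private:
        static bool isLead(unsigned char c) { return (c & 0xC0) != 0x80; }

        // Slow path: walk the UTF-8 bytes from the start (assumes valid UTF-8).
        uint32_t decodeSlow(size_t i) const {
            size_t pos = 0;
            for (size_t skipped = 0; skipped < i; ++skipped) {
                ++pos;                                         // past the lead byte
                while (pos < data_.size() &&
                       !isLead(static_cast<unsigned char>(data_[pos])))
                    ++pos;                                     // past continuation bytes
            }
            unsigned char lead = static_cast<unsigned char>(data_[pos]);
            if (lead < 0x80) return lead;                      // 1-byte code point
            size_t len = lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            uint32_t value = lead & (0x7F >> len);
            for (size_t k = 1; k < len; ++k)
                value = (value << 6) |
                        (static_cast<unsigned char>(data_[pos + k]) & 0x3F);
            return value;
        }

        std::string data_;
        bool isAscii_ = false;
    };

The fast path only applies while the string stays pure ASCII; a single non-ASCII byte sends every lookup back to the scan (or to an index structure such as the ones suggested in the answers).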
UPDATE
I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts).
All good replies.
I think the point here is that while the vast majority of your strings are ASCII, the designer of a UTF-8 library should, in general, expect arbitrary UTF-8 strings. And there, checking and setting this flag is unnecessary overhead.
In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.
"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.
If you want to speed up the UTF8 case...
First, consider sequential indexing of code points, thus avoiding counting them from the very beginning of the string again and again. Implement and use routines to index the next and the previous code points.
Second, you may build an array of indices into the UTF8 string's code points and use it as the first step when searching; it will give you an approximate location of the sought code point.
You may either have it (the array) of a fixed size, in which case you will still get search time ~ O(n) with O(1) memory cost, or have it contain equally-spaced indices (that is, indices into every m'th code point, where m is some constant), in which case you will get search time ~ O(m+log(n)) with O(n) memory cost.
You could also embed indices inside the code point data encoding them as reserved/unused/etc code points or use invalid encoding (say, first byte being 11111110 binary, then, for example, 6 10xxxxxx bytes containing the index, or whatever you like).
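A sketch of the equally-spaced variant (assuming valid UTF-8 and in-range indices; m is the checkpoint spacing):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Checkpoint index: store the byte offset of every m'th code point, then
    // to reach code point i, jump to checkpoint i / m and walk at most m - 1
    // code points forward. The string must outlive the index.
    struct Utf8Index {
        static bool isLead(unsigned char c) { return (c & 0xC0) != 0x80; }

        Utf8Index(const std::string& s, size_t m) : str_(s), m_(m) {
            size_t cp = 0;
            for (size_t pos = 0; pos < s.size(); ++pos) {
                if (isLead(static_cast<unsigned char>(s[pos]))) {
                    if (cp % m == 0) checkpoints_.push_back(pos);
                    ++cp;
                }
            }
        }

        // Byte offset of code point i: O(m) instead of O(n).
        size_t byteOffset(size_t i) const {
            size_t pos = checkpoints_[i / m_];
            for (size_t cp = (i / m_) * m_; cp < i; ++cp) {
                ++pos;                                      // past the lead byte
                while (pos < str_.size() &&
                       !isLead(static_cast<unsigned char>(str_[pos])))
                    ++pos;                                  // past continuation bytes
            }
            return pos;
        }

        const std::string& str_;
        std::vector<size_t> checkpoints_;
        size_t m_;
    };

Each lookup then touches at most m code points, and the index costs one stored offset per m code points of the string.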

Is it possible to do simple arithmetic (e.g. addition) on "compressed" integers?

I would like to compress an array of integers, initially initialized to 0, using a yet-to-be-determined integer compression/decompression method.
Is it possible with some integer compression method to increment (+1) a specific element of an array of compressed integers accurately using C or C++?
Of all the common compression techniques, two stand out as potentially usable for this without a full decompression cycle.
First, sparse arrays were built specifically for this. With a sparse array, you typically store a map of index to value. You don't store array elements that haven't been modified, so if most of your array is 0, it need not be stored. Many arrays (and matrices) in simulations are sparse, and there's a huge literature. Here, adding to a value is simply a matter of accessing the index with [] and incrementing; the access creates the entry if it doesn't exist.
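A sketch of the sparse approach (a hash map from index to count; anything never incremented is implicitly 0):

    #include <cstdint>
    #include <unordered_map>

    // Sparse "array" of counters: only entries that have been touched are
    // stored; everything else is implicitly 0.
    struct SparseCounters {
        std::unordered_map<size_t, uint64_t> values;

        // operator[] creates a zero entry on first access, so incrementing an
        // untouched index needs no decompression step at all.
        void increment(size_t index) { ++values[index]; }

        uint64_t get(size_t index) const {
            auto it = values.find(index);
            return it == values.end() ? 0 : it->second;
        }
    };

A std::map works just as well if you need the stored indices kept in order.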
Next, run-length encoding may also work if you find that you are working with long sequences of the same number, but those "runs" are not all of the same number. Since they are not all the same, a sparse array would not work, and RLE is a solution. Incrementing a number is not as easy as with a sparse array: if the element is not part of a run, you change it and check whether you can now form a new run; if it is part of a run, you split the run. RLE typically only makes sense with visual data or certain mathematical patterns.
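A sketch of the run-splitting case (a hypothetical Run representation; merging newly adjacent runs that end up holding the same value is left out for brevity):

    #include <cstdint>
    #include <vector>

    // Run-length encoded array: each run is (value, length).
    struct Run { int64_t value; size_t length; };

    // Increment element `index` in place; assumes index is within bounds.
    // The containing run is split into up to three runs.
    void rle_increment(std::vector<Run>& runs, size_t index) {
        size_t i = 0, start = 0;
        while (start + runs[i].length <= index) {   // find the containing run
            start += runs[i].length;
            ++i;
        }
        Run run = runs[i];
        size_t before = index - start;              // elements left of `index`
        size_t after  = run.length - before - 1;    // elements right of it

        std::vector<Run> pieces;
        if (before > 0) pieces.push_back({run.value, before});
        pieces.push_back({run.value + 1, 1});
        if (after > 0) pieces.push_back({run.value, after});

        runs.erase(runs.begin() + i);
        runs.insert(runs.begin() + i, pieces.begin(), pieces.end());
    }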
You can certainly implement this, if your increment method:
Decompresses the entire array.
Increments the desired entry.
Compresses the entire array again.
If you want to increment in a less dumb way, you'll need intimate knowledge of the compression process, and so would we in order to give you more assistance.
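For completeness, a sketch of that brute-force route, using zlib here purely as a stand-in compressor (error handling omitted; the element count is assumed to be known to the caller):

    #include <cstdint>
    #include <vector>
    #include <zlib.h>   // link with -lz

    // The "dumb" increment: inflate the whole array, bump one entry, deflate it
    // again. Correct, but costs O(array size) work per increment.
    // (Return codes from compress/uncompress should be checked in real code.)
    std::vector<uint8_t> increment_compressed(const std::vector<uint8_t>& blob,
                                              size_t count, size_t index) {
        // 1. Decompress the entire array.
        std::vector<uint32_t> values(count);
        uLongf rawLen = static_cast<uLongf>(count * sizeof(uint32_t));
        uncompress(reinterpret_cast<Bytef*>(values.data()), &rawLen,
                   reinterpret_cast<const Bytef*>(blob.data()),
                   static_cast<uLong>(blob.size()));

        // 2. Increment the desired entry.
        ++values[index];

        // 3. Compress the entire array again.
        uLongf outLen = compressBound(rawLen);
        std::vector<uint8_t> out(outLen);
        compress(reinterpret_cast<Bytef*>(out.data()), &outLen,
                 reinterpret_cast<const Bytef*>(values.data()), rawLen);
        out.resize(outLen);
        return out;
    }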