Tools like binary decision diagrams for compression

I am looking for compression algorithms (for binary data with considerable regularity) that allow me to operate on the compressed data to quickly answer questions like: What is the value of the k-th bit of the input? How many bits are set? What is the intersection of this set with some other set?
An example of this type of algorithm is binary decision diagrams. I am wondering if there are others like that, maybe with more powerful operators.
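To make the three queries concrete, here is a minimal sketch that answers them on a run-length style representation: a sorted list of half-open runs of 1-bits. This is not a binary decision diagram, just a much simpler structure used for illustration, and the class name RunSet and its methods are made up for this example.

```python
from bisect import bisect_right

class RunSet:
    """Bit vector stored as sorted, disjoint half-open runs [start, end) of 1-bits."""

    def __init__(self, runs):
        self.runs = list(runs)                      # e.g. [(2, 5), (9, 10)]
        self.starts = [s for s, _ in self.runs]     # run starts, for binary search

    @classmethod
    def from_bits(cls, bits):
        """Build the runs from a string of '0' and '1' characters."""
        runs, start = [], None
        for i, b in enumerate(bits):
            if b == "1" and start is None:
                start = i
            elif b == "0" and start is not None:
                runs.append((start, i))
                start = None
        if start is not None:
            runs.append((start, len(bits)))
        return cls(runs)

    def bit(self, k):
        """Value of the k-th bit (0-based), found by binary search over the runs."""
        i = bisect_right(self.starts, k) - 1
        return i >= 0 and k < self.runs[i][1]

    def count(self):
        """Number of set bits, without expanding the runs."""
        return sum(e - s for s, e in self.runs)

    def intersect(self, other):
        """Intersection with another RunSet via a linear merge of the two run lists."""
        out, i, j = [], 0, 0
        a, b = self.runs, other.runs
        while i < len(a) and j < len(b):
            lo = max(a[i][0], b[j][0])
            hi = min(a[i][1], b[j][1])
            if lo < hi:
                out.append((lo, hi))
            # Advance whichever run ends first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
        return RunSet(out)

x = RunSet.from_bits("0011100001")
y = RunSet.from_bits("0001111111")
print(x.bit(3), x.count(), x.intersect(y).runs)
```

All three operations run in time that depends on the number of runs rather than the number of bits, which is the property the question is after.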


Which algorithm is most suitable for large text compression?

Currently, I am looking for a lossless compression algorithm suitable for a large amount of text, which will then be encrypted with AES and used as the payload in steganography.
Based on A Comparative Study Of Text Compression Algorithms, it seems that
Arithmetic coding is preferable in Statistical compression techniques, while LZB is recommended for Dictionary compression techniques.
So now I am wondering whether statistical compression or dictionary compression is more suitable for compressing large English texts, in terms of compression ratio and ease of implementation.
I have searched around but still barely have an idea of which algorithm is suitable. Thank you very much for your time in answering. Have a nice day. :)
A lot of the algorithms that you are describing in this question are called entropy coders (Shannon-Fano, Huffman, arithmetic, etc.). Entropy coders are used to compress sequences of symbols (often bytes), where some symbols are much more frequent than others. Simple entropy coding of symbols (letters) for compressing natural language will only yield about a 2:1 compression.
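As a rough check on that figure, the sketch below estimates the zero-order entropy of a byte string, i.e. the best average code length a symbol-by-symbol entropy coder could reach if it only knew the symbol frequencies. The sample text and the function name are assumptions made for illustration; a real measurement would need a much larger corpus.

```python
import math
from collections import Counter

def zero_order_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, using only individual byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return sum(c / n * math.log2(n / c) for c in counts.values())

# Hypothetical sample; substitute any large English text.
sample = (b"entropy coders are used to compress sequences of symbols "
          b"where some symbols are much more frequent than others")
h = zero_order_entropy(sample)
print(f"{h:.2f} bits per byte, so at best about {8 / h:.2f}:1 for 8-bit input")
```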
Instead, popular modern lossless compression techniques for text include methods like LZ77, LZW, and BWT. Loosely speaking, the LZ family involves building up a dictionary of recurring short symbol sequences (we'll call them "words") and then using pointers to reference those words. Some LZ implementations, like LZ77 and LZW, can be fairly simple to code up but probably do not yield the highest compression ratios. See for example this video: On the other end of the spectrum, LZMA2 is a relatively more complicated variant with a higher compression ratio.
The Burrows-Wheeler transform (BWT) provides a clever alternative to the dictionary methods. I'll refer you to the Wikipedia article for details.
In a nutshell, though, it produces an (invertible) permutation of the original sequence of bytes that can often be compressed very effectively by run-length encoding followed by an entropy coder.
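To show what that permutation looks like, here is a deliberately naive sketch of the transform and its inverse. It sorts all rotations explicitly, which is far too slow for real data (practical implementations use suffix arrays), and it assumes the input does not contain the sentinel character.

```python
def bwt(s: str, end: str = "\0") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations of s + end."""
    assert end not in s, "the sentinel must not occur in the input"
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last: str, end: str = "\0") -> str:
    """Invert the transform by rebuilding the sorted rotation table column by column."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(end))
    return row[:-1]  # drop the sentinel

transformed = bwt("banana")
print(repr(transformed))
print(inverse_bwt(transformed))
```

Equal characters tend to end up next to each other in the output, which is what makes the later run-length and entropy coding stages effective.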
If I had to code a compression technique from scratch, for simplicity, I'd probably go with LZW or LZ77.
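For a sense of how little code LZW needs, here is a minimal sketch. It emits a plain list of integer codes and lets the dictionary grow without bound; a usable compressor would also have to pack the codes into bits and cap or reset the dictionary, and both of those are left out here.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode bytes as a list of dictionary codes (0-255 are single bytes)."""
    table = {bytes([i]): i for i in range(256)}
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            out.append(table[w])        # emit the longest known prefix
            table[wc] = len(table)      # learn the new sequence
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes, reconstructing the same dictionary on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[:1]           # code refers to the entry being defined right now
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)

text = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
print(len(text), "bytes ->", len(codes), "codes")
```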
Shannon-Fano coding, Huffman coding, Arithmetic coding, Range coding, and Asymmetric Numeral System coding are all zero-order entropy coders applied after you have first modeled your data, taking advantage of the inherent redundancy.
For text, that redundancy is repeated strings and higher-order correlations in the data. There are several ways to model text. The most common are Lempel-Ziv 77, which looks for matching strings, the Burrows-Wheeler Transform (look it up for a description), and prediction by partial matching.
Look to the Large Text Compression Benchmark to see comparisons in compression, compression speed, memory used, and decompression speed.

Is binary encoding necessary in genetic algorithms?

I'm doing a project exploring the use of genetic algorithms in architecture, where we use an evolutionary approach for creating Voronoi tessellation in 3d. This is done using ofxVoro++ for openFrameworks (c++).
The chromosome of each genome is a vector (list) of points in 3D. We have implemented single- and two-point crossover, plus a mutation that randomises these points with a certain probability. In most examples I've seen, the genome is encoded in binary, which I presume would cause mutation and crossover to act differently.
So my question is this: Are there any other benefits to binary encoding (except speed) and how would you handle such an encoding/decoding in c++? Going from binary to a list of 3d-points.
I have used various GAs for logistics and finance problems. Very often I do not use a binary representation.
The first example I can give you is the TSP:
There I used the standard representation: the chromosome is an array of integers, where each value represents a city.
So it depends on the type of problem you are trying to solve. If you can find a way to implement the GA without a binary representation, you do not need any adjustment.
Furthermore, I prefer the natural representation because it makes it simpler to see, while debugging the code, whether your GA is working as you want.
You can also use real encoding, but in that case the choice of crossover and mutation operators matters. If your crossover is simply (p1+p2) / 2 or p1*a + p2*(1-a), you will not get good results.
A good crossover operator for real encoding was proposed by K. Deb in 1995. Here is the paper:
Crossover and mutation are different operators. Crossover recombines existing genetic material; mutation introduces new genetic material into the population. Without knowing more about your algorithm, randomizing points sounds like mutation. Mutation is typically applied at a very low rate (maybe 1%), whereas the crossover rate can be rather high (50%).
So for your algorithm, I would not "modify" anything for crossover. Instead, for crossover, I would try to reposition material or simply take different portions of points from parents.
For mutation, it might make sense to add or subtract a small number to the points, thus modifying the points (mutation).
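The two suggestions above might look roughly like this on a genome stored directly as a list of 3D points, with no binary encoding. It is written in Python for brevity although the project is in C++; the function names, the mutation rate, and the step size sigma are illustrative assumptions, and both parents are assumed to have the same length of at least two points.

```python
import random

def one_point_crossover(p1, p2, rng):
    """Swap the tails of two parents; only existing points are rearranged."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def gaussian_mutation(genome, rng, rate=0.01, sigma=0.05):
    """With probability `rate` per point, nudge each coordinate by a small Gaussian step."""
    return [
        tuple(c + rng.gauss(0.0, sigma) for c in point) if rng.random() < rate else point
        for point in genome
    ]

rng = random.Random(0)
parent_a = [(0.0, 0.0, 0.0)] * 5
parent_b = [(1.0, 1.0, 1.0)] * 5
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
child_a = gaussian_mutation(child_a, rng)
print(child_a)
```

Crossover here never creates a new coordinate, and mutation only makes small local moves, which matches the division of labour described above.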
It is difficult to make suggestions without knowing more about your algorithm and chromosome representation.

Is there an algorithm for "perfect" compression?

Let me clarify, I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material, I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to its absolute maximum compressed state, as determined by its Shannon entropy.
I believe I have heard some things about Huffman coding being in some sense optimal, so I believe such a compression scheme might be based on that, but here is my issue:
Consider the bit-strings: a = "101010101010", b = "110100011010".
Using plain Shannon entropy, these bit strings should have the exact same entropy when we consider the bit strings as simply symbols of 0's and 1's, but this approach is flawed, because we can intuitively see that bitstring a has less entropy than bitstring b because it is simply a pattern of repeated 10's. With this in mind, we could get a better idea of the actual entropy of the source by calculating the Shannon entropy for the composite symbols 00, 10, 01, and 11.
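The comparison described above can be checked directly. The sketch below computes the empirical Shannon entropy of a bit string when it is cut into non-overlapping blocks of k bits; the helper name block_entropy is made up here, and with strings this short the numbers are only estimates.

```python
import math
from collections import Counter

def block_entropy(bits: str, k: int) -> float:
    """Empirical entropy, in bits per k-bit block, over non-overlapping blocks."""
    blocks = [bits[i:i + k] for i in range(0, len(bits) - k + 1, k)]
    n = len(blocks)
    return sum(c / n * math.log2(n / c) for c in Counter(blocks).values())

a = "101010101010"
b = "110100011010"
for name, s in (("a", a), ("b", b)):
    print(name, block_entropy(s, 1), round(block_entropy(s, 2), 3))
```

At the single-bit level both strings come out at exactly 1 bit per symbol. With 2-bit blocks, a drops to 0 because every block is "10", while b stays at about 1.92 bits per block.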
This is just my understanding, and I could be totally off base, but from what I understand, for an ergodic source of length n to be truly random, all n-length groups of symbols must be equally likely.
I suppose to be more specific than the question in the title, I have three main questions:
1) Does Huffman encoding using single bits as symbols compress a bitstring like a optimally, even with an obvious pattern that occurs when we analyze the string at the level of 2-bit symbols?
2) If not, could one optimally compress a source by cycling through different "levels" (sorry if I'm butchering the terminology here) of Huffman coding until the best compression rate is found?
3) Could going through different "rounds" of Huffman coding further increase the compression rate in some instances (e.g. first Huffman coding with 5-bit symbols, then Huffman coding with 4-bit symbols: huff_4bits(huff_5bits(bitstring)))?
As stated by Mark, the general answer is "no", due to Kolmogorov complexity. Let me expand a bit on that.
Compression is basically two steps:
1) Model
2) Entropy
The role of the model is to "guess" the next bytes or fields to come.
The model can take any form, and there is no limit to its effectiveness.
A trivial example is a random number generator function: from an external perspective, its output looks like noise and therefore cannot be compressed. But if you know the generation function, an infinitely long sequence can be compressed into a small piece of code, the generator function.
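A small experiment illustrates the point, using Python's built-in pseudo-random generator as the "generation function". The seed and length are arbitrary choices: a general-purpose compressor (zlib here) cannot shrink the output, yet the few lines of the generate function plus the seed reproduce it exactly.

```python
import random
import zlib

def generate(seed: int, n: int) -> bytes:
    """Produce n pseudo-random bytes that are fully determined by the seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

data = generate(seed=12345, n=100_000)
packed = zlib.compress(data, 9)
print(len(data), "bytes in,", len(packed), "bytes out of zlib")

# The "model" (this function plus the seed) regenerates the data exactly.
print(generate(12345, 100_000) == data)
```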
That's why there is "no limit", and Kolmogorov complexity states just that: you can never guarantee that there is not a better way to "model" the data.
The second part is computable: entropy is where you find the "Shannon limit".
Given a set of symbols (typically, the output symbols from the model), which are part of an alphabet, you can compute the optimal cost, and find a way to reach the proven ultimate compression limit, which is the Shannon limit.
Huffman is optimal with regard to the Shannon limit if you accept the limitation that each symbol must be encoded using an integer number of bits. This is a close but imperfect approximation. Better compression can be achieved by using fractional bits, which is what arithmetic coders offer, as does the more recent ANS-based Finite State Entropy coder. Both get much closer to the Shannon limit.
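The gap caused by whole-bit code lengths is easy to see on a skewed distribution. The sketch below builds Huffman code lengths with a heap and compares the resulting average length with the entropy; the three-symbol distribution is an arbitrary example, and only code lengths (not the actual bit patterns) are computed.

```python
import heapq
import math

def huffman_lengths(probs: dict[str, float]) -> dict[str, int]:
    """Return the Huffman code length of each symbol for the given probabilities."""
    # Heap items: (probability, tiebreak, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1             # every symbol under the merged node gets one bit deeper
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

probs = {"a": 0.9, "b": 0.05, "c": 0.05}
lengths = huffman_lengths(probs)
average = sum(probs[s] * lengths[s] for s in probs)
entropy = sum(p * math.log2(1 / p) for p in probs.values())
print(lengths, f"average {average:.3f} bits vs entropy {entropy:.3f} bits")
```

Here the most frequent symbol ideally costs about 0.15 bits, but Huffman cannot assign it less than 1 bit, so the average is roughly twice the entropy. A fractional-bit coder avoids that rounding.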
The Shannon limit only applies if you treat a set of symbols "individually". As soon as you try to "combine them", or find any correlations between the symbols, you are "modeling". And this is the territory of Kolmogorov Complexity, which is not computable.
No. It can be proven that there is not even an algorithm to determine how well a perfect compressor will do. See Kolmogorov Complexity.
Huffman coding (or arithmetic coding) by itself does not get close to the best compression. Other techniques need to be used to take advantage of higher order redundancies in the data.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some existing compression models that are widely used, and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use any of these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent that I feel ashamed about sharing it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
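To illustrate the difference between a zeroth-order word model and one that conditions on the previous word, here is a small sketch that estimates both costs in bits per word. The toy sentence and function names are assumptions made for this example, and on such a tiny sample the conditional estimate is unrealistically low, so only the direction of the gap is meaningful.

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    """Shannon entropy in bits of the empirical distribution given by counts."""
    n = sum(counts.values())
    return sum(c / n * math.log2(n / c) for c in counts.values())

def word_model_costs(text: str) -> tuple[float, float]:
    """Return (zeroth-order, first-order conditional) entropy in bits per word."""
    words = text.split()
    pairs = list(zip(words, words[1:]))
    h0 = entropy(Counter(nxt for _, nxt in pairs))
    # H(next | prev) = H(prev, next) - H(prev)
    h1 = entropy(Counter(pairs)) - entropy(Counter(prev for prev, _ in pairs))
    return h0, h1

h0, h1 = word_model_costs("the cat sat on the mat and the cat ate the rat")
print(f"word frequencies only: {h0:.2f} bits/word, given previous word: {h1:.2f} bits/word")
```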
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the data you are trying to compress follows a particular pattern or distribution, then it's possible to be really efficient about it, but for general purposes (audio, video, text, or encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman Coding uses frequency on letters. You can do the same with words or with letter frequency in more dimensions, i.e. combinations of letters and their frequency.

What's the best entropy encoding scheme to compress symbols with a known probability distribution?

I'm looking to encode user_ids in a long list of call records. The parts of these records that take up the most space are the symbols for the caller and receiver. I will create a map that assigns the most active callers shorter symbols; this will help keep the overall size of the files (and therefore the I/O time) down.
I know in advance how many times each symbol will be used; in other words, I know the relative probability distribution. Furthermore, it is not important that the codes produced be "prefix free", as Huffman codes are. So what's the best encoding scheme, i.e., the one that will deliver the most compression and for which a quick implementation exists?
An answer should not only point to a compression scheme, it should also point to an implementation of that encoding scheme.
For general-purpose lossless encoding with a known probability distribution, aside from Huffman coding, the other "textbook" answer is arithmetic coding.
In practice, there are a variety of implementations. See these general-purpose coders. Each has different properties. Without further information, we can't give you a more precise answer.
@conradlee: re "In what cases is arithmetic coding better than Huffman coding?" In terms of compression, nearly always. If you have a symbol, S, with probability Ps, then the ideal number of bits to code it with, bs, is -log(Ps)/log(2). For example, if Ps is 1/3, then bs is ~1.585 bits. With Huffman you have to round up or down to a whole number of bits (so the compression ratio will decrease). Arithmetic coding will store it with a fractional number of bits.
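The arithmetic in that example can be worked through as follows. The sketch assumes an alphabet of three equally likely symbols, for which a Huffman code necessarily uses lengths of 1, 2 and 2 bits, and a message length of 1000 symbols chosen only for illustration.

```python
import math

def ideal_bits(p: float) -> float:
    """Ideal code length in bits for a symbol of probability p: -log(p)/log(2)."""
    return -math.log(p) / math.log(2)

n = 1000                                    # hypothetical message length in symbols
ideal_total = n * ideal_bits(1 / 3)         # what a fractional-bit coder can approach
huffman_total = n * (1 + 2 + 2) / 3         # expected cost with code lengths 1, 2, 2

print(f"ideal per symbol: {ideal_bits(1 / 3):.3f} bits")
print(f"{n} symbols: about {ideal_total:.0f} bits ideal vs {huffman_total:.0f} bits with Huffman")
```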