Compressing a string of 1's and 0's containing the same number of 1's as 0's

I have a string of 1's and 0's in which the number of 1's and 0's is the same. I would like to compress it into a number that is smaller in terms of the bits needed to store it. Also, converting between the compressed and uncompressed forms must not require a lot of work.
For example, ordering all possible strings and numbering them off and letting this number be the compressed data would be too much work.
An easy solution would be to let the compressed data be just the first n-1 characters of a string of length n (the last bit is determined by the requirement of equal counts). Converting between the compressed and decompressed data would be easy, but this offers little compression: only one bit per string.
I would like an algorithm that would compress a string with this property (same number of ones and zeros) that can be generalized to a string with any even length. I would also like it to compress more than the method described above.
Thanks for the help.

This is a combination problem, N items taken k at a time.
The example from your comment, length 10 taken 5 at a time, means there are only C(10,5) = 252 unique patterns, which fit into an 8-bit value instead of a 10-bit one. SEE: WIKI: Combinations
For expanding an index value in the range 0-251 back into a pattern, there are examples here:
SEE: Algorithm to return all combinations of k elements from n
While expanding, you can use each extracted element to set the corresponding bit position in the reconstructed value, which is O(1) per element. If the list is not in the millions, you could also pre-compute a lookup table, which is much faster for translating an index value to its decoded value; i.e. build a list of all possible patterns and look up the translation.
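A minimal C++ sketch of that indexing idea (the combinatorial number system), assuming everything fits in 64-bit arithmetic; rank_string and unrank_string are illustrative names, not from an existing library:
#include <cstdint>
#include <string>

// C(n, k), computed term by term; each intermediate product divides evenly.
static uint64_t binom(unsigned n, unsigned k) {
    if (k > n) return 0;
    uint64_t r = 1;
    for (unsigned i = 0; i < k; ++i)
        r = r * (n - i) / (i + 1);
    return r;
}

// String of n bits with exactly k ones -> its lexicographic rank in [0, C(n,k)).
uint64_t rank_string(const std::string& s, unsigned k) {
    uint64_t r = 0;
    unsigned n = static_cast<unsigned>(s.size());
    for (unsigned i = 0; i < n; ++i)
        if (s[i] == '1') {
            r += binom(n - 1 - i, k);   // all strings with a '0' here come first
            --k;
        }
    return r;
}

// Rank -> string; the exact inverse of rank_string.
std::string unrank_string(uint64_t r, unsigned n, unsigned k) {
    std::string s(n, '0');
    for (unsigned i = 0; i < n; ++i) {
        uint64_t c = binom(n - 1 - i, k);
        if (r >= c) {                   // choosing '1' here skips those c smaller strings
            s[i] = '1';
            r -= c;
            --k;
        }
    }
    return s;
}
For the length-10 example this maps each of the 252 valid strings to a distinct value in 0-251 and back, so the compressed form needs only 8 bits instead of 10.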

How can we compress DNA string efficiently

DNA strings can be of any length and comprise any combination of five characters (A, T, G, C, N).
What would be an efficient way of compressing a DNA string over this five-character alphabet? Instead of spending 3 bits per character, can we compress and retrieve efficiently using fewer bits? Can anybody suggest pseudocode for efficient compression and retrieval?
You can if you are willing to (a) use a different bit size for each character and (b) always read from the start, never from the middle. Then you can have a code something like:
A - 00
T - 01
G - 10
C - 110
N - 111
Reading from left to right, a stream of bits can be split into characters in only one way: read 2 bits at a time, and if they are "11", read one more bit to know which character it is.
This is based on the Huffman coding algorithm.
Note:
I don't know much about DNA, but if the characters are not equally probable (20% each), you should assign the shortest codes to those with the highest probability.
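A small C++ sketch of this code (function names are mine, and bits are kept in a vector<bool> for readability rather than packed into bytes):
#include <string>
#include <vector>

// Encode using the prefix code above: A=00, T=01, G=10, C=110, N=111.
std::vector<bool> encode(const std::string& dna) {
    std::vector<bool> bits;
    for (char c : dna) {
        const char* code;
        switch (c) {
            case 'A': code = "00";  break;
            case 'T': code = "01";  break;
            case 'G': code = "10";  break;
            case 'C': code = "110"; break;
            default:  code = "111"; break;   // 'N'
        }
        for (const char* p = code; *p; ++p)
            bits.push_back(*p == '1');
    }
    return bits;
}

// Decode left to right: two bits, plus a third bit only after seeing "11".
// Assumes a well-formed stream (no truncated final code).
std::string decode(const std::vector<bool>& bits) {
    std::string dna;
    for (size_t i = 0; i < bits.size(); ) {
        if (!bits[i])          { dna += bits[i + 1] ? 'T' : 'A'; i += 2; }
        else if (!bits[i + 1]) { dna += 'G';                     i += 2; }
        else                   { dna += bits[i + 2] ? 'N' : 'C'; i += 3; }
    }
    return dna;
}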
You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).
In 32 bits you can fit floor(log5(2^32)) = 13 base-5 values.
In 64 bits you can fit floor(log5(2^64)) = 27 base-5 values.
The encoding process would be:
uint8_t *input = /* base-5 encoded DNA letters */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;   /* append the next base-5 digit */
}
And decoding:
uint8_t *output = /* allocate a 27-byte buffer */;
uint64_t packed = /* next encoded chunk */;
for (int i = 26; i >= 0; --i) {
    output[i] = packed % 5;   /* digits come out last-packed first, so fill from the end */
    packed /= 5;
}
There are plenty of compression methods, but the main question is: what data do you want to compress?
1. Raw unaligned sequenced data from the sequencing machine (fastq)
2. Aligned data (sam/bam/cram)
3. Reference genomes
For raw reads (1), you should reorder your reads, putting reads from close genome positions next to each other. For instance, this allows plain gzip to compress about 3 times better. There are many ways to do this: you can align the fastq to bam and then export back to fastq; use a suffix tree/array to find similar reads, the way most aligners work (needs a lot of memory); or use minimizers - a very fast, low-memory solution, but not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (as in de-novo assembly).
Statistical coding like Huffman / arithmetic would then compress to about 1/3 of the original size (one can pass the Huffman stream to a binary arithmetic coder to gain another 20%).
For aligned data (2), the best results come from reference-based compression: just store the differences between the reference and the aligned read.
For reference genomes (3), little can be done. Using statistical coding you can get 2-3 bits per nucleotide.
Frankly, I would start with some version of Lempel-Ziv compression (a class of compression algorithms that includes the general-purpose gzip compression format). I note that some of the comments say that general-purpose compression algorithms don't work well on raw genome data, but their effectiveness depends on how the data is presented to them.
Note that most general-purpose compression programs (like gzip) examine their input on a per-byte basis. This means that "pre-compressing" the genome data at 3 bits/base is counterproductive; instead, you should format the uncompressed genome data at one byte per base before running it through a general-purpose compressor. ASCII "AGTCN" coding should be fine, as long as you don't add noise by including spaces, newlines, or variations in capitalization.
Lempel-Ziv compression methods work by recognizing repetitive substrings in their input, then encoding them by reference to the preceding data; I'd expect this class of methods should do a reasonably good job on appropriately presented genome data. A more genome-specific compression method may improve on this, but unless there's some strong, non-local constraint on genome encoding that I'm unaware of, I would not expect a major improvement.
We can use a combination of Roee Gavirel's idea and the following for an even tighter result. Since Roee's idea still maps two of our five characters to 3-bit words, sections of the sequence where at least one of the five characters does not appear but one of the 3-bit words does could be mapped to 2-bit words, reducing the final size.
The condition for switching mappings is a section where at least one of the five characters does not appear and one of the 3-bit words appears at least 2x(section-prefix length in bits) + 1 times. If we order the possible characters (for example, alphabetically), then three bits indicating a specific missing character (if more than one is missing, we choose the first in order; one value is reserved for "none missing") immediately give a consistent 2-bit mapping for the other four characters.
Two ideas for our prefixes:
(1)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: a constant number of bits representing section length. For maximal length sections of 65000, we could assign x = 16.
To justify the prefix use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 39 times or more.
(2)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: the number of bits in the next section of the prefix - depends on how many characters the longest section could be. x = 6 would imply the maximal section length could be 2^(2^6)! Unlikely. For maximal length sections of 65000, we could assign x = 4;
the number of bits indicated in the previous part of the prefix, indicating the current section length.
In the example just above, our prefix length could vary between, say, 11 and 23 bits, which means that to justify its use we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 23 to 47 times or more, depending on the actual prefix length.

How can I efficiently save a sparse array in a file in C++?

I have an array of doubles with 6 indices, and it is mostly filled with zeros. I don't know yet what type I should use to store it in memory.
But, most importantly:
I would like to save it into a file (a binary file?).
What is the most efficient way to save it?
One requirement is that I can run through all the non-zero entries without passing by the zeros.
If I run 6 nested for loops I'll need too many lives.
Moreover, I don't know how to practically save it: Do I need two files, one acting as an index and the second one containing all the values?
Thanks!
This is probably a solved problem; there are probably sparse-matrix libraries that give you efficient in-memory representations too. (e.g. each row is a list of index:value, stored in a std::vector, linked list, hash, or other data structure, depending on whether inserting single non-zero values in the middle is valuable or whatever other operation is important).
A binary format will be faster to store/load, but whether you go binary or text isn't important for some ways of representing a sparse array. If you write a binary format, endian-agnostic code is a good way to make sure it's portable and doesn't have bugs that only show up on some architectures.
Options:
1. Simple but kind of ugly: gzip / lz4 / lzma the buffer holding your multidimensional array, writing the result to disk. Convert to little-endian on the fly while saving/loading, or store an endianness flag in the format.
2. Store all 6 indices with each non-zero value. If many of the inner-most arrays have no non-zero values, this can be a good fit. Every non-zero value gets a separate record (a line, in a text-based format); a binary sketch of this option appears after the list. Sample line (triple-nested example for readability, extends to 6 just fine):
dimensions on the first line or something
a b c val
...
3 2 5 -3.1416
means: matrix[3][2][5] = -3.1416
3. Use a nested sparse-array representation: each row is a list of index:value. Non-present indices are zero. A text format could use spaces and newlines to separate things; a binary format could use a length field at the start of each row or a sentinel value at the end.
You could flatten the multidimensional array out to one linear index for storage with 32bit integer indices, or you could represent the nesting somehow. I'm not going to try to make up a text format for this, since it got ugly as I started to think about it.
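A rough C++ sketch of option 2, writing one fixed-width binary record per non-zero entry; the record layout, struct, and function names are only illustrative (fields are written in host byte order, so a portable reader would need the endianness handling mentioned above):
#include <cstdint>
#include <cstdio>
#include <vector>

struct Entry {
    uint16_t idx[6];   // the six indices of this entry
    double   value;    // the non-zero value stored there
};

// Writes a record count followed by (6 x uint16 indices + double) per entry.
bool save_sparse(const char* path, const std::vector<Entry>& entries) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    uint64_t count = entries.size();
    std::fwrite(&count, sizeof count, 1, f);
    for (const Entry& e : entries) {
        std::fwrite(e.idx, sizeof(uint16_t), 6, f);
        std::fwrite(&e.value, sizeof(double), 1, f);
    }
    return std::fclose(f) == 0;
}

std::vector<Entry> load_sparse(const char* path) {
    std::vector<Entry> entries;
    FILE* f = std::fopen(path, "rb");
    if (!f) return entries;
    uint64_t count = 0;
    if (std::fread(&count, sizeof count, 1, f) == 1) {
        entries.resize(count);
        for (Entry& e : entries) {
            std::fread(e.idx, sizeof(uint16_t), 6, f);
            std::fread(&e.value, sizeof(double), 1, f);
        }
    }
    std::fclose(f);
    return entries;
}
Running through the non-zero entries is then a single loop over the loaded vector, so you never touch the zeros.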
A regular flat representation of a 6 dimension array ...
double[10][10][10][10][10][10] = 1million entries * 8 bytes ~= 8MB
An associative array Index:Value representation, assume 50% of entries are 0.0 ... using a 4 byte 32bit index ...
500,000 * 4 bytes + 500,000 * 8 bytes ~= 6MB
A bitmap representation of the sparse array, assume 50% of entries are 0.0 ... bits are set so that every byte represents 8 entries in the array; 10000001b would mean 8 entries where only the first and last are stored, and the 6 middle values are omitted since they are zero ...
ceil(1million / 8) bytes + 500,000 * 8 bytes ~= 4.125MB

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.S. It will be a lot of data, so saving 5% means a lot here. There won't likely be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does - it looks for redundant/repeatable parts and replaces them with a reference to the same data observed earlier in the stream.
If you want to make your data format more efficient, you should remove everything that can possibly be removed. For example, it is more efficient to store numbers in binary rather than in text (JSON, XML, etc). If you have to use a text format, consider removing unnecessary spaces and linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least of which is storing numbers as a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000).
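To illustrate the variable-byte idea, here is a sketch of the base-128 "varint" scheme that protobuf uses; this is not Google's code, just the general technique:
#include <cstdint>
#include <vector>

// Each output byte carries 7 payload bits; the high bit says "more bytes follow",
// so small numbers take 1 byte and larger ones grow only as needed.
void put_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

uint64_t get_varint(const uint8_t*& p) {
    uint64_t v = 0;
    int shift = 0;
    while (*p & 0x80) {
        v |= static_cast<uint64_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    v |= static_cast<uint64_t>(*p++) << shift;
    return v;
}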
Whether text or binary, if you can sort your data before sending it, you increase the gzip compressor's chances of finding redundant parts and will most likely improve the compression ratio.
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly so that the padding bits added to reach the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, as it is for samples of a waveform, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
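A toy sketch of that delta step for 16-bit samples (names are illustrative); the resulting small differences are what the variable-length or Huffman coding then exploits:
#include <cstdint>
#include <vector>

// Store the running differences instead of the samples themselves.
std::vector<int32_t> to_deltas(const std::vector<int16_t>& samples) {
    std::vector<int32_t> deltas;
    int32_t prev = 0;
    for (int16_t s : samples) {
        deltas.push_back(s - prev);   // small when the signal changes slowly
        prev = s;
    }
    return deltas;
}

// Exact inverse: accumulate the differences back into samples.
std::vector<int16_t> from_deltas(const std::vector<int32_t>& deltas) {
    std::vector<int16_t> samples;
    int32_t prev = 0;
    for (int32_t d : deltas) {
        prev += d;
        samples.push_back(static_cast<int16_t>(prev));
    }
    return samples;
}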
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
Use binary instead of text.
A float in its text representation needs up to nine significant digits for an exact round trip; with a decimal separator and a field separator, that is around 11 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
But although this makes a lot of difference for the uncompressed file, these differences might disappear after compression, since the compression algorithm should implicitly take care of that. But you may try.

Highly compressing a grid of numbers

I have a square grid which contains numbers and I need to compress it a lot so it can easily be transferred over a network. For instance, I need to be able to compress a 40x40 grid into less than 512 bytes, regardless of the values of the numbers in the grid. That is my basic requirement.
Each of the grid's cells contains a number from 0-7, so each cell can fit in 3 bits.
Does anyone know of a good algorithm that can achieve what I want?
You can encode your information differently. You don't need to assign all numbers 0 to 7 a code with the same number of bits; you can assign codes based on how often each number appears in the sequence.
First read the whole sequence, counting the number of appearances of each number.
Based on that, you can assign a code to each number.
If you assign the codes following, for example, Huffman coding, the codes will be prefix codes, meaning no extra symbol is needed to separate numbers.
There are certain variations that you can introduce on the algorithm based on your test results to fine tune the compression ratio.
I used this technique in a (university) project and it provides, in general, good results. At the least it should approximate your theoretical 3 bits per character, and it can be much better if the distribution of probabilities helps.
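A rough C++ sketch of the counting-to-code-lengths step described above (the function name is mine); assigning actual bit patterns, e.g. canonical Huffman codes, would then follow from the lengths:
#include <cstdint>
#include <queue>
#include <vector>

// freq[s] = number of times symbol s (0-7 here) appears in the grid.
// Returns each symbol's Huffman code length; unused symbols get length 0.
// (A degenerate grid with a single distinct symbol would need a 1-bit minimum.)
std::vector<int> huffman_code_lengths(const std::vector<uint64_t>& freq) {
    struct Node { uint64_t weight; int index; };
    auto cmp = [](const Node& a, const Node& b) { return a.weight > b.weight; };
    std::priority_queue<Node, std::vector<Node>, decltype(cmp)> pq(cmp);

    std::vector<int> parent(freq.size(), -1);   // leaves first, internal nodes appended
    for (size_t s = 0; s < freq.size(); ++s)
        if (freq[s] > 0) pq.push({freq[s], static_cast<int>(s)});

    while (pq.size() > 1) {                     // repeatedly merge the two lightest nodes
        Node a = pq.top(); pq.pop();
        Node b = pq.top(); pq.pop();
        int p = static_cast<int>(parent.size());
        parent.push_back(-1);
        parent[a.index] = parent[b.index] = p;
        pq.push({a.weight + b.weight, p});
    }

    std::vector<int> lengths(freq.size(), 0);
    for (size_t s = 0; s < freq.size(); ++s)    // code length = depth of the leaf
        for (int n = static_cast<int>(s); parent[n] != -1; n = parent[n])
            ++lengths[s];
    return lengths;
}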
What you want to do is perform a Burrows-Wheeler transform on your data, and then compress it. Run-length encoding will be enough in this case.
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
This will likely outperform Huffman coding in your case.
It is true that in some cases you will need more than 512 bytes. So in your protocol just make an exception for "perverse" grids. But in the general case you should be under 512 easily.
As others have stated, the problem as stated is not possible, since 600 bytes are required to represent all possible grids. The 600 bytes come from 40 rows, 40 columns, 3 bits per cell, and 8 bits per byte (40 * 40 * 3 / 8). As Kerrek SB explained in the comments, you pack 8 cells into 3 bytes.
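For reference, a minimal sketch of that packing, assuming the grid arrives as a flat vector of 1600 values in 0-7 (names are illustrative):
#include <cstdint>
#include <vector>

// Pack 3 bits per cell, MSB-first within each byte: 40*40*3/8 = 600 bytes.
std::vector<uint8_t> pack_grid(const std::vector<uint8_t>& cells) {
    std::vector<uint8_t> out((cells.size() * 3 + 7) / 8, 0);
    size_t bit = 0;
    for (uint8_t c : cells)
        for (int b = 2; b >= 0; --b, ++bit)
            if (c & (1u << b))
                out[bit / 8] |= static_cast<uint8_t>(1u << (7 - bit % 8));
    return out;
}

// Exact inverse: read 3 bits per cell back out of the packed buffer.
std::vector<uint8_t> unpack_grid(const std::vector<uint8_t>& packed, size_t n_cells) {
    std::vector<uint8_t> cells(n_cells, 0);
    size_t bit = 0;
    for (size_t i = 0; i < n_cells; ++i)
        for (int b = 2; b >= 0; --b, ++bit)
            if (packed[bit / 8] & (1u << (7 - bit % 8)))
                cells[i] |= static_cast<uint8_t>(1u << b);
    return cells;
}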
In your own comments, you mentioned this was game state being transferred over the network. Assuming you have a mechanism to assure reliable transport of the data, then if there is a reasonable limit to the number of cells that can change between updates, or if you are allowed to send your updates when a certain number of cells change, you can achieve a representation in 512 bytes. If you use 1 bit to represent whether or not a cell has changed, you would need 200 bytes. Then, you have 312 remaining bytes to represent the new values of the cells that have changed. So, you can represent up to 312*8/3 = 832 modified cells.
As an aside, this representation can represent up to 1064 changed cells in less than 600 bytes.

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd bytes
xxyy
xxyy
...
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
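A rough C++ sketch of the factoring-out step, assuming each point is a pair of 4-byte values whose upper 16 bits rarely change (struct and function names are mine):
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Point { uint32_t x, y; };

struct Group {
    uint16_t x_hi, y_hi;                             // shared upper halves, stored once
    std::vector<std::pair<uint16_t, uint16_t>> lo;   // per-point lower halves
};

// Group points by the upper 16 bits of X and Y; each group's header is emitted once.
std::vector<Group> factor_points(const std::vector<Point>& pts) {
    std::map<std::pair<uint16_t, uint16_t>, Group> groups;
    for (const Point& p : pts) {
        uint16_t xh = static_cast<uint16_t>(p.x >> 16);
        uint16_t yh = static_cast<uint16_t>(p.y >> 16);
        Group& g = groups[{xh, yh}];
        g.x_hi = xh;
        g.y_hi = yh;
        g.lo.push_back({static_cast<uint16_t>(p.x), static_cast<uint16_t>(p.y)});
    }
    std::vector<Group> out;
    for (auto& kv : groups)
        out.push_back(std::move(kv.second));
    return out;
}
Serializing each group as its XXYY header followed by its xxyy pairs, and only then running zlib over the result, removes the duplicated high bytes before the general-purpose compressor ever sees them.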
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record in bytes.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve-bit signed displacements for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode abs(x)<<1 + (x < 0 ? 1 : 0)).
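A small sketch of that combination: the signed-to-unsigned mapping mentioned above followed by an Elias gamma coder, which gives short codes to small magnitudes (encoder only; the decoder counts the leading zeros and then reads that many bits after the first one-bit):
#include <cstdint>
#include <vector>

// Map signed deltas to unsigned: 0, 1, -1, 2, -2 -> 0, 2, 3, 4, 5.
uint64_t to_unsigned(int64_t x) {
    return (static_cast<uint64_t>(x < 0 ? -x : x) << 1) | (x < 0 ? 1 : 0);
}

// Elias gamma: floor(log2 v) zero bits, then v in binary.
// Gamma codes start at 1, so encode value + 1.
void gamma_encode(std::vector<bool>& bits, uint64_t v) {
    ++v;
    int nbits = 0;
    for (uint64_t t = v; t > 1; t >>= 1) ++nbits;        // floor(log2(v))
    for (int i = 0; i < nbits; ++i) bits.push_back(false);
    for (int i = nbits; i >= 0; --i) bits.push_back(((v >> i) & 1) != 0);
}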
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.