How can we compress DNA string efficiently - c++

DNA strings can be of any length and consist of any combination of the 5 letters (A, T, G, C, N).
What is an efficient way to compress a DNA string over this 5-letter alphabet? Instead of using 3 bits per letter, can we compress and retrieve it efficiently using fewer bits? Can anybody suggest pseudocode for efficient compression and retrieval?

You can, if you are willing to (a) use a different number of bits for each character and (b) always read from the start, never from the middle. Then you can use a code like this:
A - 00
T - 01
G - 10
C - 110
N - 111
Reading from left to right you can only split a stream of bits to chars in one way. You read 2 bits at a time and if they are "11" you need to read one more bit to know what char it is.
This is based on Huffman Coding Algorithm
I don't know much about DNA, but if the letters are not equally probable (that is, not 20% each), you should assign the shortest codes to the most frequent letters.
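A minimal C++ sketch of the table above, assuming the input contains only the letters A, T, G, C and N. The names encodeDna, decodeDna and the Bits alias are made up for illustration, and bits are kept one per bool for clarity rather than packed into bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper type: one bool per bit, chosen for readability, not speed.
using Bits = std::vector<bool>;

// Encode with the table above: A=00, T=01, G=10, C=110, N=111.
Bits encodeDna(const std::string& dna) {
    Bits out;
    for (char c : dna) {
        const char* code = nullptr;
        switch (c) {
            case 'A': code = "00";  break;
            case 'T': code = "01";  break;
            case 'G': code = "10";  break;
            case 'C': code = "110"; break;
            case 'N': code = "111"; break;
            default: throw std::invalid_argument("unexpected DNA letter");
        }
        for (const char* p = code; *p != '\0'; ++p) out.push_back(*p == '1');
    }
    return out;
}

// Decode by reading 2 bits, and a third bit only when the first two are "11".
std::string decodeDna(const Bits& bits) {
    std::string out;
    std::size_t i = 0;
    while (i < bits.size()) {
        if (i + 2 > bits.size()) throw std::invalid_argument("truncated stream");
        const bool b0 = bits[i];
        const bool b1 = bits[i + 1];
        i += 2;
        if (!b0) {
            out.push_back(b1 ? 'T' : 'A');
        } else if (!b1) {
            out.push_back('G');
        } else {
            if (i >= bits.size()) throw std::invalid_argument("truncated stream");
            out.push_back(bits[i++] ? 'N' : 'C');
        }
    }
    return out;
}
```

A real implementation would pack the bits into bytes and store the letter count, since the final byte is usually only partly used.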

You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).
In 32 bits you can fit floor(log5(2^32)) = 13 base-5 values.
In 64 bits you can fit floor(log5(2^64)) = 27 base-5 values.
The encoding process would be:
uint8_t *input = /* base-5 encoded DNA letters */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;  // the first letter ends up as the most significant digit
}
And decoding:
uint8_t *output = /* allocate buffer of 27 bytes */;
uint64_t packed = /* next encoded chunk */;
for (int i = 26; i >= 0; --i) {  // the last letter packed is the least significant digit, so fill from the back
    output[i] = packed % 5;
    packed /= 5;
}
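For reference, here is a self-contained sketch of the same base-5 idea working directly on letters. The function names (toDigit, pack27, unpack27) are hypothetical, and the sketch assumes each chunk holds exactly 27 letters; a real format would also need to handle a shorter final chunk.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Map a letter to its base-5 digit, using the mapping above (A=0, T=1, G=2, C=3, N=4).
std::uint8_t toDigit(char c) {
    switch (c) {
        case 'A': return 0;
        case 'T': return 1;
        case 'G': return 2;
        case 'C': return 3;
        case 'N': return 4;
        default: throw std::invalid_argument("unexpected DNA letter");
    }
}

// Pack exactly 27 letters into one 64-bit word (5^27 < 2^64).
// The first letter becomes the most significant base-5 digit.
std::uint64_t pack27(const std::string& letters) {
    if (letters.size() != 27) throw std::invalid_argument("need exactly 27 letters");
    std::uint64_t packed = 0;
    for (char c : letters) packed = packed * 5 + toDigit(c);
    return packed;
}

// Unpack by peeling off the least significant digit, filling the string from the back.
std::string unpack27(std::uint64_t packed) {
    static const char kLetters[] = "ATGCN";
    std::string out(27, 'A');
    for (int i = 26; i >= 0; --i) {
        out[static_cast<std::size_t>(i)] = kLetters[packed % 5];
        packed /= 5;
    }
    return out;
}
```

This works out to 64/27, or about 2.37 bits per letter, close to the theoretical log2(5) of about 2.32 for five equally likely symbols.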

There are plenty of methods to compress, but the main question is what data you want to compress?
1. Raw unaligned sequenced data from the sequencing machine (fastq)
2. Aligned data (sam/bam/cram)
3. Reference genomes
For raw reads (1), reorder them so that reads from nearby genome positions sit next to each other. This alone can let plain gzip compress about 3 times better. There are many ways to do this: align the fastq to bam and then export back to fastq; use a suffix tree/array to find similar reads, the way most aligners work (needs a lot of memory); or use minimizers, a very fast, low-memory solution that is not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (aka de-novo alignment).
Statistical coding like Huffman / arithmetic would compress to 1/3 (one can pass the Huffman stream to a binary arithmetic coder to gain another 20%).
For aligned data (2), the best results come from reference-based compression: just store the differences between the reference and the aligned read.
For reference genomes (3), little can be done. Using statistical coding you can get 2-3 bits per nucleotide.

Frankly, I would start with some version of Lempel-Ziv compression (a class of compression algorithms that includes the general-purpose gzip compression format). I note that some of the comments say that general-purpose compression algorithms don't work well on raw genome data, but their effectiveness depends on how the data is presented to them.
Note that most general-purpose compression programs (like gzip) examine their input on a per-byte basis. This means that "pre-compressing" the genome data at 3 bits/base is counterproductive; instead, you should format the uncompressed genome data at one byte per base before running it through a general-purpose compressor. ASCII "AGTCN" coding should be fine, as long as you don't add noise by including spaces, newlines, or variations in capitalization.
Lempel-Ziv compression methods work by recognizing repetitive substrings in their input, then encoding them by reference to the preceding data; I'd expect this class of methods should do a reasonably good job on appropriately presented genome data. A more genome-specific compression method may improve on this, but unless there's some strong, non-local constraint on genome encoding that I'm unaware of, I would not expect a major improvement.

We can use a combination of Roee Gavirel's idea and the following for an even tighter result. Since Roee's idea still stipulates that two out of our five characters be mapped to a 3-bit word, sections of the sequence where at least one of the five characters does not appear but one of the 3-bit words does could be mapped to 2-bit words, reducing our final result.
The condition to switch mapping is if there exists a section where at least one of the five characters does not appear and one of the 3-bit words appears just one time more than two times our section-prefix length in bits. If we order our possible characters (for example, alphabetically), given three bits indicating a specific missing character (if there's more than one missing, we choose the first in order) or none missing, we can immediately assign a consistent 2-bit mapping for the other four characters.
Two ideas for our prefixes:
Option 1 (fixed-width length field):
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: a constant number of bits representing section length. For maximal length sections of 65000, we could assign x = 16.
To justify the prefix use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 39 times or more.
Option 2 (variable-width length field):
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: the number of bits in the next section of the prefix - depends on how many characters the longest section could be. x = 6 would imply the maximal section length could be 2^(2^6)! Unlikely. For maximal length sections of 65000, we could assign x = 4;
the number of bits indicated in the previous part of the prefix, indicating the current section length.
In this second option, our prefix length could vary between say 11 and 23 bits, which means to justify its use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears between 23 to 47 times or more.


compression zero bit sequences

I tried to find a library (C++) or an algorithm that could compress an array of bits with these properties:
There are sequences of zero bits and sequences of bits that carry the information (1 or 0).
The sequences are usually 8-24 bits long.
I need a lossless compression that takes advantage of those zero bits.
How did I come to such sequences:
I serialize various variables into byte array. I do this quite often to create snapshots, so these variables usually don't change much. I want to use this fact for compression. I don't know the type of those variables, just byte length. So I take the bytes and create diff information with the previous snapshot using XOR.
If the variable changed just a bit, there will usually be many zero bits. That's the zero bit sequence. The rest of the bits carry the information, that's the information sequence.
For every variable, there will probably be 1 zero bit sequence and 1 information sequence.
So far I was considering these algorithms:
RLE - the information sequences would mess up the result
Some symbol coding (Huffman etc.) - the data probably won't share much "symbols", it's not a text and the sequences are short. The whole array will be usually around 1000 bytes long.
If the ~1000 byte sequence has a lot of zero bytes, then just use a standard byte-oriented compression algorithm, such as zlib. You will get compression.

Saving binary data into file in c++

My algorithm produces a stream of 9-bit and 17-bit values, and I need a way to store this data in a file. I can't just store each 9-bit value as an int and each 17-bit value as an int32.
For example, if my algorithm produces 10 x 9 bits and 5 x 17 bits, the output file needs to be 22 bytes.
Another big problem is that the output file can be very large and its size is not known in advance.
The only idea I have so far is to use bool *vector;
If you have to save dynamic bits, then you should probably save two values: The first being either the number of bits (if bits are consecutive from 0 to x), or a bitmask to say which bits are valid; The second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits and it consists of unknown number of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file, you need the lengths. If you got only two possible sizes, then it can be only a single bit. 0 means 9 bit, 1 means 17 bit.
So for your example, you would need 10*(1+9)+5*(1+17) = 190 bits ~ 24 bytes. The outstanding 2 bits need to be padded with 0's so that you align at byte boundary. The fact that you will go on reading the file as if there was another entity (because you said you don't know how long the file is) shouldn't be a problem because last such padding will be always less than 9 bits. Upon reaching end of file, you can throw away the last incomplete reading.
This approach does require bit-level manipulation of a byte-level stream, which means careful masking and logical operations. BASE64 does exactly that, only it is simpler than your case: it consists only of fixed 6-bit entities, stored in a text file.
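The tagged layout described above (one tag bit, then 9 or 17 value bits, zero padding at the end) could be sketched roughly as follows. BitWriter, Tagged and readTagged are hypothetical names, and the sketch buffers everything in memory; for a very large output you would flush completed bytes to the file as you go.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit-level writer: appends bits most significant first.
class BitWriter {
public:
    // Append the low `count` bits of `value`.
    void put(std::uint32_t value, int count) {
        for (int i = count - 1; i >= 0; --i) {
            if (used_ == 0) bytes_.push_back(0);  // start a new, zero-padded byte
            if ((value >> i) & 1u) bytes_.back() |= static_cast<std::uint8_t>(0x80u >> used_);
            used_ = (used_ + 1) % 8;
        }
    }
    // Tag bit 0 means a 9-bit value follows, tag bit 1 means a 17-bit value follows.
    void putTagged(std::uint32_t value, bool wide) {
        put(wide ? 1u : 0u, 1);
        put(value, wide ? 17 : 9);
    }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
    int used_ = 0;  // number of bits already used in the last byte
};

struct Tagged {
    std::uint32_t value;
    bool wide;
};

// Read tagged values until the remaining bits cannot hold a full record (the padding).
std::vector<Tagged> readTagged(const std::vector<std::uint8_t>& bytes) {
    std::vector<Tagged> out;
    const std::size_t totalBits = bytes.size() * 8;
    auto bit = [&](std::size_t p) { return (bytes[p / 8] >> (7 - p % 8)) & 1u; };
    std::size_t pos = 0;
    while (pos < totalBits) {
        const bool wide = bit(pos) != 0;
        const std::size_t width = wide ? 17 : 9;
        if (pos + 1 + width > totalBits) break;  // incomplete trailing record: discard
        std::uint32_t v = 0;
        for (std::size_t i = 0; i < width; ++i) v = (v << 1) | bit(pos + 1 + i);
        out.push_back({v, wide});
        pos += 1 + width;
    }
    return out;
}
```

With ten 9-bit and five 17-bit values this produces 190 bits, which the writer rounds up to 24 bytes, matching the arithmetic above.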

Is there a name for this compression algorithm?

Say you have a four-byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, which produces a result of 1 to 5 bytes:
Note that in the description below the bits are one-based and go from most significant to least significant, i.e., the first bit is the most significant bit, the second bit is the next most significant bit, and so on.
If n < 128, you encode it as a single byte with the first bit set to zero.
If n >= 128 and n < 16,384, you use a two-byte integer. You set the first bit to one and the second bit to zero. Then you use the remaining 14 bits to encode the number n.
If n >= 16,384 and n < 2,097,152, you use a three-byte integer. You set the first bit to one, the second bit to one, and the third bit to zero. You use the remaining 21 bits to encode n.
If n >= 2,097,152 and n < 268,435,456, you use a four-byte integer. You set the first three bits to one and the fourth bit to zero. You use the remaining 28 bits to encode n.
If n >= 268,435,456 and n < 4,294,967,296, you use a five-byte integer. You set the first four bits to one and use the following 32 bits to store the exact value of n as a four-byte integer. The remaining bits of the first byte are unused.
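To make the scheme concrete, here is a rough C++ sketch of it. The function names encodePrefixed, decodePrefixed and lowByte are invented for illustration, and the sketch assumes the value bits are stored big-endian after the prefix bits, which the description implies but does not state outright.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: keep only the low 8 bits of v.
static std::uint8_t lowByte(std::uint32_t v) { return static_cast<std::uint8_t>(v & 0xFFu); }

// Encode n with a unary length prefix (0, 10, 110, 1110, 1111) in the first byte.
std::vector<std::uint8_t> encodePrefixed(std::uint32_t n) {
    if (n < (1u << 7))  return {lowByte(n)};
    if (n < (1u << 14)) return {lowByte(0x80u | (n >> 8)), lowByte(n)};
    if (n < (1u << 21)) return {lowByte(0xC0u | (n >> 16)), lowByte(n >> 8), lowByte(n)};
    if (n < (1u << 28)) return {lowByte(0xE0u | (n >> 24)), lowByte(n >> 16), lowByte(n >> 8), lowByte(n)};
    // Five-byte form: prefix 1111, low 4 bits of the first byte unused, then 32 value bits.
    return {0xF0, lowByte(n >> 24), lowByte(n >> 16), lowByte(n >> 8), lowByte(n)};
}

// Decode one value from the start of `in`.
std::uint32_t decodePrefixed(const std::vector<std::uint8_t>& in) {
    if (in.empty()) throw std::invalid_argument("empty input");
    const std::uint8_t first = in[0];
    std::size_t len = 1;  // 1 + number of leading one bits, capped at 5 bytes
    while (len < 5 && (first & (0x80u >> (len - 1)))) ++len;
    if (in.size() < len) throw std::invalid_argument("truncated input");
    std::uint32_t n = (len == 5) ? 0u : (first & (0xFFu >> len));
    for (std::size_t i = 1; i < len; ++i) n = (n << 8) | in[i];
    return n;
}
```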
Is there a name for this algorithm?
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
it sounds very similar to Dlugosz' Variable-Length Integer Encoding
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or trailing byte, therefore a sequence can be read starting in the middle. With your scheme a missing lead byte will make the rest of the file completely unreadable if a series of such values are stored. And reading such a sequence must start at the beginning, rather than an arbitrary location.
Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
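For comparison, a minimal sketch of a base-128 varint in the Protocol Buffers style: 7 value bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are illustrative, not taken from any library.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Encode n as a base-128 varint: low 7 bits first, high bit means "more bytes follow".
std::vector<std::uint8_t> encodeVarint(std::uint64_t n) {
    std::vector<std::uint8_t> out;
    while (n >= 0x80u) {
        out.push_back(static_cast<std::uint8_t>((n & 0x7Fu) | 0x80u));
        n >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(n));
    return out;
}

// Decode one varint from the start of `in`.
std::uint64_t decodeVarint(const std::vector<std::uint8_t>& in) {
    std::uint64_t n = 0;
    int shift = 0;
    for (std::uint8_t b : in) {
        n |= static_cast<std::uint64_t>(b & 0x7Fu) << shift;
        if ((b & 0x80u) == 0) return n;  // high bit clear: this was the last byte
        shift += 7;
        if (shift >= 64) throw std::invalid_argument("varint too long");
    }
    throw std::invalid_argument("truncated varint");
}
```

Unlike the scheme in the question, the length is not known from the first byte; the decoder finds out only when it reaches a byte with a clear high bit.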
related ways of compressing integers
In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards,
such as Elias gamma coding.
I suspect this code is one of the family of "Start/Stop Codes"
as described in:
Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly used sequences of bits and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form
XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
XXYY // next unique combination of 1st and 3rd byte
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
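One possible way to sketch this factoring in C++, assuming each coordinate is a 32-bit integer whose upper 16 bits are the slowly varying part. Point, Run, factorHighBytes and restorePoints are hypothetical names; consecutive points that share the same upper halves are grouped into one run so the original order is preserved.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Point {
    std::uint32_t x;
    std::uint32_t y;
};

// One header (the shared upper halves) followed by the varying lower halves.
struct Run {
    std::uint16_t xHigh;
    std::uint16_t yHigh;
    std::vector<std::uint16_t> xLow;
    std::vector<std::uint16_t> yLow;
};

// Group consecutive points that share the same upper 16 bits of X and Y.
std::vector<Run> factorHighBytes(const std::vector<Point>& pts) {
    std::vector<Run> runs;
    for (const Point& p : pts) {
        const std::uint16_t xh = static_cast<std::uint16_t>(p.x >> 16);
        const std::uint16_t yh = static_cast<std::uint16_t>(p.y >> 16);
        if (runs.empty() || runs.back().xHigh != xh || runs.back().yHigh != yh) {
            runs.push_back(Run{xh, yh, {}, {}});
        }
        runs.back().xLow.push_back(static_cast<std::uint16_t>(p.x & 0xFFFFu));
        runs.back().yLow.push_back(static_cast<std::uint16_t>(p.y & 0xFFFFu));
    }
    return runs;
}

// Rebuild the original point list from the runs.
std::vector<Point> restorePoints(const std::vector<Run>& runs) {
    std::vector<Point> pts;
    for (const Run& r : runs) {
        for (std::size_t i = 0; i < r.xLow.size(); ++i) {
            pts.push_back(Point{(static_cast<std::uint32_t>(r.xHigh) << 16) | r.xLow[i],
                                (static_cast<std::uint32_t>(r.yHigh) << 16) | r.yLow[i]});
        }
    }
    return pts;
}
```

A serialized form would also need a count per run, so the saving only pays off when runs are reasonably long.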
Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode (abs(x) << 1) | (x < 0 ? 1 : 0); the parentheses matter because << binds less tightly than + in C++).
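The sign tweak could look something like this in C++. foldSign and unfoldSign are made-up names, and the sketch widens to 64 bits so that the most negative 32-bit value does not overflow when its magnitude is taken.

```cpp
#include <cstdint>

// Move the sign into the lowest bit: magnitude in the upper bits, 1 for negative.
// Small magnitudes of either sign become small non-negative numbers.
std::uint64_t foldSign(std::int32_t x) {
    const std::int64_t wide = x;
    const std::uint64_t magnitude = static_cast<std::uint64_t>(wide < 0 ? -wide : wide);
    return (magnitude << 1) | (wide < 0 ? 1u : 0u);
}

// Inverse of foldSign.
std::int32_t unfoldSign(std::uint64_t folded) {
    const std::int64_t magnitude = static_cast<std::int64_t>(folded >> 1);
    return static_cast<std::int32_t>((folded & 1u) ? -magnitude : magnitude);
}
```

Note that this mapping never produces the value 1 (it would mean "negative zero"), so one code point is wasted; the folded result is what you would then feed to a universal code or varint.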
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.
This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
and output:
a bit of pseudo code
compressed[0] = uncompressed[0]
compressed[i] = uncompressed[i-1] ^ uncompressed[i]
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0]
uncompressed[i] = uncompressed[i-1] ^ compressed[i]
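The pseudo code above translates to C++ more or less directly. This is only a sketch of the transform itself, assuming 32-bit unsigned values; the function names are illustrative, and the output still needs to go through zlib or a similar compressor to actually shrink.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replace every element except the first with its XOR against the previous element.
std::vector<std::uint32_t> xorDeltaEncode(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i == 0) ? in[0] : (in[i - 1] ^ in[i]);
    }
    return out;
}

// Undo xorDeltaEncode by XOR-ing each delta with the previously restored value.
std::vector<std::uint32_t> xorDeltaDecode(const std::vector<std::uint32_t>& in) {
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i == 0) ? in[0] : (out[i - 1] ^ in[i]);
    }
    return out;
}
```

For example, the input 8, 8, 9, 9, 137 becomes 8, 0, 1, 0, 128: repeated values turn into zeros and single-bit flips into powers of two.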
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break your ints up each into 4 bytes, so i[0], i[1], i[2], ..., i[n] becomes b[0][0], b[0][1], b[0][2], b[0][3], b[1][0], b[1][1], b[1][2], b[1][3], ..., b[n][0], b[n][1], b[n][2], b[n][3]. Then write out all the b[i][0]s, followed by the b[i][1]s, b[i][2]s, and b[i][3]s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
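A small sketch of this byte-plane split, assuming 32-bit unsigned integers. The names splitBytePlanes and joinBytePlanes are hypothetical, and this version numbers the bytes from least significant (plane 0) to most significant (plane 3), which is one arbitrary but consistent choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Write byte 0 of every integer, then byte 1 of every integer, and so on.
std::vector<std::uint8_t> splitBytePlanes(const std::vector<std::uint32_t>& in) {
    const std::size_t n = in.size();
    std::vector<std::uint8_t> out(n * 4);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < 4; ++k) {
            out[k * n + i] = static_cast<std::uint8_t>((in[i] >> (8 * k)) & 0xFFu);
        }
    }
    return out;
}

// Inverse of splitBytePlanes.
std::vector<std::uint32_t> joinBytePlanes(const std::vector<std::uint8_t>& planes) {
    if (planes.size() % 4 != 0) throw std::invalid_argument("size must be a multiple of 4");
    const std::size_t n = planes.size() / 4;
    std::vector<std::uint32_t> out(n, 0u);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < 4; ++k) {
            out[i] |= static_cast<std::uint32_t>(planes[k * n + i]) << (8 * k);
        }
    }
    return out;
}
```

When neighbouring integers share their upper bytes, planes 1 to 3 become long runs of identical bytes, which is the kind of input RLE or zlib handles well.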
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you still may have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped, storing 0, 1, 2, ..., or 31 in the byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8 bit byte something like this as a starting point:
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the bit number to switch from your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligible.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be no more than 1.2 bit changes per integer to make this worthwhile.
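To check where your own data falls relative to that break-even point, you could measure the encoded size before writing a full encoder. The helper below is a hypothetical sketch that only counts bits for the scheme above (32 bits for the first integer, then a 2-bit count plus 5 bits per flipped bit) and rejects input with more than 3 changed bits.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Total size in bits of the encoding sketched above, without producing the encoding itself.
std::size_t encodedBits(const std::vector<std::uint32_t>& values) {
    if (values.empty()) return 0;
    std::size_t bits = 32;  // the first integer is stored in full
    for (std::size_t i = 1; i < values.size(); ++i) {
        // Number of bit positions that differ from the previous integer.
        const std::size_t flipped = std::bitset<32>(values[i - 1] ^ values[i]).count();
        if (flipped > 3) throw std::invalid_argument("more than 3 bits changed");
        bits += 2 + 5 * flipped;
    }
    return bits;
}
```

Comparing encodedBits(values) against 8 bits per integer (the 75% figure quoted for zlib) tells you whether the custom scheme is likely to be worth the effort.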
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.