Here is the snapshot from the AAC ISO standard:
How can I tell how many bits I should read for hcod_sf? What does 1..19 stand for?
VLC (Variable Length Coding) means that a variable number of bits is used per value. So in your example the current value occupies anywhere from 1 to 19 bits.
This may seem confusing at first, but it follows a strict logic. There will surely be a table in the Appendix defining what bit combination represents what value.
Each value is represented by a unique bit combination that can be determined by reading bits serially.
For example, consider this table:
0xxxxxxxxxxxxxxxxxx = value = 0 ( 1 bit length )
10xxxxxxxxxxxxxxxxx = value = 1 ( 2 bit length )
110xxxxxxxxxxxxxxxx = value = 2 ( 3 bit length )
1110xxxxxxxxxxxxxxx = value = 3 ( 4 bit length )
...
1111111111111111110 = value = 18 ( 19 bit length )
In this example a 0 marks the end of the value; together with the preceding 1s it encodes the VLC value. As mentioned, you have to read the bitstream serially to determine the length of the value in bits, so in this example you would read 1s until you hit the terminating 0.
In reality, these tables are often constructed with Huffman Coding and hence are more complex than in this example. You will surely find the table for your translation in the aforementioned Appendix of the specification.
How can I tell how many bits I should read for hcod_sf?
You can only determine the number of bits to be read by reading them. In practice you fetch (at least 19) bits at once and then evaluate how many of them belong to the value (often by using a lookup table).
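As a minimal sketch (assuming a hypothetical get_bit() helper that returns the next bit of the stream), serially decoding the example table above could look like this in C; the real hcod_sf table in the spec is a Huffman table, so an actual decoder would use a table lookup instead:

extern int get_bit(void);  /* hypothetical: returns the next bitstream bit (0 or 1) */

int decode_example_vlc(void)
{
    int value = 0;
    /* Count leading 1 bits; the terminating 0 ends the codeword,
       so codewords are 1 to 19 bits long for values 0..18. */
    while (get_bit() == 1)
        ++value;
    return value;
}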
What does 1..19 stand for?
It stands for the minimum (1) and maximum (19) number of bits in this particular value.
Finally:
It's quite tricky to implement this, so knowing the BMI instruction set extension is really helpful for an x86-specific way of handling this.
DNA strings can be of any length and comprise any combination of the 5 symbols A, T, G, C, N.
What would be an efficient way of compressing a DNA string over this 5-symbol alphabet (A, T, G, C, N)? Instead of spending 3 bits per symbol, can we compress and retrieve efficiently using fewer bits? Can anybody suggest pseudocode for efficient compression and retrieval?
You can, if you are willing to (a) use a different bit length for each character and (b) always read from the start and never from the middle. Then you can use a code like this:
A - 00
T - 01
G - 10
C - 110
N - 111
Reading from left to right, a stream of bits can be split into characters in only one way. You read 2 bits at a time, and if they are "11" you need to read one more bit to know which character it is.
This is based on the Huffman coding algorithm.
Note:
I don't know much about DNA, but if the characters are not equally probable (i.e., not 20% each), you should assign the shortest codes to the characters with the highest probability.
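A minimal decoding sketch for this code in C (next_bit() is a hypothetical helper that returns the next bit of the stream):

/* Hypothetical helper that returns the next bit (0 or 1) of the stream. */
extern int next_bit(void);

/* Decode one symbol of the prefix code above: A=00, T=01, G=10, C=110, N=111. */
char decode_symbol(void)
{
    int b0 = next_bit();
    int b1 = next_bit();
    if (b0 == 0)
        return b1 == 0 ? 'A' : 'T';
    if (b1 == 0)
        return 'G';
    /* "11" seen: one more bit decides between C and N. */
    return next_bit() == 0 ? 'C' : 'N';
}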
You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).
In 32 bits you can fit floor(log5(2^32)) = 13 base-5 values.
In 64 bits you can fit floor(log5(2^64)) = 27 base-5 values.
The encoding process would be:
uint8_t *input = /* base-5 encoded DNA letters, values 0..4 */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    /* Treat the letters as base-5 digits, most significant first. */
    packed = packed * 5 + *input++;
}
And decoding:
uint8_t *output = /* buffer for 27 decoded letters */;
uint64_t packed = /* next encoded chunk */;
for (int i = 26; i >= 0; --i) {
    /* Peel off base-5 digits from the least significant end, filling the
       buffer backwards so the letters come out in their original order. */
    output[i] = packed % 5;
    packed /= 5;
}
There are plenty of compression methods, but the main question is what data you want to compress:
1. Raw unaligned sequenced data from the sequencing machine (fastq)
2. Aligned data (sam/bam/cram)
3. Reference genomes
For raw reads (fastq), you should reorder your reads, putting reads from close genome positions next to each other. For instance, this alone lets plain gzip compress about 3 times better. There are many ways to do this: you can align the fastq to bam and then export back to fastq; use a suffix tree/array to find similar reads, the way most aligners work (needs a lot of memory); or use minimizers, a very fast, low-memory solution that is not good for long reads with many errors. Good results also come from de Bruijn graph construction, which is used for this purpose (as in de-novo assembly).
Statistical coding like Huffman or arithmetic coding would compress to about 1/3 (one can pass the Huffman stream to a binary arithmetic coder to gain another 20%).
The best results here come from reference-based compression: just store the differences between a reference and the aligned read.
Little can be done here. Using statistical coding you can get 2-3 bits per nucleotide.
Frankly, I would start with some version of Lempel-Ziv compression (a class of compression algorithms that includes the general-purpose gzip compression format). I note that some of the comments say that general-purpose compression algorithms don't work well on raw genome data, but their effectiveness depends on how the data is presented to them.
Note that most general-purpose compression programs (like gzip) examine their input on a per-byte basis. This means that "pre-compressing" the genome data at 3 bits/base is counterproductive; instead, you should format the uncompressed genome data at one byte per base before running it through a general-purpose compressor. ASCII "AGTCN" coding should be fine, as long as you don't add noise by including spaces, newlines, or variations in capitalization.
Lempel-Ziv compression methods work by recognizing repetitive substrings in their input, then encoding them by reference to the preceding data; I'd expect this class of methods should do a reasonably good job on appropriately presented genome data. A more genome-specific compression method may improve on this, but unless there's some strong, non-local constraint on genome encoding that I'm unaware of, I would not expect a major improvement.
We can use a combination of Roee Gavirel's idea and the following for an even tighter result. Since Roee's idea still stipulates that two out of our five characters be mapped to a 3-bit word, sections of the sequence where at least one of the five characters does not appear but one of the 3-bit words does could be mapped to 2-bit words, reducing our final result.
The condition for switching the mapping is that there exists a section where at least one of the five characters does not appear and one of the 3-bit words appears at least one more time than twice our section-prefix length in bits. If we order the possible characters (for example, alphabetically), then three bits indicating the specific missing character (if more than one is missing, we choose the first in order), or that none is missing, let us immediately assign a consistent 2-bit mapping to the other four characters.
Two ideas for our prefixes:
(1)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: a constant number of bits representing section length. For maximal length sections of 65000, we could assign x = 16.
To justify the prefix use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 39 times or more.
(2)
3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: the number of bits in the next section of the prefix - depends on how many characters the longest section could be. x = 6 would imply the maximal section length could be 2^(2^6)! Unlikely. For maximal length sections of 65000, we could assign x = 4;
the number of bits indicated in the previous part of the prefix, indicating the current section length.
In the example just above, our prefix length could vary between, say, 11 and 23 bits, which means that to justify its use we'd need a section where one of the five characters does not appear and one of the 3-bit words appears at least 23 to 47 times (depending on the actual prefix length).
I am trying to efficiently store a huge number ( > 1 billion) time series. Each value can only be 1, 0 or -1 and the value is recorded once a minute for 40,000 minutes.
I realize that each minute the value can be stored in 2 bits, but I think there is an easier way: there are a limited number of permutations for any time period, so I could just assign a number to each permutation instead of recording all the bits.
For example, if I were to take a 16 minute period: to record those values would require (16 x 2 bits) = 32 bits = 4 bytes. But presumably, I can cut that number in half (or more) if I simply assign a number to each of the 16 possible permutations.
My question: what is the formula for determining the number of permutations for 16 values? I know how to calculate it if the values can be any number, but am stumped as to how to do it when there are just 3 values.
For instance, you can simply zip the file; even with only 3 symbols you will get a good compression ratio.
If you want to do hard work, you can do what basic zip algorithms do:
You have 3 values -1, 0, and 1.
Then you can define a translation tree like:
bit sequence - symbol
0 - 0
10 - 1
110 - -1
1110 - End of data
So if you read a zero you know it is a 0 symbol, and if you read a 1 you have to read the next bit to know if it is a 1 or if you have to read one more to know if it is a -1.
So if you have a series 1,1,0,-1,0 it would translate as:
101001100
If this is all the data, you have 9 bits, so you would need to pad with something to reach 16.
Then just put an end-of-data marker and after that, anything:
10100110 01110000
To do this you need to work with bit operators.
If you know that one of these symbols occurs more often than the rest, give that symbol the code with the fewest bits (for example, the single bit 0 should represent the most frequent symbol).
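A minimal encoding sketch in C for the code above (put_bits() is a hypothetical helper that appends bits to an output buffer, most significant bit first):

#include <stdint.h>

/* Hypothetical helper: append the 'count' low bits of 'bits' (MSB first). */
extern void put_bits(uint32_t bits, int count);

/* Encode one sample using the code above: 0 -> "0", 1 -> "10", -1 -> "110". */
void encode_sample(int value)
{
    if (value == 0)
        put_bits(0x0, 1);   /* 0   */
    else if (value == 1)
        put_bits(0x2, 2);   /* 10  */
    else
        put_bits(0x6, 3);   /* 110 */
}

/* End-of-data marker: "1110". */
void encode_end(void)
{
    put_bits(0xE, 4);
}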
If -1, 0, and 1 are all equally likely, then the formula for the number of bits required for n samples is ceiling(n log2(3)). For one sample, you get two bits as you have noted, effectively wasting one of the four states, i.e. a little more than 0.4 bits per sample wasted.
As it turns out, five samples fit really nicely into eight bits, where 3^5 = 243, with only about 0.015 bits per symbol wasted.
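As a sketch of the five-samples-per-byte packing (not code from the answer; the names are made up):

#include <stdint.h>

/* Pack five samples (each -1, 0, or 1) into one byte as a base-3 number
   in the range 0..242; values 243..255 stay free for end-of-stream markers. */
uint8_t pack5(const int8_t samples[5])
{
    uint8_t packed = 0;
    for (int i = 0; i < 5; ++i)
        packed = packed * 3 + (uint8_t)(samples[i] + 1);  /* map -1,0,1 -> 0,1,2 */
    return packed;
}

/* Unpack the byte back into five samples, in the original order. */
void unpack5(uint8_t packed, int8_t samples[5])
{
    for (int i = 4; i >= 0; --i) {
        samples[i] = (int8_t)(packed % 3) - 1;
        packed /= 3;
    }
}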
You can use the extra states as end-of-stream symbols. For example, you could use five of the remaining 13 states to signal end-of-stream, indicating that there are 0, 1, 2, 3, or 4 samples remaining. Then if it's 1, 2, 3, or 4, there is one more byte with those samples. A little better would be to use three states for the 1 case, providing the sample in that byte. Then seven of the 13 states are used, requiring one byte to end the stream for the 0 and 1 cases, and two bytes to end the stream for the cases of 2, 3, or 4 remaining.
If -1, 0, and 1 have noticeably different probabilities, then you can use Huffman coding on the samples to represent the result in fewer bits than the "flat" case above. However there is only one Huffman code for one sample of three symbols, which would not give good performance in general. So you would again want to combine samples for better Huffman coding performance. (Or use arithmetic coding, but that is more involved than perhaps necessary in this case.) So you could again group five samples into one integer in the range 0..242, and Huffman code those, along with an end-of-stream symbol (call it 243) that occurs only once.
Say you have a four-byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3, 4 or 5 byte result.
Note that in the description below, the bits are one-based and go from most significant to least significant; i.e., the first bit is the most significant bit, the second bit is the next most significant bit, and so on.
If n < 128, you encode it as a single byte with the first bit set to zero.

If n >= 128 and n < 16,384, you use a two-byte integer. You set the first bit to one and the second bit to zero, then use the remaining 14 bits to encode the number n.

If n >= 16,384 and n < 2,097,152, you use a three-byte integer. You set the first bit to one, the second bit to one, and the third bit to zero, then use the remaining 21 bits to encode n.

If n >= 2,097,152 and n < 268,435,456, you use a four-byte integer. You set the first three bits to one and the fourth bit to zero, then use the remaining 28 bits to encode n.

If n >= 268,435,456 and n < 4,294,967,296, you use a five-byte integer. You set the first four bits to one and use the following 32 bits to store the exact value of n as a four-byte integer. The remaining bits of the first byte are unused.
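In code, the encoding side of this scheme might look roughly like the following sketch (the function name and the big-endian ordering of the value bytes are assumptions):

#include <stdint.h>
#include <stddef.h>

/* Sketch of the scheme described above: the number of leading 1 bits in the
   first byte tells how many extra bytes follow. Returns the number of bytes
   written to out (out must hold at least 5 bytes). */
size_t encode_scheme(uint32_t n, uint8_t *out)
{
    if (n < 128) {                       /* 0xxxxxxx */
        out[0] = (uint8_t)n;
        return 1;
    }
    if (n < 16384) {                     /* 10xxxxxx xxxxxxxx : 14 value bits */
        out[0] = 0x80 | (uint8_t)(n >> 8);
        out[1] = (uint8_t)n;
        return 2;
    }
    if (n < 2097152) {                   /* 110xxxxx ... : 21 value bits */
        out[0] = 0xC0 | (uint8_t)(n >> 16);
        out[1] = (uint8_t)(n >> 8);
        out[2] = (uint8_t)n;
        return 3;
    }
    if (n < 268435456) {                 /* 1110xxxx ... : 28 value bits */
        out[0] = 0xE0 | (uint8_t)(n >> 24);
        out[1] = (uint8_t)(n >> 16);
        out[2] = (uint8_t)(n >> 8);
        out[3] = (uint8_t)n;
        return 4;
    }
    /* 1111xxxx, then the full 32-bit value in the next four bytes */
    out[0] = 0xF0;
    out[1] = (uint8_t)(n >> 24);
    out[2] = (uint8_t)(n >> 16);
    out[3] = (uint8_t)(n >> 8);
    out[4] = (uint8_t)n;
    return 5;
}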
Is there a name for this algorithm?
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
It sounds very similar to Dlugosz' Variable-Length Integer Encoding.
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or a trailing byte, so a sequence can be decoded starting in the middle. With your scheme, a missing lead byte makes the rest of the file completely unreadable if a series of such values is stored, and reading such a sequence must start at the beginning rather than at an arbitrary location.
Varint
Using the high bit of each byte to indicate "continue" or "stop", with the remaining 7 bits of each byte interpreted as plain binary encoding the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
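For comparison, a minimal base-128 varint encoder sketch in C (least-significant 7-bit group first, as in Protocol Buffers; the function name is made up):

#include <stdint.h>
#include <stddef.h>

/* Encode n as a base-128 varint: 7 value bits per byte, least significant
   group first, high bit set on every byte except the last ("continue" flag). */
size_t varint_encode(uint64_t n, uint8_t *out)
{
    size_t len = 0;
    do {
        uint8_t byte = n & 0x7F;
        n >>= 7;
        if (n != 0)
            byte |= 0x80;   /* more bytes follow */
        out[len++] = byte;
    } while (n != 0);
    return len;
}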
There are several related ways of compressing integers.
In summary: this code represents an integer in 2 parts:
A first part, in unary code, that indicates how many bits are needed to read the rest of the value, and a second part (of the indicated width in bits), in more-or-less plain binary, that encodes the actual value.
This particular code "threads" the unary code through the binary code, but other, similar codes, such as Elias gamma coding, pack the complete unary code first and then the binary code afterwards.
I suspect this code is one of the family of "Start/Stop Codes"
as described in:
Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.
I found some interesting bit twiddling in the "source\common\unicode\utf.h" file of the ICU library (International Components for Unicode). The bit twiddling is intended for checking whether a number is in a particular range.
// Is a code point in a range of U+d800..U+dbff?
#define U_IS_LEAD(c) (((c)&0xfffffc00)==0xd800)
I have figured out that the magic number (0xfffffc00) comes from:
MagicNumber = 0xffffffff - (HighBound - LowBound)
However, I also found that the formula doesn't apply to every arbitrary range. Does somebody here know in what circumstances the formula works?
Is there another bit-twiddling trick for checking whether a number is in a particular range?
For these tricks to apply, the numbers must have some common features in their binary representation.
0xD800 == 0b1101_1000_0000_0000
0xDBFF == 0b1101_1011_1111_1111
What this test really does is to mask out the lower ten bits. This is usually written as
onlyHighBits = x & ~0x03FF
After this operation ("and not"), the lower ten bits of onlyHighBits are guaranteed to be zero. That means that if the result now equals the lower bound of the interval, the original number was somewhere in the interval.
This trick works in all cases where the lower and the upper limit of the interval start with the same bits, and at some point the lower limit has only zeroes while the upper limit has only ones. In your example this is at the tenth position from the right.
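A small self-contained C example of the trick (the helper name is made up):

#include <stdint.h>
#include <stdio.h>

/* Range check via masking: works because 0xD800..0xDBFF share the high bits
   and differ only in the low ten bits. */
static int is_lead_surrogate(uint32_t c)
{
    return (c & ~0x03FFu) == 0xD800;   /* same as (c & 0xFFFFFC00) == 0xD800 */
}

int main(void)
{
    printf("%d %d %d\n",
           is_lead_surrogate(0xD800),   /* 1: lower bound of the range */
           is_lead_surrogate(0xDBFF),   /* 1: upper bound of the range */
           is_lead_surrogate(0xDC00));  /* 0: just outside the range   */
    return 0;
}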
If you do not have 2^x boundaries, you might use the following trick:
if x >= 0 and x < N, you can check both conditions at once with:
if Longword( x ) < Longword( N ) then ...
This works due to the fact that negative signed numbers correspond to the largest numbers in the unsigned datatype.
You could extend this (when range checking is DISABLED) to:
if Longword( x - A ) < Longword ( ( B - A ) ) then ...
Now you have both tests (range [A, B)) in a SUB and a CMP plus a single Jcc, assuming (B - A) is precalculated.
I only use these kinds of optimizations when really needed; they tend to make your code less readable and they only shave off a few clock cycles per test.
Note to C like language readers: Longword is Delphi's unsigned 32bit datatype.
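For readers of C-like languages, the same trick as a small sketch in C (names are arbitrary):

#include <stdint.h>

/* Half-open range test [a, b): if x is below a, the unsigned subtraction
   wraps around to a huge value and the single comparison fails. */
static int in_range(int32_t x, int32_t a, int32_t b)
{
    return (uint32_t)x - (uint32_t)a < (uint32_t)b - (uint32_t)a;
}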
The formula works whenever the range you are looking for starts at a multiple of a power of 2 (that is, the low end of the range ends in one or more 0 bits in binary) and the span of the range is 2^n - 1 (that is, low & high == low and low | high == high).
Or, maybe, what I don't get is unary coding:
In Golomb, or Rice, coding, you split a number N into two parts by dividing it by another number M and then code the integer result of that division in unary and the remainder in binary.
In the Wikipedia example, they use 42 as N and 10 as M, so we end up with a quotient q of 4 (in unary: 11110) and a remainder r of 2 (in binary: 010), so that the resulting message is 11110,010, or 8 bits (the comma can be skipped). The simple binary representation of 42 is 101010, or 6 bits.
To me, this seems to be due to the unary representation of q, which always takes more bits than binary would.
Clearly, I'm missing some important point here. What is it?
The important point is that Golomb codes are not meant to be shorter than the shortest binary encoding for one particular number. Rather, by providing a specific kind of variable-length encoding, they reduce the average length per encoded value compared to fixed-width encoding, if the encoded values are from a large range, but the most common values are generally small (and hence are using only a small fraction of that range most of the time).
As an example, if you were to transmit integers in the range from 0 to 1000, but a large majority of the actual values were in the range between 0 and 10, in a fixed-width encoding, most of the transmitted codes would have leading 0s that contain no information:
To cover all values between 0 and 1000, you need a 10-bit wide encoding in fixed-width binary. Now, as most of your values would be below 10, at least the first 6 bits of most numbers would be 0 and would carry little information.
To rectify this with Golomb codes, you split the numbers by dividing them by 10 and encoding the quotient and the remainder separately. For most values, all that would have to be transmitted is the remainder which can be encoded using 4 bits at most (if you use truncated binary for the remainder it can be less). The quotient is then transmitted in unary, which encodes as a single 0 bit for all values below 10, as 10 for 10..19, 110 for 20..29 etc.
Now, for most of your values, you have reduced the message size to 5 bits max, but you are still able to transmit all values unambiguously without separators.
This comes at a rather high cost for the larger values (for example, values in the range 990..999 need 100 bits for the quotient), which is why the coding is optimal for 2-sided geometric distributions.
The long runs of 1 bits in the quotients of larger values can be addressed with subsequent run-length encoding. However, if the quotients consume too much space in the resulting message, this could indicate that other codes might be more appropriate than Golomb/Rice.
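For concreteness, a small Golomb encoder sketch in C for the M = 10 case from the question (put_bit() is a hypothetical helper that appends one bit to the output):

#include <stdint.h>

/* Hypothetical helper that appends a single bit to the output stream. */
extern void put_bit(int bit);

/* Golomb-encode n with divisor M = 10, as in the example above:
   42 -> quotient 4 -> 11110, remainder 2 -> 010 (truncated binary), 8 bits. */
void golomb10_encode(uint32_t n)
{
    uint32_t q = n / 10;
    uint32_t r = n % 10;

    /* Quotient in unary: q one-bits followed by a terminating zero. */
    for (uint32_t i = 0; i < q; ++i)
        put_bit(1);
    put_bit(0);

    /* Remainder in truncated binary: with M = 10, remainders 0..5 take
       3 bits, remainders 6..9 are written as r + 6 in 4 bits. */
    if (r < 6) {
        for (int b = 2; b >= 0; --b)
            put_bit((int)((r >> b) & 1));
    } else {
        r += 6;
        for (int b = 3; b >= 0; --b)
            put_bit((int)((r >> b) & 1));
    }
}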
One difference between the Golomb coding and binary code is that binary code is not a prefix code, which is a no-go for coding strings of arbitrarily large numbers (you cannot decide if 1010101010101010 is a concatenation of 10101010 and 10101010 or something else). Hence, they are not that easily comparable.
Second, the Golomb code is optimal for a geometric distribution, in this case with parameter 2^(-1/10). The probability of 42 is some 0.3%, so you get an idea of how important this is for the length of the output string.