LZ77 Extra Bits in DEFLATE - compression

In the LZ77 phase of the DEFLATE compression, extra bits are used to represent the length and distances of the back reference. However, are these extra bits concatenated onto the base values to form a unique code to be Huffman coded, or is the base value alone Huffman coded with the extra bits tacked on afterwards, i.e. during encoding?
In the first case, the lengths 11 and 12 would be different Huffman tree nodes, each with its own frequency. In the second case, 11 and 12 would share a single Huffman tree node whose frequency is the combined frequency of 11 and 12.
In the former case, the extra bits would be added pre-Huffman coding. But in the latter case, extra bits would be added after Huffman coding.
Thanks!

The base value alone is Huffman coded, with that code followed by the associated number of extra bits representing a value to be added to the base.
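As an illustration of that split, here is a minimal Python sketch that maps a match length to its Huffman-coded symbol plus the raw extra bits that follow it. The table values are the length codes from RFC 1951, section 3.2.5; the function name `split_length` is made up for this example.

```python
# Base lengths and extra-bit counts for DEFLATE length symbols 257..285
# (RFC 1951, section 3.2.5).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]


def split_length(length):
    """Return (symbol, extra_bit_count, extra_value) for a match length 3..258.

    Only `symbol` is Huffman coded; `extra_value` is written verbatim in
    `extra_bit_count` bits right after the Huffman code.
    """
    if not 3 <= length <= 258:
        raise ValueError("DEFLATE match lengths are 3..258")
    # Pick the last symbol whose base length does not exceed the length.
    index = max(i for i, base in enumerate(LENGTH_BASE) if base <= length)
    return 257 + index, LENGTH_EXTRA[index], length - LENGTH_BASE[index]


print(split_length(11))  # symbol 265, one extra bit with value 0
print(split_length(12))  # symbol 265, one extra bit with value 1
```

Lengths 11 and 12 therefore share symbol 265 and contribute to the same frequency count; only the single extra bit after the Huffman code tells them apart.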

Related

How do I represent an LZW output in bytes?

I found an implementation of the LZW algorithm and I was wondering how I can convert its output, which is a list of ints, to a byte array.
I tried using one byte per value, but for long inputs the dictionary has more than 256 entries, so the values no longer fit.
Then I tried adding an extra byte to indicate how many bytes are used to store the values, but in that case I have to use 2 bytes for each value, which doesn't compress enough.
How can I optimize this?
As bits, not bytes. You just need a simple routine that writes an arbitrary number of bits to a stream of bytes. It simply keeps a one-byte buffer into which you put bits until you have eight bits. Then write that byte, clear the buffer, and start over. The process is reversed on the other side.
When you get to the end, just write the last byte buffer if not empty with the remainder of the bits set to zero.
You only need to figure out how many bits are required for each symbol at the current state of the compression. That same determination can be made on the other side when pulling bits from the stream.
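The routine described above could look roughly like the following Python sketch. The class names `BitWriter` and `BitReader` are made up for this example, and filling each byte from its most significant bit down is an assumption; the answer does not prescribe a bit order.

```python
class BitWriter:
    """Collects bits in a one-byte buffer and emits whole bytes."""

    def __init__(self):
        self.out = bytearray()
        self.buffer = 0  # bits collected so far
        self.count = 0   # number of bits currently in the buffer (0..7)

    def write(self, value, nbits):
        # Append the low `nbits` bits of `value`, most significant first.
        for shift in range(nbits - 1, -1, -1):
            self.buffer = (self.buffer << 1) | ((value >> shift) & 1)
            self.count += 1
            if self.count == 8:
                self.out.append(self.buffer)
                self.buffer = 0
                self.count = 0

    def finish(self):
        # Flush a partial last byte, padding the remainder with zero bits.
        if self.count:
            self.out.append(self.buffer << (8 - self.count))
            self.buffer = 0
            self.count = 0
        return bytes(self.out)


class BitReader:
    """Reads bits back in the same order BitWriter wrote them."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # position in bits from the start of the data

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value
```

For LZW, the caller would pass the current code width as `nbits` on both sides, for example 9 bits while the dictionary has at most 512 entries.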
In his 1984 article on LZW, T.A. Welch did not actually state how to "encode codes", but described mapping "strings of input characters into fixed-length codes", continuing "use of 12-bit codes is common". (Allows bijective mapping between three octets and two codes.)
The BSD compress(1) command did not follow this literally. It introduced a header, the interesting part being a specification of the maximum number of bits used to encode an LZW output code, which allows decompressors to size their decompression tables appropriately or fail early and in a controlled way. Apart from the very first, codes were encoded with just the number of whole bits necessary, starting with 9.
An alternative would be to use arithmetic coding, especially if the model is something other than "every code is equally probable".

Compressing a string of 1's and 0's containing the same number of 1's as 0's

I have a string of 1's and 0's in which the number of 1's and 0's is the same. I would like to compress this into a number that is smaller in terms of the number of bits needed to store it. Also, converting between the compressed form and non compressed form needs to not require a lot of work.
For example, ordering all possible strings and numbering them off and letting this number be the compressed data would be too much work.
An easy solution would be to allow the compressed data to be just the first n-1 characters of the string where the string is of length n. Converting between the compressed and decompressed data would be easy but this offers little compression, only one bit per string.
I would like an algorithm that would compress a string with this property (same number of ones and zeros) that can be generalized to a string with any even length. I would also like it to compress more than the method described above.
Thanks for help.
This is a combination problem, N items taken k at a time.
The example in your comment, length 10 taken 5 at a time, has only 252 unique patterns, so an index into them fits in an 8-bit value instead of a 10-bit value. SEE: WIKI: Combinations
For expanding an index in the range 0-251 back into a pattern, there are examples here:
SEE: Algorithm to return all combinations of k elements from n
While extracting, you can use each extracted element to set the corresponding bit position in the reconstructed value, which is O(1) time per bit. If the number of patterns is not in the millions, you could precompute a lookup table instead, which translates an index to its decoded value much faster: build a list of all possible patterns and look up the translation.
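One way to convert between a pattern and its index without enumerating all patterns is to count, at each position, how many patterns sort earlier. The sketch below assumes lexicographic order over strings of '0' and '1' and requires Python 3.8+ for `math.comb`; the names `rank` and `unrank` are made up for this example.

```python
from math import comb


def rank(bits):
    """Index of `bits` among all strings with the same length and the same
    number of ones, in lexicographic order ('0' sorts before '1')."""
    n = len(bits)
    k = bits.count("1")
    index = 0
    for i, b in enumerate(bits):
        if b == "1":
            # Every string with '0' at this position sorts earlier; such
            # strings place all k remaining ones in the n-i-1 later positions.
            index += comb(n - i - 1, k)
            k -= 1
    return index


def unrank(index, n, k):
    """Inverse of rank(): rebuild the string of length n with k ones."""
    out = []
    for i in range(n):
        zeros_first = comb(n - i - 1, k)  # strings with '0' at position i
        if index < zeros_first:
            out.append("0")
        else:
            out.append("1")
            index -= zeros_first
            k -= 1
    return "".join(out)


print(comb(10, 5))           # 252 patterns for length 10 with five ones
print(rank("1111100000"))    # 251, the last pattern
print(unrank(251, 10, 5))    # "1111100000"
```

For a balanced string of even length n, the decoder calls `unrank(index, n, n // 2)`, so only the index needs to be stored once n is known.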

Saving binary data into a file in C++

My algorithm produces a stream of 9-bit and 17-bit values, and I need a way to store this data in a file. But I can't just store each 9-bit value as an int and each 17-bit value as an int_32.
For example, if my algorithm produces 10 x 9 bits and 5 x 17 bits, the output file size needs to be 22 bytes.
Another big problem is that the output file can be very large and its size is not known in advance.
The only idea which I have now is to use bool *vector;
If you have to save dynamic bits, then you should probably save two values: The first being either the number of bits (if bits are consecutive from 0 to x), or a bitmask to say which bits are valid; The second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits consisting of an unknown number of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file, so you need the lengths. If there are only two possible sizes, the length indicator can be a single bit: 0 means 9 bits, 1 means 17 bits.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9)+5*(1+17) = 190 bits ~ 24 bytes. The outstanding 2 bits need to be padded with 0's so that you align at byte boundary. The fact that you will go on reading the file as if there was another entity (because you said you don't know how long the file is) shouldn't be a problem because last such padding will be always less than 9 bits. Upon reaching end of file, you can throw away the last incomplete reading.
This approach does require implementing bit-level manipulation on top of the byte-level stream, which means careful masking and logical operations. BASE64 does exactly that, only in a simpler form than your case: it consists only of fixed 6-bit entities, stored in a text file.
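The tagged layout above can be sketched in a few lines. This example is in Python rather than C++ to keep it short, and it accumulates the whole stream in one big integer, which is an illustrative shortcut; for a very large file you would flush completed bytes as you go. The names `pack` and `unpack` are made up for this example.

```python
def pack(items):
    """Pack (value, width) pairs, width 9 or 17, as |tag|value| bit fields."""
    acc = 0    # all bits written so far, held as one big integer
    nbits = 0
    for value, width in items:
        tag = {9: 0, 17: 1}[width]  # 0 means 9 bits, 1 means 17 bits
        acc = (acc << 1) | tag
        acc = (acc << width) | (value & ((1 << width) - 1))
        nbits += 1 + width
    pad = -nbits % 8  # zero bits needed to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((nbits + pad) // 8, "big")


def unpack(data):
    """Inverse of pack(); trailing padding is discarded."""
    total = len(data) * 8
    acc = int.from_bytes(data, "big")
    pos = 0
    items = []
    while pos < total:
        tag = (acc >> (total - pos - 1)) & 1
        width = 17 if tag else 9
        if pos + 1 + width > total:
            break  # fewer bits left than a full entry: this is padding
        value = (acc >> (total - pos - 1 - width)) & ((1 << width) - 1)
        items.append((value, width))
        pos += 1 + width
    return items


items = [(i, 9) for i in range(10)] + [(1000 + i, 17) for i in range(5)]
data = pack(items)
print(len(data))  # 24 bytes: 190 bits plus 2 padding bits
```

The padding can never be mistaken for an entry because it is at most 7 bits, while the shortest entry (tag plus 9 bits) needs 10.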

Decoding of codes generated from shannon fano encoding algorithm

I have generated the codes for different symbol in a file using shannon fano algorithm.
Now my problem is how to store these codes in a file (since a file is made of bytes) so that, while reading, the reader can tell where the code for a particular symbol ends and does not read extra bits.
First, you can use bitwise operations to read a variable number of bits (not multiple of 8) from a byte array.
Second, if the code is a valid Prefix code, which satisfies
there is no valid code word in the system that is a prefix (start) of any other valid code word in the set
then you can determine where the code ends by comparing the prefix with a table.
Usually, this is done in the following manner:
Suppose the code length is anywhere from 1 to 16 bits.
Load the next 16 bits from the file to the variable.
Compare the 16-bit variable with a table which contains the following values. Binary search or radix search can be used.
Key: the Shannon-Fano or Huffman code, shifted so that the top bit is at the most-significant bit.
KeyLength: the actual number of bits in the Shannon-Fano or Huffman code. This allows us to subtract the number of decoded bits from the variable.
Value: the value that the code will decode to.
Subtract (remove) the decoded bits from the variable depending on the code. For example, if the code has 9 bits, we will remove 9 bits from the MSB and keep the remaining 7 bits.
Read the next 9 bits from the file, concatenating with the undecoded 7 bits.
Repeat the process.
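The procedure above can be sketched as follows. This is a Python illustration under a few assumptions not stated in the answer: the decoder is told how many symbols to decode (`count`), bits are stored most significant first, and the whole input is held in memory as one integer instead of being refilled from the file. The names `build_table` and `decode_prefix` are made up for this example.

```python
from bisect import bisect_right

MAX_BITS = 16  # longest code length supported by this sketch


def build_table(codes):
    """codes maps symbol -> bit string such as '110'.
    Returns a sorted list of (key, key_length, value) entries where key is
    the code shifted so that its top bit is the most significant of 16."""
    table = []
    for symbol, code in codes.items():
        key = int(code, 2) << (MAX_BITS - len(code))
        table.append((key, len(code), symbol))
    table.sort()
    return table


def decode_prefix(data, count, table):
    """Decode `count` symbols from `data` using a prefix-code table."""
    keys = [key for key, _, _ in table]
    data_bits = len(data) * 8
    # Append 16 zero bits so the window can always be filled.
    stream = int.from_bytes(data, "big") << MAX_BITS
    pos = 0
    out = []
    for _ in range(count):
        if pos >= data_bits:
            raise ValueError("ran out of input")
        window = (stream >> (data_bits - pos)) & 0xFFFF
        # Binary search: the matching code is the largest key <= window.
        i = bisect_right(keys, window) - 1
        if i < 0:
            raise ValueError("invalid code")
        key, length, symbol = table[i]
        if (window ^ key) >> (MAX_BITS - length):
            raise ValueError("invalid code")
        out.append(symbol)
        pos += length  # drop the decoded bits from the window
    return out


table = build_table({"a": "0", "b": "10", "c": "110", "d": "111"})
# Bits 0 10 110 111 0, padded with zeros to two bytes.
print(decode_prefix(bytes([0x5B, 0x80]), 5, table))
```

Because the zero padding at the end can itself look like a valid code, the encoder has to store the symbol count (as assumed here) or reserve an explicit end-of-data symbol.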

Is there a name for this compression algorithm?

Say you have a four byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3, 4 or 5 byte result:
Note that in the description below, the bits are one-based and go from most significant to least significant, i.e., the first bit refers to the most significant bit, the second bit to the next most significant bit, etc.
If n<128, you encode it as a single byte with the first bit set to zero.
If n>=128 and n<16,384, you use a two byte integer. You set the first bit to one and the second bit to zero. Then you use the remaining 14 bits to encode the number n.
If n>=16,384 and n<2,097,152, you use a three byte integer. You set the first bit to one, the second bit to one, and the third bit to zero. You use the remaining 21 bits to encode n.
If n>=2,097,152 and n<268,435,456, you use a four byte integer. You set the first three bits to one and the fourth bit to zero. You use the remaining 28 bits to encode n.
If n>=268,435,456 and n<4,294,967,296, you use a five byte integer. You set the first four bits to one and use the following 32 bits to set the exact value of n, as a four byte integer. The remainder of the bits is unused.
Is there a name for this algorithm?
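For concreteness, the scheme described above could be sketched like this in Python. The five byte case is one possible reading of the description: the first byte is 0xF0 (four one bits, the other four bits unused) followed by n as a four byte big-endian integer. The function names are made up for this example.

```python
def encode_prefixed(n):
    """Encode 0 <= n < 2**32 with a unary length prefix in the leading bits."""
    if not 0 <= n < 1 << 32:
        raise ValueError("n must fit in 32 bits")
    if n < 1 << 7:
        return bytes([n])                        # 0xxxxxxx
    if n < 1 << 14:
        return (0x8000 | n).to_bytes(2, "big")       # 10 + 14 bits
    if n < 1 << 21:
        return (0xC00000 | n).to_bytes(3, "big")     # 110 + 21 bits
    if n < 1 << 28:
        return (0xE0000000 | n).to_bytes(4, "big")   # 1110 + 28 bits
    return b"\xF0" + n.to_bytes(4, "big")            # 1111 + four byte value


def decode_prefixed(data):
    """Return (value, bytes_consumed) for one value at the start of data."""
    first = data[0]
    if first < 0x80:
        return first, 1
    if first < 0xC0:
        return int.from_bytes(data[:2], "big") & 0x3FFF, 2
    if first < 0xE0:
        return int.from_bytes(data[:3], "big") & 0x1FFFFF, 3
    if first < 0xF0:
        return int.from_bytes(data[:4], "big") & 0x0FFFFFFF, 4
    return int.from_bytes(data[1:5], "big"), 5


print(encode_prefixed(127).hex())    # 7f
print(encode_prefixed(128).hex())    # 8080
print(encode_prefixed(16384).hex())  # c04000
```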
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
It sounds very similar to Dlugosz' Variable-Length Integer Encoding.
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or trailing byte, therefore a sequence can be read starting in the middle. With your scheme a missing lead byte will make the rest of the file completely unreadable if a series of such values are stored. And reading such a sequence must start at the beginning, rather than an arbitrary location.
Varint
Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
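For comparison with the scheme in the question, here is a minimal Python sketch of a base-128 varint in the style used by Protocol Buffers: the least significant 7-bit group comes first, and the high bit of each byte means "more bytes follow". It handles only non-negative integers, and the function names are made up for this example.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("only non-negative integers are handled here")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set: more to come
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


print(encode_varint(300).hex())  # ac02
```

Unlike the scheme in the question, the length is not known from the first byte alone; the decoder has to inspect each byte's high bit in turn.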
related ways of compressing integers
In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards, such as Elias gamma coding.
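As a small illustration of the "unary first, binary afterwards" layout, here is an Elias gamma sketch in Python. It works on strings of '0' and '1' for readability instead of packed bytes, and the function names are made up for this example.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    binary = bin(n)[2:]  # plain binary, always starting with '1'
    # Unary part: one '0' for each binary digit after the leading '1'.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits, pos=0):
    """Return (value, next_position) for the code starting at bits[pos]."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


print(elias_gamma_encode(1))   # 1
print(elias_gamma_encode(5))   # 00101
print(elias_gamma_encode(10))  # 0001010
```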
I suspect this code is one of the family of "Start/Stop Codes" as described in:
Steven Pigeon, "Start/Stop Codes", Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.