Is there a name for this compression algorithm? - compression

Say you have a four byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3 or 4 byte result:
Note that in the description below (the bits are one-based and go from most significant to least significant), i.e., the first bit refers to most significant bit, the second bit to the next most significant bit, etc...)
If n<128, you encode it as a
single byte with the first bit set
to zero
If n>=128 and n<16,384 ,
you use a two byte integer. You set
the first bit to one, to indicate
and the second bit to zero. Then you
use the remaining 14 bits to encode
the number n.
If n>16,384 and
n<2,097,152 , you use a three byte
integer. You set the first bit to
one, the second bit to one, and the
third bit to zero. You use the
remaining 21 bits, to encode n.
If n>2,097,152 and n<268,435,456 ,
you use a four byte integer. You set
the first three bits to one and the
fourth bit to zero. You use the
remaining 28 bits to encode n.
If n>=268,435,456 and n<4,294,967,296,
you use a five byte integer. You set
the first four bits to one and use
the following 32-bits to set the
exact value of n, as a four byte
integer. The remainder of the bits is unused.
Is there a name for this algorithm?

This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.

it sounds very similar to Dlugosz' Variable-Length Integer Encoding

Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.

Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or trailing byte, therefore a sequence can be read starting in the middle. With your scheme a missing lead byte will make the rest of the file completely unreadable if a series of such values are stored. And reading such a sequence must start at the beginning, rather than an arbitrary location.

Varint
Using the high bit of each byte to indicate "continue" or "stop", and the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary that encodes the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
related ways of compressing integers
In summary: this code represents an integer in 2 parts:
A first part in a unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code "threads" the unary code with the binary code, but other, similar codes pack the complete unary code first, and then the binary code afterwards,
such as Elias gamma coding.
I suspect this code is one of the family of "Start/Stop Codes"
as described in:
Steven Pigeon — Start/Stop Codes — Procs. Data Compression Conference 2001, IEEE Computer Society Press, 2001.

Related

Can I be sure that 32 bytes of binary data read from a file is equal to 256 bits?

I want to read a file 32 bytes at a time using a C/C++ program, but I want to be sure that the data will be 256 bits. In essence I am worried about leading bits in the "bytes" that I read from the file being off ? Is that even a matter of concern ?
Example : If I have a number say 2 represented in binary as 10 . This would be sufficient for me as a human.
How is that different as far a computer is concerned if it's written as: 00000010 to represent a char value of 1 byte ??? Would the leading zeros affect the bit count ? Does that in turn affect operations like XOR ?
I've trouble understanding its effects ! Does that involve data loss ? I really do not know... !
Every help to clear my misunderstanding will be appreciated !!!
Every routine in the C standard library that reads from a file or stream reads in units of bytes. Each byte read is a fixed number of bits; what is read from a file does not vary due to leading zeros or lack thereof in a byte. Some routines return a single character (which is a byte). Some routines put data read into a buffer and return a count of bytes read. Some routines, such as scanf, return a count of the number of items successfully converted. (You generally would not use these routines to read a fixed number of bytes.)
The number of bits in a byte is set by the C implementation. It is reported in CHAR_BIT, defined in <limits.h>. It is eight in all common C implementations.
Although the number of bits per byte does not vary, the number of bytes read from a stream can vary. If a stream is opened as a text stream, characters may be “added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment” (C 2018 7.21.2 2). To avoid this, you should open a stream as a binary stream.
The CHAR_BIT macro (defined in climits) will tell you how many bits make up a byte in the execution environment. However I am not aware of any recent general-purpose hardware that uses bytes of other than 8 bits. Some specialized processors (such as for digital signal processing) may use other sizes. Also completely outdated equipment used a wide variety of sizes (the typical alternative to 8 bits being 9).
No
C++ allows a char to be any size a platform requires. However, the macro CHAR_BIT always has the number of bits in a char.
So, to find out the number of bits in 32 bytes, you would use the formula 32*CHAR_BIT.
C++17 has introduced the new type std::byte that is not a character type and is always CHAR_BIT bits, as explained in the SO question std::byte on odd platforms
In order to find the number of bytes needed to hold 256 bits, you have a problem, because CHAR_BIT isn't always a divisor of 256. So, you have to decide what you want and use a more complicated formula. For example, 1+(255+CHAR_BIT)/CHAR_BIT will give you the number of bytes needed to hold 256 contiguous bits.

How do I represent an LZW output in bytes?

I found an implementation of the LZW algorithm and I was wondering how can I represent its output, which is an int list, to a byte array.
I had tried with one byte but in case of long inputs the dictionary has more than 256 entries and thus I cannot convert.
Then I tried to add an extra byte to indicate how many bytes are used to store the values, but in this case I have to use 2 bytes for each value, which doesn't compress enough.
How can I optimize this?
As bits, not bytes. You just need a simple routine that writes an arbitrary number of bits to a stream of bytes. It simply keeps a one-byte buffer into which you put bits until you have eight bits. Then write than byte, clear the buffer, and start over. The process is reversed on the other side.
When you get to the end, just write the last byte buffer if not empty with the remainder of the bits set to zero.
You only need to figure out how many bits are required for each symbol at the current state of the compression. That same determination can be made on the other side when pulling bits from the stream.
In his 1984 article on LZW, T.A. Welch did not actually state how to "encode codes", but described mapping "strings of input characters into fixed-length codes", continuing "use of 12-bit codes is common". (Allows bijective mapping between three octets and two codes.)
The BSD compress(1) command didn't literally follow, but introduced a header, the interesting part being a specification of the maximum number if bits to use to encode an LZW output code, allowing decompressors to size decompression tables appropriately or fail early and in a controlled way. (But for the very first,) Codes were encoded with just the number of integral bits necessary, starting with 9.
An alternative would be to use Arithmetic Coding, especially if using a model different from every code is equally probable.

Saving a Huffman Tree compactly in C++

Let's say that I've encoded my Huffman tree in with the compressed file. So I have as an example file output:
001A1C01E01B1D
I'm having an issue saving this string to file bit-by-bit. I know that C++ can only output to file one byte at a time, so I'm having an issue storing this string in bytes. Is it possible to convert the first three bits to a char without the program padding to a byte? If it pads to a byte for the traversal codes then my tree (And the codes) will be completely messed up. If I were to chop this up one byte at a time, then what happens if the tree isn't exactly a multiple of 8? What happens if the compressed file's bit-length isn't exactly a multiple of 8?
Hopefully I've been clear enough.
The standard solution to this problem is padding. There are many possible padding schemes. Padding schemes pad up to an even number of bytes (i.e., a multiple of 8 bits). Additionally, they encode either the length of the message in bits, or the number of padding bits (from which the message length in bits can be determined by subtraction). The latter solution obviously results in slightly more efficient paddings.
Most simply, you can append the number of "unused" bits in the last byte as an additional byte value.
One level up, start by assuming that the number of padding bits fits in 3 bits. Define the last 3 bits of an encoded file to encode the number of padding bits. Now if the message uses up no more than 5 bits of the last byte, the padding can fit nicely in the same byte. If it is necessary to add a byte to contain the padding, the maximum gap is 5+2=7 (5 from the unused high bits of the extra byte, and 2 is the maximum possible space free in the last byte, otherwise the 3-bit padding value would've fit there). Since 0-7 is representable in 3 bits, this works (it doesn't work for 2 bits, since the maximum gap is larger and the range of representable values is smaller).
By the way, one of the main advantages of placing the padding information at the end of the file (rather than as a header at the beginning of the file) is that the compression functions can then operate on a stream without having to know its length in advance. Decompression can be stream-based as well, with careful handling of EOF signals.
Simply treat a sequence of n bytes as a sequence of 8n bits. Use the >> or <<, |, and & operators to assemble bytes from the sequence of variable-length bit codes.
The end of the stream is important to handle properly. You need an end of stream code so that the decoder knows to stop and not try to decode the final padding bits that complete the last byte.

compression zero bit sequences

I tried to find some library (C++) or algorithm which could compress array of bits with these properties:
There are seqences of zero bits and sequences of bits, which carry the information (1 or 0).
The sequences are usually 8-24 bits long.
I need a loseless compression which would take advantage of those zero bits.
How did I come to such sequences:
I serialize various variables into byte array. I do this quite often to create snapshots, so these variables usually don't change much. I want to use this fact for compression. I don't know the type of those variables, just byte length. So I take the bytes and create diff information with the previous snapshot using XOR.
If the variable changed just a bit, there will usually be many zero bits. That's the zero bit sequence. The rest of the bits carry the information, that's the information sequence.
For every variable, there will probably be 1 zero bit sequence and 1 information sequence.
EDIT:
So far I was considering these algorithms:
RLE - the information sequences would mess up the result
Some symbol coding (Huffman etc.) - the data probably won't share much "symbols", it's not a text and the sequences are short. The whole array will be usually around 1000 bytes long.
If the ~1000 byte sequence has a lot of zero bytes, then just use a standard byte-oriented compression algorithm, such as zlib. You will get compression.

Saving binary date into file in c++

My algoritm produces stream of 9bits and 17bits I need to find solution to store this data in file. but i can't just store 9 bits as int and 17bits as int_32.
For example if my algoritm produces 10x9bit and 5x17bits the outfile size need to be 22bytes.
Also one of the big problem to solve is that the out file can be very big and size of the file is unknown.
The only idea with I have now is to use bool *vector;
If you have to save dynamic bits, then you should probably save two values: The first being either the number of bits (if bits are consecutive from 0 to x), or a bitmask to say which bits are valid; The second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits and it consists of unknown number of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file, you need the lengths. If you got only two possible sizes, then it can be only a single bit. 0 means 9 bit, 1 means 17 bit.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9)+5*(1+17) = 190 bits ~ 24 bytes. The outstanding 2 bits need to be padded with 0's so that you align at byte boundary. The fact that you will go on reading the file as if there was another entity (because you said you don't know how long the file is) shouldn't be a problem because last such padding will be always less than 9 bits. Upon reaching end of file, you can throw away the last incomplete reading.
This approach indeed requires implementing a bit-level manipulation of the byte-level stream. Which means careful masking and logical operations. BASE64 is exactly that, only being simpler than you, consisting only of fixed 6-bit entities, stored in a textfile.