Fixed-length compression algorithm?

I'm trying to find a compression algorithm which I can use to encode a blob using only 16 fixed-length symbols (0b0000 - 0b1111).
Without any compression, I could use those 16 symbols to encode their respective bit values (e.g. symbol 5 (0b0101) encodes the bits 0101), so if my blob is 100 bits long I would need 25 symbols to represent it - but doing so provides no compression.
I think what I need is something like a reverse Huffman code (in the sense that the code is fixed-length, but it represents variable-length output).
Any ideas?
I only need to do this for one specific blob that's about 2 KB, so it doesn't need to be super-efficient.

If I am understanding your question correctly, use any standard compressor to compress your data, and then simply encode each byte of the result as two of your symbols.
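For illustration, here is a minimal C++ sketch of that nibble-packing step; the compressor itself is left out, and the function names are made up for this example:

#include <cstdint>
#include <string>
#include <vector>

// Split each byte of an already-compressed buffer into two 4-bit symbols
// (high nibble first). Each symbol is a value 0..15, i.e. 0b0000..0b1111.
std::vector<uint8_t> to_symbols(const std::string &compressed)
{
    std::vector<uint8_t> symbols;
    symbols.reserve(compressed.size() * 2);
    for (unsigned char b : compressed) {
        symbols.push_back(b >> 4);      // high nibble
        symbols.push_back(b & 0x0F);    // low nibble
    }
    return symbols;
}

// Pack pairs of 4-bit symbols back into bytes before decompressing.
std::string from_symbols(const std::vector<uint8_t> &symbols)
{
    std::string bytes;
    for (size_t i = 0; i + 1 < symbols.size(); i += 2)
        bytes.push_back(static_cast<char>((symbols[i] << 4) | symbols[i + 1]));
    return bytes;
}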


Best compression algo for random string

I have a string like the one below
ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a
which is around 2 Kbytes. I tried to compress it but I am not getting a good compression ratio:
with gz I got only a 400-byte reduction and with deflate I got a 450-byte reduction.
Is there a better algorithm that achieves more compression, at least more than a 50% reduction?
By definition, you cannot compress random data, because it will not contain any structure you can represent or describe in a more efficient way, using fewer bits.
If that is possible, the data contains a structure and is no longer random.
A common counter-argument is that, given enough trials, even an all-zero string can be generated by an RNG, but the devil is in the details: it is all about the odds!
Even in a tiny 2 KB space there are 2^(2048*8) possible strings if the data is generated by a true RNG or a robust PRNG seeded with a reasonable amount of noise, and the vast majority of those strings will not contain any reasonable amount of "order" you can compress.
The fact that you are obtaining a 400 B / 450 B reduction on 2 KB is a strong hint that the string you are looking at is not really random, just non-human-readable or "random-looking".
The GZ format is based on the Deflate compression algorithm, so it is not clear why the two figures differ - Deflate accepts various parameters for fine-tuning compression at the expense of speed, so different settings can explain the different results.
To get better compression on random-looking (but not really random!) data you can try LZMA2 (7-Zip) or, even better, ZPAQ (http://mattmahoney.net/dc/zpaq.html).
I know that this is much later than the OP, but if you look at HOW the data is being represented, then yes, it is going to be hard to find repetition in it as a string. However, with the example
"ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a" as given,
how else COULD this information be represented? These all look to be hex couplets, for example
0xff 0x88 0x70 and so on. If this were stored as raw bytes, you would immediately halve the size, since each pair of hex characters fits into a single byte, whereas as text each character occupies a whole byte by itself.
If we wanted to get very clever, we could look at some math to map this data onto more easily compressible data - of course, this would only be beneficial for very large data, as the encoding overhead on small amounts of data would likely make it larger.
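As an illustration only, a small C++ sketch of packing a hex string like the one above into raw bytes before handing it to a compressor; the helper name is hypothetical:

#include <cstdint>
#include <string>
#include <vector>

// Convert a hex string such as "ff8870fd..." into raw bytes: every two hex
// characters become one byte, halving the size before any compressor runs.
std::vector<uint8_t> hex_to_bytes(const std::string &hex)
{
    std::vector<uint8_t> out;
    out.reserve(hex.size() / 2);
    for (size_t i = 0; i + 1 < hex.size(); i += 2) {
        // Parse one two-character couplet as a base-16 number.
        out.push_back(static_cast<uint8_t>(
            std::stoi(hex.substr(i, 2), nullptr, 16)));
    }
    return out;
}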

Interleaving bzip2 and non-bzip2 data

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: is there a magic number I can pick for the second block type which would be very unlikely (or better, impossible) to be found inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for the second block type, I simply try to process the data as that block type, and if that processing fails I assume I am actually inside a compressed bzip2 block. But I'd like to know whether there are bytes that would not be found in a bzip2 block, or would at least be unlikely to be found.
No. bzip2-compressed data can contain any pair of bytes, essentially all with equal probability. All you could do is define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data; for an n-byte signature, the chance of a match at any given position in effectively random data is about 256^-n, so an eight-byte signature would be matched by accident only about once in 2^64 positions. But it still could happen.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.
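If it helps, here is a rough C++ sketch of that backwards search, assuming the compressed stream is already in memory and that bits are packed most-significant-bit first, as bzip2 does; the function name is made up for this example:

#include <cstdint>
#include <vector>

// Return the bit offset of the 48-bit bzip2 end-of-stream marker
// 0x177245385090 closest to the end of the buffer, or -1 if not found.
// The marker may start at any bit position, so every offset is tested.
long long find_bzip2_eos(const std::vector<uint8_t> &buf)
{
    const uint64_t marker = 0x177245385090ULL;   // sqrt(pi) digits
    const size_t total_bits = buf.size() * 8;
    if (total_bits < 48)
        return -1;
    auto bit_at = [&](size_t pos) -> uint64_t {
        return (buf[pos / 8] >> (7 - pos % 8)) & 1; // MSB-first bit order
    };
    // Scan from the last possible start position back toward the beginning.
    for (size_t start = total_bits - 48 + 1; start-- > 0; ) {
        uint64_t window = 0;
        for (size_t i = 0; i < 48; ++i)
            window = (window << 1) | bit_at(start + i);
        if (window == marker)
            return static_cast<long long>(start);
    }
    return -1;
}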

Types bit length and architecture specific implementations

I'm doing stuff in C++, but lately I've found that there are slight differences regarding how much data a type can accommodate, and the byte order is also an issue.
Suppose I have a binary file in which I've encoded shorts that are 2 bytes in size. The file looks like:
FA C8 - data segment 1
BA 32 - data segment 2
53 56 - data segment 3
Now all is well up to this point. Now I want to read this data. There are 2 problems:
1. What data type should I choose to store these values?
2. How do I deal with the endianness of the target architecture?
The first problem is actually related to the second, because here I will have to do bit shifts in order to swap the order of the bytes.
I know that I could read the file byte by byte and combine every two bytes. But is there an approach that could ease that pain?
I'm sorry if I'm being ambiguous. The problem is hard to explain. I hope you get a glimpse of what I'm talking about. I just want to store this data internally.
So I would appreciate some advice, or if you could share some of your experience on this topic.
If you use big endian in the file that stores the data, then you can just rely on htons(), htonl(), ntohs(), ntohl() to convert the integers to the right endianness before saving or after reading.
There is no easy way to do this.
Rather than doing that yourself, you might want to look into serialization libraries (for example Protobuf or boost serialization), they'll take care of a lot of that for you.
If you want to do it yourself, use fixed-width types (uint32_t and the like from <cstdint>), and endian conversion functions as appropriate. Either have a "prefix" in your file that determines what endianness it contains (a BOM/Byte Order Mark), or always store in either big or little endian, and systematically convert.
Be extra careful if you need to serialize strings, they have encoding problems of their own too.
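For the 2-byte values in the question, a minimal sketch of that approach, assuming the file always stores big-endian as suggested above (the helper name is hypothetical):

#include <cstdint>
#include <fstream>
#include <vector>

// Read a file of 2-byte big-endian values into uint16_t, assembling each
// value from individual bytes so the host's endianness never matters.
std::vector<uint16_t> read_be16(const char *path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<uint16_t> values;
    char raw[2];
    while (in.read(raw, 2)) {
        uint16_t hi = static_cast<unsigned char>(raw[0]);
        uint16_t lo = static_cast<unsigned char>(raw[1]);
        values.push_back(static_cast<uint16_t>((hi << 8) | lo));
    }
    return values;
}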

Writing to a text file, binary vs ascii

So I am having the hardest time trying to understand this concept. I have a program that reads a text file and writes it to another file, replacing the most common words with unsigned chars. But what I cannot for the life of me understand is how to then tell the difference between the two.
If I write to the new file either the original char I read in or an unsigned char value in the range 1-255, how do I tell the difference when I reverse the process to recover the original file contents?
When you write a file as binary, a number such as "1253553" is written using 2 or 4 bytes (depending on the size of the int on the platform). So, in a binary file, you will see a sequence of 2 or 4 bytes representing that number. For chars, it should not make a difference, as each char is represented as one byte.
Usually, you have to have some well known and obvious way to determine the format of your file.
One way to do this is to create your own file extension. You could naively expect that any file with that extension is in your compressed format, but it's actually quite likely other files out there have the same extension (e.g., ".dat" is probably a bad choice). So, you'll want to take further steps, like having the first few bytes of the file be something that is unlikely to be there in any other file (some "magic numbers"). Let's use two bytes, and let's simply choose 0xAB 0xCD as those two bytes.
So, when your program is presented with a file that has the proper extension, open it and read the first two bytes. If they're 0xAB and 0xCD, you can assume you're reading your special format.
This isn't a very strong way of accomplishing this task, but it is one way of doing it. You could get more extravagant if you like.
For more information, you might want to read the Wikipedia page on the subject. It's a start.
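A small sketch of how that check might look in C++, using the example bytes 0xAB 0xCD from above; the names are illustrative, not a fixed API:

#include <fstream>

// Hypothetical magic bytes for the compressed format described above.
const unsigned char MAGIC[2] = { 0xAB, 0xCD };

// Write the two magic bytes before any payload.
void write_header(std::ofstream &out)
{
    out.write(reinterpret_cast<const char *>(MAGIC), 2);
}

// Return true if the stream starts with the expected magic bytes.
bool check_header(std::ifstream &in)
{
    unsigned char buf[2] = { 0, 0 };
    in.read(reinterpret_cast<char *>(buf), 2);
    return in && buf[0] == MAGIC[0] && buf[1] == MAGIC[1];
}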

Error correcting codes

I need to use an error correcting technique on short messages (between 100 and 200 bits). Space available to add the redundant bits is constrained to 20-50%.
I will have to implement the coding and decoding in C/C++. So it needs to be either open sourced or sufficiently easy to program. (I have had some experience in the past with decoding algorithms - they are dreadful!)
Can anyone advise of a suitable error code to use (with relevant parameters) ?
Take a look at Reed Solomon error correction.
Sample implementation in C++ is available here.
For a different option look here - see item #11
EDIT: If you want a commercial library - http://www.schifra.com/faq.html
Reed-Solomon encoders are described in the form RS(CAPACITY,PAYLOAD). The capacity is always 2^SYMBOL-1, where SYMBOL is the number of bits in each Reed-Solomon symbol. Quite often, this SYMBOL size is 8 bits (a normal byte). It can typically be anything from 3 to 16 bits. For an 8-bit symbol, the Reed-Solomon encoder will be named RS(255,PAYLOAD).
The PAYLOAD is the number of non-parity symbols. If you want 4 parity symbols, you would specify RS(255,251); in general, PARITY parity symbols let the decoder correct up to PARITY/2 erroneous symbols, so RS(255,251) can fix up to 2 bad symbols per codeword.
To effectively correct errors in your data block, you must first package the data as symbols (groups of bits, quite often just 8-bit bytes). Your goal is to try to arrange (if possible) for any errors to be clustered into the smallest number of symbols possible.
For example, if an error occurs on average every 8 bits, then an 8-bit symbol will not be appropriate; pretty much every symbol will have an error! You might go for 4-bit symbols and use an RS(15,11) codec -- for up to 11 4-bit symbols at a time, producing 4 parity symbols per block. The smaller the symbol size, the lower the CAPACITY (e.g. for a SYMBOL size of 4 bits, 2^4-1 == 15 symbol CAPACITY).
But typically, you would use 8-bit symbols. If you have a more realistic error rate of, say, 10% of your 8-bit symbols being erroneous, then you might use an RS(255,205) -- 50 parity symbols per 255 symbol Reed-Solomon "codeword", with a maximum PAYLOAD of 205 bytes. This gives us ~25% parity, allowing us to correct a codeword containing up to ~12.5% errors.
Using the c++/ezpwd/rs Reed-Solomon API from https://github.com/pjkundert/ezpwd-reed-solomon, you would specify this as:
#include <ezpwd/rs>
...
ezpwd::RS<255,205> rscodec;
Put your data in a std::string (it can handle raw 8-bit binary data just fine) or a std::vector and call the API, adding the 50 symbols of parity:
std::string data;
// ... fill data with a fixed size block, up to 205 bytes
rscodec.encode( data );
Send your data, and later on, after you receive the data+parity, recover the original data (and discard the 50 parity symbols):
int corrected = rscodec.decode( data );
If the data could be recovered, the number of symbols corrected will be returned, or -1 if the Reed-Solomon codeword contained too many errors.
Enjoy!