Why does base64-encoded data compress so poorly? - compression

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:
Original file: 429,7 MiB
compress via xz -9:
13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
base64 it and compress via xz -9:
26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
base64 the original compressed xz file:
17,8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well, it is 2 times larger than the unencoded compressed file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible algorithm, so why does it affect compression so much? (I tried with gzip as well, with similar results.)
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files. I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and that it would thus be similarly compressible.

Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
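To make the run-length case concrete, here is a minimal Python sketch. The helper name run_length_encode is made up for this illustration. It counts runs of identical characters before and after Base64 encoding the example string:

```python
import base64
from itertools import groupby


def run_length_encode(text):
    """Return a list of (character, run_length) pairs (hypothetical helper)."""
    return [(char, len(list(group))) for char, group in groupby(text)]


raw = "XXXXYYYYXXXXYYYY"
encoded = base64.b64encode(raw.encode("ascii")).decode("ascii")

raw_runs = run_length_encode(raw)
encoded_runs = run_length_encode(encoded)

print(raw_runs)           # four runs of length 4
print(encoded)            # the Base64 form of the same string
print(len(encoded_runs))  # far more runs, almost all of length 1
```

The raw string collapses into four runs, while the encoded string has almost no repeated adjacent characters left for a run-length coder to exploit.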
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
6-bit repacking : 00010110 00000101 00100001 00011000
As decimal : 22 5 33 24
Base64 characters: 'W' 'F' 'h' 'Y'
Output bytes : 0x57 0x46 0x68 0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
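The repacking step can be sketched in a few lines of Python. The function name repack_3_bytes is an assumption for this example. The standard base64 module is used only to cross-check the result:

```python
import base64

BASE64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def repack_3_bytes(a, b, c):
    """Split three 8-bit values into four 6-bit values (hypothetical helper)."""
    word = (a << 16) | (b << 8) | c  # 24-bit word: aaaaaaaa bbbbbbbb cccccccc
    return [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]


sextets = repack_3_bytes(0x58, 0x58, 0x58)
chars = "".join(BASE64_ALPHABET[s] for s in sextets)

print(sextets)                    # the four 6-bit values
print(chars)                      # the corresponding Base64 characters
print(base64.b64encode(b"XXX"))   # cross-check with the standard library
```

Three identical input bytes become four different output characters, which is why the repetition is no longer visible at byte granularity.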
Whatever compression method is used, this usually has a severe impact on the algorithm's performance. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
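The effect of the ordering can be checked with a small experiment. This is only a sketch under stated assumptions: the input is synthetic (words drawn from a made-up vocabulary with a fixed seed), and zlib stands in for xz, so the exact ratios will differ from the figures in the question.

```python
import base64
import random
import zlib

# Synthetic, repetitive input: 20,000 words from a small made-up vocabulary.
rng = random.Random(0)
vocabulary = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
raw = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode("ascii")

encoded = base64.b64encode(raw)

compressed_raw = zlib.compress(raw, 9)
compressed_encoded = zlib.compress(encoded, 9)         # base64, then compress
encoded_compressed = base64.b64encode(compressed_raw)  # compress, then base64

print("raw:                 ", len(raw))
print("compressed:          ", len(compressed_raw))
print("base64 then compress:", len(compressed_encoded))
print("compress then base64:", len(encoded_compressed))
```

Compress-then-base64 costs a fixed 4/3 expansion of the compressed size, whereas base64-then-compress leaves the compressor working on data whose patterns have been shifted across byte boundaries.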
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz is using -- is working on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57 0x46 0x68 0x59
As binary : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.

Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what base-64 encoding does: It replaces a 3 byte (24 bit) word with a 4 byte word, using only 64 out of 256 possible values. This gives you the x1.33 growth.
It is fairly clear that this growth must push some substrings past the maximum substring size of the encoder. They are then no longer compressed as a single substring, but as two separate substrings.
Since you get a lot of compression (97%), very long input substrings are apparently being compressed as a whole. This means that many of those substrings will be expanded by Base64 past the maximum length the encoder can deal with.

It's not Base64; it's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html

Related

Base64 algorithm with multiple result of a factor in C++

I have the following problem: I need to convert some content to Base64 to prevent problems with certain characters, and after this conversion I need to encrypt the data with an AES algorithm with a key length of 16. The problem occurs when the Base64 algorithm returns a result whose size is not a multiple of 16, which causes problems on encryption, even though the size of the original content is a multiple of 16. How can I avoid this problem?
Pad the result of the Base64 encoding to be a multiple of 16.
A Base64 encoder encodes your 8-bit data using 6 bits per character, so for every 3 bytes of input data you get 4 bytes of output data (plus padding to a 4-byte boundary). So the data passed to the AES algorithm may have a length which is not a multiple of 16.
Please check documentation for your AES library - chances are that it can handle this internally if you call special function for last chunk of data (e.g. EVP_EncryptFinal_ex in OpenSSL). Another solution is to pad data to 16-byte boundary in your code before encrypting it.
Most AES implementations support padding such as PKCS#7 (née PKCS#5). This will add the required padding bytes on encryption and remove them on decryption.
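For illustration, PKCS#7 padding itself is simple enough to sketch by hand. The function names pkcs7_pad and pkcs7_unpad below are made up for this example, and Python is used rather than C++ for brevity; in real code you would normally let the AES library apply the padding.

```python
import base64


def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)  # always 1..block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes, block_size: int = 16) -> bytes:
    """Validate and strip PKCS#7 padding."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("invalid padded length")
    pad_len = padded[-1]
    if not 1 <= pad_len <= block_size or padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding")
    return padded[:-pad_len]


encoded = base64.b64encode(bytes(16))  # 16 input bytes become 24 Base64 characters
padded = pkcs7_pad(encoded)            # 24 bytes are padded up to 32

print(len(encoded), len(padded))
```

Note that input which is already a multiple of the block size still receives a full extra block of padding, so that the padding can always be removed unambiguously.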
AES is a data based function and as such accepts any bytes so it's not necessary to Base64 encode prior to encryption. Encrypted data is just bytes, not characters, so the output may need to be encoded with a method such as Base64 or hexadecimal based on the usage such as the need for printable characters.

Outputting Huffman codes to file

I have a program that reads a file and saves the frequency of each character. It then constructs a huffman tree based on each character's frequency and then outputs to a file the huffman codes for the tree.
So an input like "Hello World" would output this sequence to a file:
01010101 0010 010 010 01010 0101010 000 01010 00101 010 0001
This makes sense because the most frequent characters have the shortest codes. The issue is, this increases the file size ten-fold. I realized the reason is that each 1 and 0 is being represented in memory as its own character, so each of them gets expanded out to a byte of data.
I was thinking what I could do is convert each code (E.G. "010") to a character and save that to file - but that still would pad the code to be a byte long (Or mess it up if the code is longer than a byte).
How do I go about this? I can give code snippets if needed - I'm basically saving each code into a string so that's why the file's coming out so big (It's outputting each "bit" as a byte). If I were to convert the code to a long for example, then a code like 00010 would be represented as 2 and a code like 010 would also be represented as 2.
You basically have to do it a byte (or a word) at a time. Maintain a byte which you fill with bits, and a record of how many bits have been filled in so far. When you get to 8, write the byte and start over with an empty one.
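A minimal sketch of that approach, written in Python for brevity (the class name BitWriter and its methods are made up for this example). Codes are given as strings of '0' and '1', as in the question:

```python
class BitWriter:
    """Accumulates bits into bytes, most significant bit first (hypothetical helper)."""

    def __init__(self):
        self.output = bytearray()
        self.current = 0  # the byte being filled
        self.filled = 0   # how many bits of it are in use

    def write_code(self, code):
        for bit in code:
            self.current = (self.current << 1) | int(bit == "1")
            self.filled += 1
            if self.filled == 8:  # byte is full: emit it and start over
                self.output.append(self.current)
                self.current = 0
                self.filled = 0

    def finish(self):
        """Pad the last partial byte with zero bits; return (bytes, padding_bits)."""
        padding = (8 - self.filled) % 8
        if self.filled:
            self.output.append(self.current << padding)
            self.current = 0
            self.filled = 0
        return bytes(self.output), padding


writer = BitWriter()
for code in ["010", "0001", "1", "11"]:
    writer.write_code(code)
packed, padding_bits = writer.finish()

print(packed.hex(), padding_bits)
```

The decoder needs to know how many bits are valid (for example by storing the padding count or the number of symbols in a header), otherwise the zero padding in the last byte could be misread as codes.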

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code is composed of symbols with a value between 0 and 2^m - 1, where m is the exponent of the Galois field GF(2^m) used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get an RS code composed of symbols between 0 and 255 (each symbol fits in exactly one byte, as in the usual binary encoding). In GF(2^16), you will get symbols between 0 and 65535 (each symbol fits in two bytes).
Other than the range of values of each character of a RS code, all the rest can be basically considered random from an external POV (it's not if you have the generator polynomial and Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of a RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer, I am the author). By default, each ecc block is separated by a md5 digest, so that you can clearly separate them. The original data is not stored, but you can easily do that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and just do a slight modification to print or save into a file the original data along the ecc blocks.

issues using stringstream to handle binary file

I'm working with a binary file that I need to grab its useful contents from. The structure is:
Based on a quick look at the file, you don't have an "unknown amt of nulls" anywhere. The format appears to be:
N Bytes: number of animals, integer as text delimited by '\n'
24 Bytes per animal:
16 Bytes: name of animal padded with 0
4 Bytes: some 32 bit number (little endian)
4 Bytes: another 32 bit number (little endian)
You shouldn't be reading this as a text file, but instead as a raw binary file. There's absolutely no need for a stringstream, you can simply parse the number of animals by reading in one byte at a time and adding to the previous value * 10 until you reach '\n'.
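The layout described above can be parsed along these lines. This is only a sketch in Python: the function name parse_animals is made up, the two 32-bit numbers are assumed to be unsigned, and int() is used for the count instead of the digit-by-digit loop.

```python
import struct


def parse_animals(data: bytes):
    """Parse the count line, then fixed-size 24-byte records (hypothetical helper)."""
    newline = data.index(b"\n")
    count = int(data[:newline].decode("ascii"))  # number of animals, as text

    records = []
    offset = newline + 1
    for _ in range(count):
        # '<' = little endian, no alignment; 16s = 16-byte name; I = unsigned 32-bit
        name, first, second = struct.unpack_from("<16sII", data, offset)
        records.append((name.rstrip(b"\x00").decode("ascii"), first, second))
        offset += 24
    return records


# Synthetic sample in the same layout, since the real file is not available here.
sample = (
    b"2\n"
    + b"cat".ljust(16, b"\x00") + struct.pack("<II", 1, 2)
    + b"dog".ljust(16, b"\x00") + struct.pack("<II", 3, 4)
)
animals = parse_animals(sample)
print(animals)
```

When reading from disk, the file should be opened in binary mode so that no newline or encoding translation touches the record bytes.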

Is it possible to losslessly compress 32 hexadecimal numbers into 30?

For example is it possible to compress
002e3483bbdc11ddaae0754822a559f6 into something that just takes at most 30 characters.
Yes, you can convert it to a base-32 number. The greatest 32-character hex number, i.e. ffffffffffffffffffffffffffffffff, is equivalent to 7VVVVVVVVVVVVVVVVVVVVVVVVV in base-32, which has only 26 characters. Also note that in base-32 you will end up with a string containing only these characters: 0123456789ABCDEFGHIJKLMNOPQRSTUV
For example: 002e3483bbdc11ddaae0754822a559f6 is 5OQ87EUS27EQLO3L90HAAMFM in base-32. The conversion must be done with exact integer arithmetic; with floating-point arithmetic the low-order digits are lost and the result cannot be converted back.
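A sketch of the base-32 conversion using Python's arbitrary-precision integers, so that no digits are lost. The helper name to_base32 is made up for this example; the reverse direction uses the built-in int(), which accepts this digit set for base 32.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def to_base32(value: int) -> str:
    """Convert a non-negative integer to base-32 text (hypothetical helper)."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 32)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


hex_string = "002e3483bbdc11ddaae0754822a559f6"
encoded = to_base32(int(hex_string, 16))

# Convert back, restoring the leading zeros to get 32 hex characters again.
decoded = format(int(encoded, 32), "032x")

print(encoded, len(encoded))
print(decoded == hex_string)
```

Any 128-bit value needs at most 26 base-32 digits, since 26 digits of 5 bits each cover 130 bits.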
If your question is whether you can compress 32 hex digits into 30 hex digits, then this is impossible for all inputs. If it were possible, multiple 32-length hex strings would have to compress to the same 30-length hex string, so you wouldn't know which one it was (the pigeonhole principle).
A less proof-y proof - you'd be able to repeatedly invoke the process on any size file to get down to a single 30-length hex string, which doesn't make a whole lot of sense.
Here is an article I just found. Wikipedia says something similar.
Convert hex to binary then use something like base64 or any other encoding scheme, see Binary-to-text encoding (Wikipedia). This has the advantage of not requiring 128bit arithmetic like the suggested base32 solution.
Conversion to base64 and back:
$ echo 002e3483bbdc11ddaae0754822a559f6 |xxd -r -ps |openssl base64 -e |tee >(openssl base64 -d |xxd -ps)
AC40g7vcEd2q4HVIIqVZ9g==
002e3483bbdc11ddaae0754822a559f6
Cut the line starting from |tee to get only the encoded output. In most programming languages you will have core or external libraries to do hex to binary conversion and base64 encoding.
NB: Conversion to base32 would also be possible, but the base32 binary-to-text encoding pads its output to a multiple of 8 characters, so you would have to trim the pads (=) and then re-add them on decode.
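The same conversion can be sketched with Python's standard library, including the trim and re-pad step for base32 (the variable names are arbitrary):

```python
import base64

hex_string = "002e3483bbdc11ddaae0754822a559f6"
raw = bytes.fromhex(hex_string)  # 16 bytes

b64 = base64.b64encode(raw).decode("ascii")
b32 = base64.b32encode(raw).decode("ascii")

b64_trimmed = b64.rstrip("=")
b32_trimmed = b32.rstrip("=")

print(b64)                                  # AC40g7vcEd2q4HVIIqVZ9g==
print(len(b64_trimmed), len(b32_trimmed))   # 22 26

# Re-add the padding before decoding: base64 pads to 4 characters, base32 to 8.
b64_restored = b64_trimmed + "=" * (-len(b64_trimmed) % 4)
b32_restored = b32_trimmed + "=" * (-len(b32_trimmed) % 8)

print(base64.b64decode(b64_restored).hex() == hex_string)
print(base64.b32decode(b32_restored).hex() == hex_string)
```

With the padding trimmed, Base64 needs 22 characters and base32 needs 26, both under the 30-character limit asked for.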