Base64 algorithm with multiple result of a factor in C++ - c++

I have the following problem: I need to convert some content to Base64 to avoid problems with certain characters, and after this conversion I need to encrypt the data with an AES algorithm using a key length of 16. The problem occurs when the Base64 algorithm returns a result whose size is not a multiple of 16, which causes problems during encryption, even though the size of the original content is a multiple of 16. How can I avoid this problem?

Pad the result of the Base64 encoding to be a multiple of 16.

A Base64 encoder represents your 8-bit data using 6 bits per character, so every 3 bytes of input produce 4 bytes of output (with the final group padded to a 4-byte boundary). The encoded length is therefore a multiple of 4, but not necessarily a multiple of the 16-byte AES block size, even when the input length is.
Check the documentation for your AES library: chances are it can handle this internally if you call a dedicated function for the last chunk of data (e.g. EVP_EncryptFinal_ex in OpenSSL). Another solution is to pad the data to a 16-byte boundary in your own code before encrypting it.

Most AES implementations support padding such as PKCS#7 (née PKCS#5). This adds the required padding bytes on encryption and removes them on decryption.
AES operates on raw bytes and accepts any byte values, so it's not necessary to Base64 encode prior to encryption. Encrypted data is just bytes, not characters, so it is the output that may need to be encoded with a method such as Base64 or hexadecimal, depending on the usage, for example when printable characters are required.
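To make the padding suggestion concrete, here is a minimal Python sketch of PKCS#7 padding written by hand. It is only an illustration: the helper names pkcs7_pad and pkcs7_unpad are made up for this example, and a real AES library will normally do this for you when its padding mode is enabled.

```python
import base64

BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append n bytes of value n so the length becomes a multiple of block_size."""
    n = block_size - len(data) % block_size  # 1..block_size, never 0
    return data + bytes([n]) * n


def pkcs7_unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Check and strip the padding added by pkcs7_pad."""
    if not padded or len(padded) % block_size:
        raise ValueError("length is not a positive multiple of the block size")
    n = padded[-1]
    if not 1 <= n <= block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid PKCS#7 padding")
    return padded[:-n]


original = bytes(range(32))           # 32 bytes: a multiple of 16
encoded = base64.b64encode(original)  # Base64 output: a multiple of 4 only
padded = pkcs7_pad(encoded)           # now safe to hand to a block cipher

if __name__ == "__main__":
    print(len(original), len(encoded), len(padded))
```

Note that input which is already block aligned still receives a full extra block of padding, which is what lets the decrypting side remove it unambiguously.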

Related

Why does base64-encoded data compress so poorly?

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:
Original file: 429,7 MiB
compress via xz -9:
13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
base64 it and compress via xz -9:
26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
base64 the original compressed xz file:
17,8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well, it is 2 times larger than the unencoded compressed file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible algorithm, why does it affect compression so much? (I tried with gzip as well, with similar results).
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files, and I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and that it would thus be similarly compressible.
Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes    : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes      : 0x58     0x58     0x58
As binary        : 01011000 01011000 01011000
6-bit repacking  : 00010110 00000101 00100001 00011000
As decimal       : 22       5        33       24
Base64 characters: 'W'      'F'      'h'      'Y'
Output bytes     : 0x57     0x46     0x68     0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
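The repacking walked through above can be reproduced with a few lines of Python. This is a small sketch for illustration only (the function names repack and encode_group are invented here); it handles complete 3-byte groups and leaves out the "=" padding case.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def repack(three_bytes: bytes) -> list:
    """Split 3 bytes (24 bits) into four 6-bit values."""
    a, b, c = three_bytes
    word = (a << 16) | (b << 8) | c
    return [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]


def encode_group(three_bytes: bytes) -> str:
    """Map the four 6-bit values to their Base64 characters."""
    return "".join(ALPHABET[value] for value in repack(three_bytes))


if __name__ == "__main__":
    text = b"XXXXYYYYXXXXYYY"  # 15 bytes: five complete groups
    for i in range(0, len(text), 3):
        group = text[i:i + 3]
        print(group, repack(group), encode_group(group))
```

Running it shows that the same input bytes produce different output characters depending on where they fall inside a 3-byte group, which is exactly why the repetition becomes harder to spot.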
Whatever compression method is used, Base64-encoding the data first usually has a severe impact on how well the algorithm performs. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
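The effect is easy to reproduce at a small scale. The sketch below is an assumption-laden experiment, not a benchmark: it uses zlib rather than xz, and the input is synthetic text built from a tiny made-up vocabulary, so the exact numbers will differ from the figures in the question.

```python
import base64
import random
import zlib

# Hypothetical sample vocabulary; word lengths vary so that repeated words
# land on different positions within Base64's 3-byte groups.
WORDS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]


def sample_text(word_count: int = 20000, seed: int = 0) -> bytes:
    """Build repetitive, text-like data from a fixed random seed."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(word_count)).encode("ascii")


def sizes(data: bytes) -> dict:
    """Return the byte counts for each processing order."""
    compressed = zlib.compress(data, 9)
    return {
        "raw": len(data),
        "compressed": len(compressed),
        "base64_then_compress": len(zlib.compress(base64.b64encode(data), 9)),
        "compress_then_base64": len(base64.b64encode(compressed)),
    }


if __name__ == "__main__":
    for name, size in sizes(sample_text()).items():
        print(f"{name:>22}: {size} bytes")
```

Compressing first and encoding second always costs exactly the 4/3 Base64 overhead on top of the compressed size, whereas encoding first makes the compressor's job harder.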
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz is using -- is working on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes  : 0x58     0x58     0x58
As binary    : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57     0x46     0x68     0x59
As binary    : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what base-64 encoding does: It replaces a 3 byte (24 bit) word with a 4 byte word, using only 64 out of 256 possible values. This gives you the x1.33 growth.
Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. They are then no longer compressed as a single substring, but as two separate substrings.
As you have a lot of compression (97%), you apparently have the situation that very long input substrings are compressed as a whole. This means that you will also have many substrings being Base64-expanded past the maximum length the encoder can deal with.
It's not Base64. It's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html

What is the difference between getContentMd5() and getETag() of AWS S3 PutObjectResult?

getETag() seems to return base64 encoded md5. So what does getContentMd5() return? How are they related?
One example:
getContentMd5() -> /yoLi66uT7Q6qaverVTqrQ==
getETag() -> ff2a0b8baeae4fb43aa9abdead54eaad
It's the same value -- the md5 hash of the object -- encoded two different ways.
An MD5 hash consists of 16 bytes, but they are not all printable characters. The ETag is the md5 hash, hex-encoded (not base64, as the question suggests) -- hex encoding uses 32 characters to encode 16 bytes.
Meanwhile, Content-MD5 is the md5 hash, base64 encoded, which uses 24 characters to encode 16 bytes.
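The relationship can be checked directly with the two values from the question. The following Python sketch only re-encodes the same 16 bytes; the helper names are invented for this illustration, and it does not call the AWS SDK.

```python
import base64

# Values copied from the example in the question.
etag_hex = "ff2a0b8baeae4fb43aa9abdead54eaad"
content_md5_b64 = "/yoLi66uT7Q6qaverVTqrQ=="


def hex_to_base64(hex_digest: str) -> str:
    """Re-encode a hex digest as Base64 (16 bytes -> 24 characters)."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")


def base64_to_hex(b64_digest: str) -> str:
    """Re-encode a Base64 digest as hex (16 bytes -> 32 characters)."""
    return base64.b64decode(b64_digest).hex()


if __name__ == "__main__":
    print(hex_to_base64(etag_hex))
    print(base64_to_hex(content_md5_b64))
```

Both directions give back the other value from the question, confirming that the two strings describe the same digest.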

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code will be composed of characters with a value between 0 and 2^m - 1, where m is the exponent of the Galois field GF(2^m) used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get an RS code composed of characters between 0 and 255 (compatible with ASCII, UTF-8 and usual binary encoding). In GF(2^16), you will get characters encoded between 0 and 65535 (compatible with UTF-16 or with UTF-8 if you encode 2 characters as one).
Other than the range of values of each character of a RS code, all the rest can be basically considered random from an external POV (it's not if you have the generator polynomial and Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of a RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer, I am the author). By default, each ecc block is separated by a md5 digest, so that you can clearly separate them. The original data is not stored, but you can easily do that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and just do a slight modification to print or save into a file the original data along the ecc blocks.
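If you would rather generate a few test vectors without any library, a systematic encoder is short enough to write by hand. The sketch below is not pyFileFixity's code; it is a minimal textbook-style encoder over GF(2^8), and its parameters (primitive polynomial 0x11d, generator element 2, first root alpha^0) are assumptions chosen for the example. Other formats use other parameters, so the parity bytes will not necessarily match any particular bar code standard.

```python
PRIM = 0x11D  # assumed primitive polynomial for GF(2^8)

# Exponent and logarithm tables for fast multiplication.
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply two polynomials (highest-degree coefficient first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def generator_poly(nsym):
    """Product of (x - alpha^i) for i in 0..nsym-1; minus equals plus here."""
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(payload, nsym):
    """Return (payload, parity) with the two parts kept clearly separate."""
    if len(payload) + nsym > 255:
        raise ValueError("codeword would exceed 255 symbols")
    gen = generator_poly(nsym)
    rem = list(payload) + [0] * nsym
    for i in range(len(payload)):  # synthetic division by the monic generator
        coef = rem[i]
        if coef:
            for j in range(1, len(gen)):
                rem[i + j] ^= gf_mul(gen[j], coef)
    return bytes(payload), bytes(rem[len(payload):])


def syndromes(codeword, nsym):
    """Evaluate the codeword at each generator root; all zero means no error."""
    result = []
    for i in range(nsym):
        y = 0
        for c in codeword:  # Horner's rule
            y = gf_mul(y, EXP[i]) ^ c
        result.append(y)
    return result


if __name__ == "__main__":
    data, parity = rs_encode(b"hello world", 10)
    print("payload:", data.hex())
    print("parity :", parity.hex())
```

A decoder under test can use syndromes() as a quick sanity check: it returns all zeros for an untouched codeword and at least one non-zero value once a byte has been corrupted.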

Byte to ASCII turns into square characters

I have a USB device. It's a pedometer/activity tracker.
I'm able to send bytes to the device and I'm also able to retrieve bytes back.
But for some reason numbers are turned into those square characters and not into numbers... The characters that are actually letters come through just fine.
Any ideas on how to retrieve the bytes as numbers?
The expected format is something like this:
The square characters are actually binary data, most likely byte values below 0x20 or above 0x7f, which have no printable ASCII representation.
The first 15 bytes are binary; you would need to interpret them using something approximately like the following pseudocode:
if (isprint(byte)) {
    // printable ASCII: show the character as-is
    textToAppendToEditBox(byte);
} else {
    // anything else: show the byte value in hex, e.g. {0a}
    textToAppendToEditBox(someKindOfSprintF("{%02x}", byte));
}
There are plenty of googleable examples of hex dumping code snippets that can make it look pretty.
The expected format that you showed sends binary data. You first have to convert the received data to internal information, then you can pass that information to a std::ostringstream to display it in a GUI.
When reading in the binary data, make sure to respect the endianness used by the device.
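Both suggestions can be sketched in a few lines of Python. Everything about the packet here is hypothetical: the real device's layout is not given in the question, so the 2-byte little-endian counter followed by ASCII text is only an assumption for illustration.

```python
import struct


def render(data: bytes) -> str:
    """Show printable ASCII as-is and every other byte as {hex}."""
    parts = []
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            parts.append(chr(byte))
        else:
            parts.append(f"{{{byte:02x}}}")
    return "".join(parts)


def read_counter(packet: bytes, little_endian: bool = True) -> int:
    """Decode a 16-bit unsigned counter from the first two bytes.

    The field position and width are assumptions, not the device's
    documented format.
    """
    fmt = "<H" if little_endian else ">H"
    (value,) = struct.unpack_from(fmt, packet, 0)
    return value


if __name__ == "__main__":
    packet = b"\x10\x27OK"  # made-up sample packet
    print(render(packet))
    print(read_counter(packet), read_counter(packet, little_endian=False))
```

The same two bytes decode to very different numbers depending on the byte order, which is why the endianness has to be known before interpreting the data.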

Save a big array of numbers in a cookie

I have to save 400 numbers in a cookie on the client's side. Every number is from 1 to 10. Any suggestions on how to organize the data in the cookie to make it compact? (I'm going to work with the cookies using Python on the server side, and JavaScript on the client's side.)
Another solution would be to use `localStorage`:
// Fill 400 entries with sample values from 1 to 10
for (let i = 0; i < 400; i++) {
    localStorage[i] = Math.floor(Math.random() * 10) + 1;
}
console.log(localStorage);
alert(JSON.stringify(localStorage));
http://jsfiddle.net/S2uUn/
OK, if you want to do this entirely on the client side, then let's try another approach. For every three integers (from 1 to 10) that you want to store, there are 1000 combinations (10*10*10=1000). 12 bits give you 1024 combinations (2^12=1024), so you can store 3 integers (each 1-10) using 12 bits. So 400 integers (each 1-10) can be stored using about 1600 bits (400 / 3 * 12; rounding up to 134 whole groups gives 1608 bits). The logic for storing the integers bitwise this way can be implemented easily on the client side using JavaScript. 1600 bits = 200 bytes (1 byte = 8 bits), so you can store 400 integers (each 1-10) using about 200 bytes (201 with the rounding). This is binary data, so to store it in a cookie it has to be converted to ASCII text. Base64 encoding is one way to do that, and it can be done client side using the functions at "How can you encode a string to Base64 in JavaScript?". Base64 encoding produces 4 characters for every 3 bytes encoded, so the resulting string stored in the cookie would be about 267 characters in length (200 * 4 / 3; 268 characters for 201 bytes). The whole thing can be done this way on the client side in JavaScript, and 400 integers (each 1-10) can be stored in a cookie of roughly 267 characters.
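Since the server side is Python, here is a minimal sketch of the 12-bits-per-three-values scheme in that language. The function names pack and unpack are invented for this illustration, and the choice to store each value minus one as a decimal digit is just one possible way to map a triple onto 0..999.

```python
import base64


def pack(values):
    """Pack integers in 1..10 into Base64 text, 12 bits per group of three."""
    digits = [v - 1 for v in values]  # map 1..10 onto 0..9
    while len(digits) % 3:
        digits.append(0)              # pad the last group
    bits, nbits = 0, 0
    for i in range(0, len(digits), 3):
        group = digits[i] * 100 + digits[i + 1] * 10 + digits[i + 2]  # 0..999
        bits = (bits << 12) | group
        nbits += 12
    pad = -nbits % 8                  # align to a whole number of bytes
    raw = (bits << pad).to_bytes((nbits + pad) // 8, "big")
    return base64.b64encode(raw).decode("ascii")


def unpack(text, count):
    """Reverse pack(); count is the number of values originally stored."""
    raw = base64.b64decode(text)
    ngroups = -(-count // 3)          # ceiling division
    bits = int.from_bytes(raw, "big") >> (len(raw) * 8 - ngroups * 12)
    values = []
    for g in range(ngroups):
        group = (bits >> ((ngroups - 1 - g) * 12)) & 0xFFF
        values += [group // 100 + 1, group // 10 % 10 + 1, group % 10 + 1]
    return values[:count]


if __name__ == "__main__":
    sample = [(i * 7) % 10 + 1 for i in range(400)]
    cookie = pack(sample)
    print(len(cookie), unpack(cookie, 400) == sample)
```

The count has to be known when decoding (400 in this case), because the padding in the last group is otherwise indistinguishable from real values.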
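How about writing the list of integers to a comma-separated list, then compressing the list using gzip (which will produce binary output), then piping the binary output from gzip through a Base64 encoder, which will produce text that can be stored as a cookie? I would guess that the result would end up being only about 100 bytes in size, which can easily be stored in a cookie, although that depends on how repetitive the numbers are: 400 uniformly random values from 1 to 10 carry 400 * log2(10) ≈ 1329 bits, so no compressor can get them below about 166 bytes before Base64 encoding.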
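The gzip-then-Base64 pipeline is simple to try out with Python's standard library. This is a rough sketch with made-up helper names; the resulting size depends entirely on the data, so the script prints it rather than assuming a figure.

```python
import base64
import gzip
import random


def to_cookie(values):
    """Comma-separated list -> gzip -> Base64 text."""
    text = ",".join(str(v) for v in values).encode("ascii")
    return base64.b64encode(gzip.compress(text)).decode("ascii")


def from_cookie(cookie):
    """Base64 text -> gunzip -> list of integers."""
    text = gzip.decompress(base64.b64decode(cookie)).decode("ascii")
    return [int(part) for part in text.split(",")]


if __name__ == "__main__":
    rng = random.Random(0)
    random_values = [rng.randint(1, 10) for _ in range(400)]
    repetitive_values = [5] * 400
    print("random    :", len(to_cookie(random_values)), "characters")
    print("repetitive:", len(to_cookie(repetitive_values)), "characters")
```

Trying it with both random and highly repetitive input shows how strongly the cookie size depends on the actual numbers being stored.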