Why base64 a sha1/sha256 hash? - amazon-web-services

Can anybody tell me why Amazon wants a Base64 encoding of the HMAC-SHA1/SHA256 hash?
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/HMACAuth.html
I know that Base64 is there to represent binary data in ASCII, but SHA1/SHA256 output is already ASCII; I mean, it's only hex.
Thanks
Timo

Those hashes are not ASCII. The reason you see hex digits is that the software you use to generate them takes the binary output of the digest and turns it into an ASCII string of hex digits.
For instance, an MD5 digest fills an array of 16 bytes. You can also represent it as a string of 32 characters, but the most basic form of the digest is still the array of bytes.
When you turn an array of bytes into a hex string, you need 8 bits (one full character) to represent every 4 bits of data. Although it's not usually called that, you could say this uses "base16" encoding, since you're grabbing 4 bits at a time and mapping them to a 16-character alphabet.
Base64, on the other hand, grabs 6 bits at a time and maps them to a 64-character alphabet. This means that you need 8 bits (again, one full character) to represent every 6 bits of data, which wastes half as many bits as base16. A base16-encoded string is always twice as big as the original; a base64-encoded string is only four thirds as big. For a SHA-256 hash (32 bytes), base16 takes 64 characters, while base64 takes 44 (43 significant characters plus one '=' of padding).

For example, the byte, hex, and base64 representations below all encode the same data:
bytes: 243 48 133 140 73 157 28 136 11 29 189 101 194 101 116 64 172 227 220 78
hex: f330858c499d1c880b1dbd65c2657440ace3dc4e
base64: 8zCFjEmdHIgLHb1lwmV0QKzj3E4=
In the end, it's simply that AWS requires these values to be Base64 encoded.
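If it helps to see the two representations side by side, here is a minimal, dependency-free C++ sketch (the encoders are hand-rolled for illustration, not production code) that encodes the 20 sample bytes above as hex and as Base64:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

static std::string to_hex(const std::vector<uint8_t>& data) {
    static const char* digits = "0123456789abcdef";
    std::string out;
    for (uint8_t b : data) {
        out += digits[b >> 4];    // high nibble -> one hex character
        out += digits[b & 0x0F];  // low nibble  -> one hex character
    }
    return out;
}

static std::string to_base64(const std::vector<uint8_t>& data) {
    static const char* alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    std::size_t i = 0;
    for (; i + 3 <= data.size(); i += 3) {   // full 3-byte groups -> 4 characters
        uint32_t n = (uint32_t(data[i]) << 16) | (uint32_t(data[i + 1]) << 8) | data[i + 2];
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += alphabet[(n >> 6) & 0x3F];
        out += alphabet[n & 0x3F];
    }
    std::size_t rest = data.size() - i;      // 0, 1 or 2 leftover bytes
    if (rest == 1) {
        uint32_t n = uint32_t(data[i]) << 16;
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += "==";
    } else if (rest == 2) {
        uint32_t n = (uint32_t(data[i]) << 16) | (uint32_t(data[i + 1]) << 8);
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += alphabet[(n >> 6) & 0x3F];
        out += '=';
    }
    return out;
}

int main() {
    // The 20 bytes from the sample above (the size of a SHA-1 digest).
    std::vector<uint8_t> digest = {243, 48, 133, 140, 73, 157, 28, 136, 11, 29,
                                   189, 101, 194, 101, 116, 64, 172, 227, 220, 78};
    std::printf("hex   : %s\n", to_hex(digest).c_str());    // the 40-character hex string above
    std::printf("base64: %s\n", to_base64(digest).c_str()); // 8zCFjEmdHIgLHb1lwmV0QKzj3E4=
    return 0;
}

For these 20 bytes, the hex string is 40 characters and the Base64 string is 28 (27 significant characters plus one '=' of padding), matching the sample above.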

Related

How does 0061 736d represent \0asm?

I just started to learn WebAssembly. I found this text:
"In binary format The first four bytes represent the Wasm binary magic
number \0asm; the next four bytes represent the Wasm binary version in
a 32-bit format"
I am not able to understand this. Can anyone explain it to me?
\0 is a character with code 0 (the first 00 in 0061736d); the remaining three are the literal characters a, s and m, with codes 97, 115 and 109 respectively, or 61, 73 and 6d in hex.
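As a small illustration (just a sketch; the file name is made up), this is how you could read those eight bytes from a .wasm file in C++ and check them:

#include <cstdint>
#include <cstdio>
#include <fstream>

int main() {
    std::ifstream f("module.wasm", std::ios::binary);  // hypothetical file name
    unsigned char header[8];
    if (!f.read(reinterpret_cast<char*>(header), 8)) {
        std::fprintf(stderr, "file too short\n");
        return 1;
    }
    // Bytes 0-3: the magic number 0x00 0x61 0x73 0x6d, i.e. '\0' 'a' 's' 'm'
    bool magic_ok = header[0] == 0x00 && header[1] == 0x61 &&
                    header[2] == 0x73 && header[3] == 0x6d;
    // Bytes 4-7: the version as a 32-bit little-endian integer (currently 01 00 00 00 = 1)
    uint32_t version = header[4] | (uint32_t(header[5]) << 8) |
                       (uint32_t(header[6]) << 16) | (uint32_t(header[7]) << 24);
    std::printf("magic ok: %s, version: %u\n", magic_ok ? "yes" : "no", version);
    return 0;
}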

Why does base64-encoded data compress so poorly?

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:
Original file: 429.7 MiB
compress via xz -9:
13.2 MiB / 429.7 MiB = 0.031, 4.9 MiB/s, 1:28
base64 it and compress via xz -9:
26.7 MiB / 580.4 MiB = 0.046, 2.6 MiB/s, 3:47
base64 the original compressed xz file:
17.8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well; it ends up twice as large as the compressed unencoded file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible encoding, so why does it affect compression so much? (I tried with gzip as well, with similar results.)
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files. I would have thought that, since the actual information density (or whatever it is called) of a base64-encoded file is nearly identical to that of the non-encoded version, it would be similarly compressible.
Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?" And they're not likely to compress that string very well.
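To make that concrete, here is a small C++ sketch that counts the runs of identical characters in the raw string and in its Base64 form (the encoded string is the one shown above):

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Collapse a string into (character, run length) pairs, RLE-style.
static std::vector<std::pair<char, int>> runs(const std::string& s) {
    std::vector<std::pair<char, int>> out;
    for (char c : s) {
        if (!out.empty() && out.back().first == c)
            ++out.back().second;    // extend the current run
        else
            out.push_back({c, 1});  // start a new run
    }
    return out;
}

int main() {
    for (const std::string& s : {std::string("XXXXYYYYXXXXYYYY"),
                                 std::string("WFhYWFlZWVlYWFhYWVlZWQ==")}) {
        auto r = runs(s);
        std::printf("%-26s -> %zu runs: ", s.c_str(), r.size());
        for (const auto& p : r)
            std::printf("%d%c ", p.second, p.first);
        std::printf("\n");
    }
    return 0;
}

The raw string collapses into 4 runs of 4 characters each, while its Base64 form has 23 runs, all of length 1 except the trailing "==", so a run-length encoder has almost nothing to work with.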
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
6-bit repacking : 00010110 00000101 00100001 00011000
As decimal : 22 5 33 24
Base64 characters: 'W' 'F' 'h' 'Y'
Output bytes : 0x57 0x46 0x68 0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
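The same worked example, spelled out as bit operations in a short C++ sketch:

#include <cstdint>
#include <cstdio>

int main() {
    const char* alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    // Pack three identical bytes (0x58 = 'X') into one 24-bit word:
    // 01011000 01011000 01011000
    uint32_t n = (0x58u << 16) | (0x58u << 8) | 0x58u;
    // Pull out four 6-bit groups and map each to the Base64 alphabet.
    for (int shift = 18; shift >= 0; shift -= 6) {
        uint32_t index = (n >> shift) & 0x3F;
        std::printf("%2u -> '%c'\n", index, alphabet[index]);
    }
    // Prints 22 -> 'W', 5 -> 'F', 33 -> 'h', 24 -> 'Y':
    // three equal input bytes become four different output bytes.
    return 0;
}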
Whatever compression method is used, encoding with Base64 first usually has a severe impact on the compressor's performance. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
EDIT - A note about LZMA
As MSalters noticed, LZMA, which xz uses, works on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57 0x46 0x68 0x59
As binary : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what Base64 encoding does: it replaces a 3-byte (24-bit) word with a 4-byte word, using only 64 out of 256 possible values. This gives you the 1.33x growth.
Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. This causes them to be compressed not as a single substring but as two separate ones.
Since you get a lot of compression (97%), you apparently have very long input substrings that are compressed as a whole. This means that many substrings will also be expanded by Base64 past the maximum length the encoder can deal with.
It's not Base64; it's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html

issues using stringstream to handle binary file

I'm working with a binary file from which I need to extract the useful contents. The structure is:
Based on a quick look at the file, you don't have an "unknown amt of nulls" anywhere. The format appears to be:
N Bytes: number of animals, integer as text delimited by '\n'
24 Bytes per animal:
16 Bytes: name of animal padded with 0
4 Bytes: some 32 bit number (little endian)
4 Bytes: another 32 bit number (little endian)
You shouldn't be reading this as a text file, but as a raw binary file. There's no need for a stringstream; you can parse the number of animals by reading one byte at a time and adding it to the previous value multiplied by 10 until you reach '\n', then read the fixed-size records directly, as sketched below.
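A rough sketch of that approach in C++ (the file name, the struct, and its field names are assumptions for illustration, and error handling is minimal):

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <istream>
#include <vector>

struct Animal {
    char name[17];    // 16 bytes in the file, plus room for a terminating '\0'
    uint32_t value1;  // first 32-bit little-endian number
    uint32_t value2;  // second 32-bit little-endian number
};

// Assemble a 32-bit little-endian integer byte by byte.
static uint32_t read_le32(std::istream& in) {
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 4);
    return b[0] | (uint32_t(b[1]) << 8) | (uint32_t(b[2]) << 16) | (uint32_t(b[3]) << 24);
}

int main() {
    std::ifstream in("animals.bin", std::ios::binary);  // hypothetical file name
    if (!in) return 1;

    // The count is ASCII digits terminated by '\n': accumulate it one byte at a time.
    int count = 0, c;
    while ((c = in.get()) != EOF && c != '\n')
        count = count * 10 + (c - '0');

    std::vector<Animal> animals;
    for (int i = 0; i < count && in; ++i) {
        Animal a{};
        in.read(a.name, 16);  // 16-byte zero-padded name
        a.name[16] = '\0';
        a.value1 = read_le32(in);
        a.value2 = read_le32(in);
        animals.push_back(a);
    }

    for (const Animal& a : animals)
        std::printf("%s %u %u\n", a.name, a.value1, a.value2);
    return 0;
}

The two little-endian integers are assembled byte by byte, so the code doesn't depend on the host's endianness or on struct padding.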

Huffman Coding in JPEG

This is the hex content of my JPEG picture (I marked the FFC4 marker on the picture). As you can see, after the byte 0x01 there is the value 0xA2! How can that be possible, when the standard says that the 16 bytes after the 0x10 tell us how many codes there are of each length? It is impossible to have that many codes of 1 bit. Am I wrong?
What you are seeing is the length of the Huffman block in bytes, stored in big-endian order. The length includes the two bytes of the length field itself, so the table data that follows is length - 2 bytes.
The Huffman block is 0x1A2 bytes long.
Following the length there is a single byte with the Huffman table information (the table number, and whether the table is for AC or DC coefficients).
Start reading the code-length counts after that information byte:
Information Byte = 0x00
Number of length 1 codes = 0
Number of length 2 codes = 0
Number of length 3 codes = 7
...
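For illustration, here is a small C++ sketch that decodes that header from a byte buffer positioned just after the FFC4 marker (the buffer holds the values from the walk-through above, truncated; the variable names are my own):

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Bytes following the FFC4 marker, taken from the walk-through above (truncated).
    std::vector<uint8_t> seg = {0x01, 0xA2, 0x00, 0x00, 0x00, 0x07 /* ... */};

    // Two-byte big-endian length; it counts these two bytes as well,
    // so the table data that follows is (length - 2) bytes.
    unsigned length = (unsigned(seg[0]) << 8) | seg[1];

    // Table information byte: high nibble = table class (0 = DC, 1 = AC),
    // low nibble = table identifier.
    unsigned table_class = seg[2] >> 4;
    unsigned table_id = seg[2] & 0x0F;

    std::printf("length: %u (0x%X), class: %u, id: %u\n",
                length, length, table_class, table_id);

    // The next 16 bytes give the number of codes of each length from 1 to 16.
    for (unsigned len = 1; len <= 16 && 2 + len < seg.size(); ++len)
        std::printf("codes of length %2u: %u\n", len, unsigned(seg[2 + len]));
    return 0;
}

With this data it prints a length of 418 (0x1A2), table class 0, table id 0, and the counts 0, 0 and 7 for code lengths 1 to 3.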

characters XOR with caret manipulation

Working with exclusive-OR on bits is something which is clear to me. But here, XOR is working on individual characters. So does this mean the byte which makes up the character is being XORed? What does this look like?
#include <iostream>

int main()
{
    char string[11] = "A nice cat";  // 10 characters plus the terminating '\0'
    char key[11]    = "ABCDEFGHIJ";
    for (int x = 0; x < 10; x++)
    {
        string[x] = string[x] ^ key[x];  // XOR the byte of each character with the key byte
        std::cout << string[x];
    }
    return 0;
}
I know bits XORed look like this:
1010
1100
0110
XOR has the nice property that if you XOR something twice using the same data, you obtain the original. The code you posted is some rudimentary encryption function, which "encrypts" a string using a key. The resulting ciphertext can be fed through the same program to decrypt it.
In C and C++ strings are usually stored in memory as 8-bit char values where the value stored is the ASCII value of the character.
Your code is therefore XORing the ASCII values. For example, the second character in your output is calculated as follows:
'B' ^ ' '
= 66 ^ 32
= 01000010 ^ 00100000
= 01100010
= 98
= 'b'
You could get a different result if you ran this code on a system which uses EBCDIC instead of ASCII.
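As a quick sketch of that round-trip property (same string and key as in the question):

#include <cstddef>
#include <iostream>
#include <string>

int main() {
    std::string text = "A nice cat";
    const std::string key = "ABCDEFGHIJ";

    for (std::size_t i = 0; i < text.size(); ++i)
        text[i] ^= key[i];      // "encrypt": text is now "\0b--&#g+(>"

    for (std::size_t i = 0; i < text.size(); ++i)
        text[i] ^= key[i];      // XOR with the same key again

    std::cout << text << '\n';  // prints "A nice cat" again
    return 0;
}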
The xor on characters performs the xor operation on each corresponding bit of the two characters (one byte each).
So does this mean the byte which makes up the character is being XORed?
Exactly.
What does this look like?
Like any other XOR :). In ASCII, "A nice cat" is (in hexadecimal)
41 20 6E 69 63 65 20 63 61 74
and "ABCDEFGHIJ" is
41 42 43 44 45 46 47 48 49 4A
so, if you XOR the corresponding bytes, you get
00 62 2D 2D 26 23 67 2B 28 3E
which is the hexadecimal representation of "\0b--&#g+(>", i.e. the string that is displayed when you run that code.
Notice that if you XOR the resulting text again you get back the text you started with; this is the reason why XOR is often used in encoding and ciphering.
This is a simple demonstration of one-time pad encryption, which, as you can see, is quite simple and also happens to be the only provably unbreakable form of encryption (provided the key is truly random, kept secret, and never reused). Because it is symmetric and needs a key as large as the message, it's often not practical, but it still has a number of interesting applications. :-)
One fun thing to notice, if you're not already familiar with it, is the symmetry between the key and the ciphertext. After generating them, there's no way to tell which one is which, i.e. which one was created first and which was derived from the plaintext XORed with the other. Aside from basic encryption, this also leads to applications in plausible deniability.