How does AES_set_encrypt_key handle short keys? - c++

I'm writing a set of tools where a C++ application encrypts data with the AES encryption standard and a Java app decrypts it. As far as I know, the key length has to be 16 bytes. But when I tried to use passwords of different lengths, I came across the following behaviour of the AES_set_encrypt_key function:
length = 16 : nothing special happens, as expected
length > 16 : password gets cut after the sixteenth character
length < 16 : the password seems to get filled in "magically"
So, does anyone know what exactly happens in the last case?
Btw: Java throws an exception if the password is not exactly 16 chars long
Thanks,
Robert

Don't confuse byte array with C-String. Every C-String is a byte array, but not every byte array is a C-String.
The concept with AES is to use a "key". It acts like a password, but the concept is a little different: it has a fixed size, and in your case it must be 16 bytes.
The key is a byte array of 16 bytes that is NOT a C-String. It means it can have any value at any point in the buffer, while a C-String must be null-terminated (the '\0' in the end of your content).
When you give a C-String to your AES, it still interprets it as a buffer, ignoring any \0 character on the way. In other words, if your string is "something", the buffer is in fact "something\0??????", where "??????" here means any random trash bytes that cannot be guaranteed to work all the time.
Why does a key length < 16 appear to work? In a debug build, a freshly allocated buffer is often filled with a default value, which is why the result repeats in your case. But this changes with compiler and/or platform, so take care.
And for key length > 16, AES just picks the first 16 bytes of your buffer and ignores the rest.
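For illustration, here is a minimal C++ sketch. AES_set_encrypt_key is the real OpenSSL call; the password_to_key helper is made up, and a real application should derive the key with a proper KDF (such as PBKDF2) rather than zero-padding:

#include <openssl/aes.h>
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: turn a password of any length into a fixed 16-byte key
// by copying at most 16 bytes into a zero-filled buffer, so there are no
// uninitialised "trash bytes" after a short password.
static void password_to_key(const char *password, unsigned char key[16]) {
    std::memset(key, 0, 16);
    std::memcpy(key, password, std::min(std::strlen(password), std::size_t(16)));
}

int main() {
    unsigned char key[16];
    password_to_key("short pw", key);   // shorter than 16 bytes: the rest stays 0x00

    AES_KEY aes_key;
    // AES_set_encrypt_key reads exactly bits/8 = 16 bytes from `key`,
    // regardless of where any '\0' terminator happens to sit.
    AES_set_encrypt_key(key, 128, &aes_key);
    return 0;
}

Doing the same derivation on the Java side gives both programs a deterministic 16-byte key instead of relying on whatever happens to follow the terminator in memory.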

Can a file end with a 0x80 char?

I'm implementing my own version of a Blowfish encoder/decoder. I use a standard padding of 0x80 if necessary.
My question is whether I need to add padding chars even when I don't need them, because in the case of a file that naturally ends with 0x80, the decoding step would remove this character, which is the wrong action, since the 0x80 is part of the file itself.
This can of course be solved by always adding a final padding char, even if the total number of characters is a multiple of the cipher block (64 bits in this case). I can implement this countermeasure, but first I'd like to know whether I really need it.
The natural follow-up is to wonder whether this particular char was chosen because it never occurs at the end of a file (so the wrong situation above never happens), but I'm not sure about that at all.
Thanks! and sorry for the dummy question..
On Linux and other filesystems a file can contain any sequence of bytes, so ideally you cannot depend on any particular byte value to mark the end of the file (although EOF is there..!!).
What I am suggesting is what most file formats already do.
You can have a specific 4-5 magic-byte header for your file format, followed by the size of the remaining bytes. After that many bytes you know where your last byte is.
Edit:
With the above suggestion, the encoder needs to update the size field whenever new data is appended to the file.
If you do not want that, you can split your data into particular chunks and encode them packet by packet, so your file becomes a sequence of packets. Such things are used in NAL units.
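A purely illustrative sketch of the header layout suggested above (field names invented; struct padding and endianness are ignored here):

#include <cstdint>

// Possible on-disk layout: a few magic bytes identifying the format, followed
// by the payload size, so the reader never has to rely on a terminating byte.
struct FileHeader {
    unsigned char magic[4];      // e.g. 'M','Y','F','1'
    std::uint64_t payload_size;  // number of encrypted bytes that follow
};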
Blowfish is a block cipher. It always takes 64 bit input and outputs 64 bit output. If you want to encrypt a stream that is not a multiple of 64 bit long you will need to add some padding bytes. When you decrypt the encrypted stream you always get a multiple of 64 bit. But you have no information if the encrypted stream contained 'real' data or padding bytes. You need to keep track of that yourself. A simple approach would be to store the set of 'data length' and 'encrypted stream'. Another approach would be to prepend the clear text stream with a data length value, for example a 64 bit unsigned integer. Then after decrypting the encrypted stream you will have that length as the first value and then you know how many bytes of the last block are real data and how many are just padding.
And regarding your question about what bytes can be at the end of a file: any. You can have files with arbitrary content. Each byte in the file can be of any value, there is no restriction.
A regular binary file can contain any byte sequence, so a file can end with 0x80, with NULL, or with anything else.
If you are talking about some specific standard, then it depends. However, I don't think there is any file type that cannot contain some specific character at the end. I know of file types that ignore as many trailing characters as needed (because the header determines the size), so you could do the same; but I have never heard of file data being illegal at the end (except in corrupted files).
So, as mentioned, use a header: reserve, for example, 8 bytes that determine the size. That is the easy solution.
Also, before asking such a question, you should ask yourself why a file should have to end with some special character at all.
The answer is Yes. On every operating system in current use, a file can end with any possible sequence of bytes. In fact, you should generate such a file to test your implementation.
In the general case you cannot recognise trailing padding characters or remove them reliably without knowing the length of the file. Therefore encoding the length of the file must be part of your cryptographic protocol.
Simply put the length of the file at the beginning and encrypt the whole thing, including any padding bytes you like (random is probably best). Once decrypted, you will have the file length to tell you where to truncate.
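A minimal sketch of this length-prefix idea, assuming a 64-bit length field and the 8-byte Blowfish block size (the function name is made up):

#include <cstdint>
#include <cstring>
#include <vector>

// Sketch only: prepend the plaintext length as a 64-bit value, then pad the
// buffer to a multiple of the 8-byte Blowfish block size before encrypting.
std::vector<unsigned char> frame_for_blowfish(const std::vector<unsigned char> &data) {
    std::vector<unsigned char> out(sizeof(std::uint64_t));
    std::uint64_t len = data.size();
    std::memcpy(out.data(), &len, sizeof(len));       // length prefix (host byte order)
    out.insert(out.end(), data.begin(), data.end());  // payload
    while (out.size() % 8 != 0)                       // pad up to a 64-bit boundary
        out.push_back(0x00);                          // padding value does not matter
    return out;
}

After decryption, read the first 8 bytes back into a uint64_t and truncate the buffer to that many payload bytes; the padding is discarded without ever inspecting its value.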

Storing hexadecimal addresses in a file

I have a pintool application which stores the memory addresses accessed by an application in a file. These addresses are in hexadecimal form. If I write these addresses as strings, it will take a huge amount of storage (nearly 300GB). Writing such a large file will also take a large amount of time. So I am thinking of an alternate way to reduce the amount of storage used.
Each character of a hexadecimal address represents 4 bits and each ASCII character is 8 bits, so I am thinking of representing two hexadecimal characters with one ASCII character.
For example :
if my hexadecimal address is 0x26234B
then the corresponding converted ASCII address will be &#K (the 0x is ignored as I know all addresses will be hexadecimal).
I want to know whether there is any other, more efficient method for doing this which takes less storage.
NOTE: I am working in C++.
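A minimal sketch of the nibble-packing idea described in the question (the helper name is invented; it assumes an even number of hex digits):

#include <cstdint>
#include <string>
#include <vector>

// Pack a hex string such as "26234B" into raw bytes, so each output byte
// holds two 4-bit hex digits: "26234B" -> 0x26 0x23 0x4B.
std::vector<std::uint8_t> pack_hex(const std::string &hex) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out.push_back(static_cast<std::uint8_t>(std::stoi(hex.substr(i, 2), nullptr, 16)));
    return out;
}

This halves the size of the textual form; the answer below goes further by writing fixed-width binary integers instead.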
This is a good start. If you really want to go further, you can consider compressing the data using something like a zip library or Huffman encoding.
Assuming your addresses are 64-bit pointers, and that such a representation is sensible for your platform, you can just store them as 64-bit ints. For example, the address 0x1234567890abcdef could be stored as the eight bytes:
12 34 56 78 90 ab cd ef
(your pointer, stored in 8 bytes.)
or the same, but backwards, depending on what endianness you choose. Specifically, you should read this.
We can even do this somewhat platform-independently: uintptr_t is an unsigned integer type the same width as a pointer (assuming one exists, which it usually does, but it's not a sure thing), and sizeof(our_pointer) gives us the size in bytes of a pointer. We can arrive at the above bytes with:
Convert the pointer to an integer representation (i.e., 0x0026234b)
Shift the bytes around to pick out the one we want.
Stick it somewhere.
In code:
unsigned char buffer[sizeof(YourPointerType)];
for (unsigned int i = 0; i < sizeof(YourPointerType); ++i) {
    // Shift the byte we want down into the low 8 bits, then mask it off.
    buffer[i] = (
        (reinterpret_cast<uintptr_t>(your_pointer) >> (8 * (sizeof(YourPointerType) - i - 1)))
        & 0xff
    );
}
Some notes:
That'll do a >> 0 on the last loop iteration, which is well-defined; what you must avoid is shifting by the full width of the type or more, which is undefined behavior.
This will write out pointers of the size of your platform, and requires that they can be converted sensibly to integers. (I think uintptr_t won't exist if this isn't the case.) It won't do the same thing on 64- as it will on 32-bit platforms, as they have different pointer sizes. (Or any other pointer-sized platform you run across.)
A program's pointers aren't valid once the program dies, and might not even remain valid when the program is still running. (If the pointer points to memory that the program decides to free, then the pointer is invalid.)
There's likely a library that'll do this for you. (struct, in Python, does this.)
The above is a big-endian encoder. Alternatively, you can write out little endian — the Wikipedia article details the difference.
Last, you can just cast a pointer to the pointer itself to unsigned char *, and write that. (I.e., dump the actual bytes of the pointer variable to a file.) That's way more platform-dependent, though.
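A sketch of that raw-dump idea, assuming C stdio (the function name is invented):

#include <cstdio>

// Write the bytes of the pointer variable itself; the byte order and size
// are whatever the host platform uses, so this is not portable.
static void write_pointer_raw(std::FILE *out, const void *ptr) {
    std::fwrite(&ptr, sizeof(ptr), 1, out);
}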
If you need to save even more space, I'd run it through gzip.

Data encryption using OpenSSL AES-256-CBC mode doesn't return the same data size even when the input doesn't need padding?

I am trying to use OpenSSL AES to encrypt my data. I found a pretty nice example at this link: http://saju.net.in/code/misc/openssl_aes.c.txt
But the question I still couldn't find an answer to is why it pads the data even when the input is already a multiple of the key size.
For example, it needs 16 bytes as input to encrypt, or any multiple of 16.
I gave it 1024 bytes (including the null), and it still gives me an output of size 1040,
but as far as I know, AES input size = output size if the input is a multiple of 128 bits / 16 bytes.
Has anyone tried this example before me, or can anyone give me any idea?
Thanks in advance.
Most padding schemes require that some minimum amount of padding always be added. This is (at least primarily) so that on the receiving end, you can look at the last byte (or some small amount of data at the end) and know how much of the data at the end is padding, and how much is real data.
For example, a typical padding scheme puts zero bytes after the data with one byte at the end containing the number of bytes that are padding. For example, if you added 4 bytes of padding, the padding bytes (in hex) would be something like 00 00 00 04. Another common possibility puts that same value in all the padding bytes, so it would look like 04 04 04 04.
On the receiving end, the algorithm has to be ready to strip off the padding bytes. To do that, it looks at the last byte to tell it how many bytes of data to remove from the end and ignore. If there's no padding present, that's going to contain some value (whatever the last byte in the message happened to be). Since it has no way to know that no padding was added, it looks at that value, and removes that many bytes of data -- only in this case, it's removing actual data instead of padding.
Although it might be possible to devise a padding scheme that avoided adding extra data when/if the input happened to be an exact multiple of the block size, it's a lot simpler to just add at least one byte of padding to every message, so the receiver can count on always reading the last byte and finding how much of what it received is padding.
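For concreteness, here is a minimal sketch of PKCS#7-style padding for a 16-byte block (helper names invented): every message gets between 1 and 16 padding bytes, each holding the pad length, which is why a 1024-byte input grows to 1040 bytes.

#include <cstddef>
#include <stdexcept>
#include <vector>

std::vector<unsigned char> pkcs7_pad(std::vector<unsigned char> data, std::size_t block = 16) {
    std::size_t pad = block - (data.size() % block);   // always 1..block, never 0
    data.insert(data.end(), pad, static_cast<unsigned char>(pad));
    return data;
}

std::vector<unsigned char> pkcs7_unpad(std::vector<unsigned char> data, std::size_t block = 16) {
    if (data.empty() || data.size() % block != 0)
        throw std::runtime_error("bad padded length");
    std::size_t pad = data.back();                     // last byte says how much to strip
    if (pad == 0 || pad > block)
        throw std::runtime_error("bad padding value");
    data.resize(data.size() - pad);
    return data;
}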

strlen() gives wrong size because of null bytes in array

I have a dynamic char array that was deserialized from a stream.
Content of char *myarray in the file (viewed with a hex editor):
4F 4B 20 31 32 20 0D 0A 00 00 B4 7F
strlen(myarray) returns 8 (it should be 12).
strlen counts the characters up to the first zero byte; that's what it's for.
If you want to know the length of the deserialized array, you must get that from somewhere else, the deserialization code should know how large an array it deserialized.
strlen(myarray) returns the index of the first 00 in myarray.
Which language are you asking about?
In C, you'll need to remember the size and pass it to anything that needs to know it. There's no (portable) way to determine the size of an allocated array given just a pointer to it and, as you say, strlen and other functions that work with zero-terminated strings won't work with unterminated lumps of data.
In C++, use std::string or std::vector<char> to manage a dynamic array of bytes. Both of these make the size available, as well as handling deallocation for you.
The 9th char is 00, i.e. '\0'. This is the reason you are getting 8 instead of 12: strlen() takes it as the null terminator.
A C-style string is terminated by a null byte, and your char* contains a null byte at the 9th position; thus strlen returns 8, as it counts the elements until it finds a null byte.
(from http://www.cplusplus.com/reference/clibrary/cstring/strlen/):
A C string is as long as the amount of characters between the beginning of the string and the terminating null character.
As you're using the char* for binary data, you must not use the strlen function, but remember (pass along) the size of the char array.
In your case, you could serialize the size of the dynamic array on transmission, and deserialize it before allocating / reading the array.
From cplusplus.com (http://www.cplusplus.com/reference/clibrary/cstring/strlen/):
The length of a C string is determined by the terminating
null-character
You cannot expect it to count the whole string if you have a '\0' in the middle of it.
In such scenarios I have found it best to serialize the length of the message alongside the data in the stream. For example, you first serialize the length of the char array (12) and then serialize the actual data (the characters). That way, when you read the data back, you first read the length and then you can read that many characters and be sure that is your char array.
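A sketch of that length-prefixed approach (function names invented; the length is written in host byte order for brevity):

#include <cstdint>
#include <istream>
#include <ostream>
#include <vector>

// Write the byte count first, then the raw bytes, so embedded '\0' bytes never matter.
void write_blob(std::ostream &out, const std::vector<char> &data) {
    std::uint32_t len = static_cast<std::uint32_t>(data.size());
    out.write(reinterpret_cast<const char *>(&len), sizeof(len));
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
}

std::vector<char> read_blob(std::istream &in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char *>(&len), sizeof(len));
    std::vector<char> data(len);
    in.read(data.data(), len);
    return data;
}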

Defining a byte in C++

In http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.6, it is written that
"Another valid approach would be to define a "byte" as 9 bits, and simulate a char* by two words of memory: the first could point to the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add extra instructions when compiling code using char* pointers."
I couldn't understand what is meant by simulating a char* by two words of memory, or the rest of the quote.
Could somebody please explain it by giving an example?
I think this is what they were describing:
The PDP-10 referenced in the second paragraph had 36-bit words and was unable to address anything inside of those words. The following text is a description of one way that this problem could have been solved while fitting within the restrictions of the C++ language spec (that are included in the first paragraph).
Let's assume that you want to make 9-bit-long bytes (for some reason). By the spec, a char* must be able to address individual bytes. The PDP-10 can't do this, because it can't address anything smaller than a 36-bit word.
One way around the PDP-10's limitations would be to simulate a char* using two words of memory. The first word would be a pointer to the 36-bit word containing the char (this is normally as precise as the PDP-10's pointers allow). The second word would indicate an offset (in bits) within that word. Now, the char* can access any byte in the system and complies with the C++ spec's limitations.
ASCII-art visual aid:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 |
-------------------------------------------------------------------------
|              Word 1               |              Word 2               |
|             (Address)             |             (Offset)              |
-------------------------------------------------------------------------
Say you had a char* with word1 = 0x0100 and word2 = 0x12. This would point to the 18th bit (the start of the third byte) of the 256th word of memory.
If this technique was really used to generate a conforming C++ implementation on the PDP-10, then the C++ compiler would have to do some extra work with juggling the extra bits required by this rather funky internal format.
The whole point of that article is to illustrate that a char isn't always 8 bits. It is at least 8 bits, but there is no defined maximum. The internal representation of data types is dependent on the platform architecture and may be different than what you expect.
Since the C++ spec says that a char* must point to individual bytes, and the PDP-6/10 does not allow addressing individual bytes in a word, you have a problem with char* (which is a byte pointer) on the PDP-6/10
So one work around is: define a byte as 9 bits, then you essentially have 4 bytes in a word (4 * 9 = 36 bits = 1 word).
You still can't have char* point to individual bytes on the PDP-6/10, so instead have char* be made up of 2 36-bit words. The lower word would be the actual address, and the upper word would be some byte-mask magic that the C++ compiler could use to point to the right 9bits in the lower word.
In this case,
sizeof(int*) (36 bits) is different from sizeof(char*) (72 bits).
It's just a contrived example that shows how the spec doesn't constrain primitives to specific bit/byte sizes.
data: [char1|char2|char3|char4]
To access char1:
ptrToChar = &data
index = 0
To access char2:
ptrToChar = &data
index = 9
To access char3:
ptrToChar = &data
index = 18
...
then to access a char, you would compute:
(*ptrToChar >> index) & 0x001ff
(with index counting bits from the least-significant end of the word), and ptrToChar and index would be kept together in some sort of structure that the compiler creates, so they stay associated with each other.
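As a toy illustration of that two-word simulation in ordinary C++ (not actual PDP-10 code; like the pseudocode above, it assumes the offset counts bits from the least-significant end of the word):

#include <cstdint>

// Simulated "char*" for 9-bit bytes packed four to a 36-bit word
// (a 64-bit integer stands in for the 36-bit machine word here).
struct NineBitCharPtr {
    const std::uint64_t *word;   // which word the byte lives in
    unsigned bit_offset;         // 0, 9, 18 or 27: which byte inside that word

    unsigned read() const {
        return static_cast<unsigned>((*word >> bit_offset) & 0x1FF);  // one 9-bit byte
    }
    void advance() {             // the equivalent of ++ptr
        bit_offset += 9;
        if (bit_offset >= 36) { bit_offset = 0; ++word; }
    }
};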
Actually, the PDP-10 can address (load, store) 'bytes' smaller than a (36-bit) word with a single word pointer. On the -10, a byte pointer includes the word address containing the 'byte', the width (in bits) of the 'byte', and the position (in bits from the right) of the 'byte' within the word. Incrementing the pointer (with an explicit increment, or an increment-and-load/deposit instruction) increments the position part (by the size part) and handles overflow to the next word address. (No decrementing, though.) A byte pointer can e.g. address individual bits, but 6, 8, 9, 18(!) were probably common, as there were specially-formatted versions of byte pointers (global byte pointers) that made their use somewhat easier.
Suppose a PDP-10 implementation wanted to get as close to having 8-bit bytes as possible. The most reasonable way to split up a 36-bit word (the smallest unit of memory that the machine's assembly language can address) is to divide the word into four 9-bit bytes. To access a particular 9-bit byte, you need to know which word it's in (you'd use the machine's native addressing mode for that, using a pointer which takes up one word), and you'd need extra data to indicate which of the 4 bytes inside the word is the one you're interested in. This extra data would be stored in a second machine word. The compiler would generate lots of extra instructions to pull the right byte out of the word, using the extra data stored in the second word.