Can a file end with a 0x80 char? - c++

I'm implementing my own version of a Blowfish encoder/decoder. I pad the last block with 0x80 when necessary.
My question is whether I need to add padding even when the input doesn't need it. If a file naturally ends with 0x80, the decoder will strip that byte, which is wrong because the 0x80 is part of the file itself.
This can of course be solved by always adding a final padding byte, even when the total number of bytes is a multiple of the block size (64 bits in this case). I can implement this countermeasure, but first I'd like to know whether I really need it.
A natural follow-up is whether this byte was chosen because it never occurs in a file (so the situation above never arises), but I'm not sure at all.
Thanks, and sorry for the dumb question.

On Linux and other systems, a file can contain any sequence of bytes, so you cannot depend on any particular byte value to mark the end of the data. (EOF is a condition reported by the I/O layer, not a byte stored in the file.)
What I am suggesting is what most file formats do.
Give your file format a header of 4-5 specific magic bytes, followed by the size of the remaining data. The reader then knows exactly where the last real byte is.
Edit:
With the suggestion above, the encoder needs to update the size field whenever it adds new data to the file.
If you do not want that, split your data into chunks of a particular size and encode them packet by packet. Your file then consists of a number of packets. This approach is used in NAL units, for example.

Blowfish is a block cipher. It always takes 64 bit input and outputs 64 bit output. If you want to encrypt a stream that is not a multiple of 64 bit long you will need to add some padding bytes. When you decrypt the encrypted stream you always get a multiple of 64 bit. But you have no information if the encrypted stream contained 'real' data or padding bytes. You need to keep track of that yourself. A simple approach would be to store the set of 'data length' and 'encrypted stream'. Another approach would be to prepend the clear text stream with a data length value, for example a 64 bit unsigned integer. Then after decrypting the encrypted stream you will have that length as the first value and then you know how many bytes of the last block are real data and how many are just padding.
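The length-prefix approach described above could look roughly like this. It is only a sketch: the names frame_plaintext and unframe_plaintext are made up for illustration, the big-endian byte order of the length field is an arbitrary choice, and the actual Blowfish encryption of the framed buffer is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Blowfish works on 64-bit (8-byte) blocks.
constexpr std::size_t kBlockSize = 8;

// Hypothetical helper: prepend the data length as a 64-bit big-endian
// integer, then zero-pad the result to a multiple of the block size.
std::vector<std::uint8_t> frame_plaintext(const std::vector<std::uint8_t>& data) {
    std::vector<std::uint8_t> out;
    const std::uint64_t len = data.size();
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(len >> shift));
    }
    out.insert(out.end(), data.begin(), data.end());
    while (out.size() % kBlockSize != 0) {
        out.push_back(0);  // the padding value is irrelevant, the length decides
    }
    return out;
}

// Hypothetical helper: read the length back and drop the padding.
std::vector<std::uint8_t> unframe_plaintext(const std::vector<std::uint8_t>& framed) {
    if (framed.size() < 8) {
        throw std::runtime_error("framed buffer is too short");
    }
    std::uint64_t len = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        len = (len << 8) | framed[i];
    }
    if (len > framed.size() - 8) {
        throw std::runtime_error("length field exceeds available data");
    }
    return std::vector<std::uint8_t>(
        framed.begin() + 8,
        framed.begin() + 8 + static_cast<std::ptrdiff_t>(len));
}
```

You would encrypt the output of frame_plaintext block by block and call unframe_plaintext on the decrypted bytes. Because the length is stored explicitly, a file that really ends with 0x80 round-trips unchanged.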
And regarding your question about what bytes can be at the end of a file: any. You can have files with arbitrary content. Each byte in the file can be of any value, there is no restriction.

A regular binary file can contain any byte sequence, so a file can end with 0x80, with 0x00, or with any other byte.
If you are talking about a specific file format, it depends. However, I don't think there is a file type that forbids a particular byte at the end. I know of file types that ignore trailing bytes that are not needed (because the header determines the size), so you should do the same, but I have never heard of illegal file data (except in cracked files).
So, as mentioned, use a header: reserve, for example, 8 bytes that hold the size. That is an easy solution.
Also, before asking such a question, you should ask yourself: why would a file have to end with some special character?

The answer is Yes. On every operating system in current use, a file can end with any possible sequence of bytes. In fact, you should generate such a file to test your implementation.
In the general case you cannot recognise trailing padding characters or remove them reliably without knowing the length of the file. Therefore encoding the length of the file must be part of your cryptographic protocol.
Simply put the length of the file at the beginning and encrypt the whole thing, including any padding bytes you like (random is probably best). Once decrypted, you will have the file length to tell you where to truncate.

Related

c++ How can I change the size of a void* according to a file I want to process

I am currently trying to write a program that can read a .blend file. "Trying" is the important part, since I am already stuck on reading the file block info.
I'm going to quickly explain my problem; please refer to this page for context.
In the .blend header there is a char that determines whether the pointer size, used later in the file info block (or just fileBlock on the linked webpage) among other things, is 4 or 8 bytes. From what I have read, in C++ the size of a void pointer depends only on the target platform the program was compiled for (8 bytes for 64-bit and 4 bytes for 32-bit). However, .blend files can have either size, regardless of the platform, I presume.
Since Blender itself reads its own files using C, there must be a way to adapt the pointer to the required pointer size according to the info in the header. My best guess would be to dynamically allocate a void pointer array of either one or two pointers, but that makes actually using the data even more complicated.
Please help me find the intended way of handling the different pointer sizes!
Go back to the top of the wiki page and you will find the File Header structure. The header of a blend file starts with "BLENDER" which is followed by the pointer size for the file -
Size of a pointer
All pointers in the file are stored in this format
'_' (underscore) means 4 bytes or 32 bit
'-' (minus) means 8 bytes or 64 bits.
So by reading the eighth byte of the file you know the size of the pointers in the file.
// file_bytes holds the first bytes of the file; compare against char literals
if (file_bytes[7] == '_')
    ptr_size = 4;
else if (file_bytes[7] == '-')
    ptr_size = 8;
The copy of blender creating the file determines the sizes used for the file, so a 32bit build will save 32bit pointers in the file while a 64 bit build will save 64bit pointers.
You should also read the next byte, it tells you whether the file was saved as big or little endian, to see if you need to do any byte swapping. The use of blender on big endian machines might be getting smaller, but you may still come across big endian files.
Another important thing that doesn't seem to be mentioned is that blend files can be compressed, and often are. Reading a compressed blend file means using gzread() to read the file. A compressed file has its first two bytes set to 0x1f 0x8b.
You will find the code that blender uses to read blend files in source/blender/blenloader.
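Putting the header checks above together, a minimal sketch might look like this. BlendHeader and parse_blend_header are names invented for this example, and the 'v' (little endian) / 'V' (big endian) encoding of the ninth byte is taken from my reading of the file format description, so verify it against the wiki page. It assumes the file is not compressed.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical struct holding the two header facts discussed above.
struct BlendHeader {
    std::size_t ptr_size;  // 4 or 8 bytes per stored pointer
    bool little_endian;    // byte order used by the saving machine
};

// Hypothetical parser for the start of an uncompressed .blend file.
BlendHeader parse_blend_header(const unsigned char* bytes, std::size_t len) {
    if (len < 9 || std::memcmp(bytes, "BLENDER", 7) != 0) {
        throw std::runtime_error("not an uncompressed .blend header");
    }
    BlendHeader header;
    if (bytes[7] == '_') {
        header.ptr_size = 4;
    } else if (bytes[7] == '-') {
        header.ptr_size = 8;
    } else {
        throw std::runtime_error("unknown pointer size marker");
    }
    // Assumption: 'v' marks little endian, 'V' marks big endian.
    if (bytes[8] == 'v') {
        header.little_endian = true;
    } else if (bytes[8] == 'V') {
        header.little_endian = false;
    } else {
        throw std::runtime_error("unknown endianness marker");
    }
    return header;
}
```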
Yup, that's painful. The solution is not to treat them as C++ pointers at all. Instead, create your own class BlendPointer to abstract this away. Those would be read from a BlendFile, and that BlendFile would store whether its BlendPointers are 4 or 8 bytes on disk.
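A rough sketch of that idea, assuming the pointer size and byte order have already been read from the header. BlendPointer is the name suggested above; read_blend_pointer and its parameters are invented for illustration. The stored value is treated as an opaque 64-bit identifier and is never dereferenced.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// An address as stored in the file. It is only an identifier used to
// match blocks to each other, never a real pointer in this process.
struct BlendPointer {
    std::uint64_t value;
};

// Hypothetical reader: decodes ptr_size (4 or 8) bytes into a BlendPointer,
// honouring the byte order recorded in the file header.
BlendPointer read_blend_pointer(const unsigned char* bytes,
                                std::size_t ptr_size,
                                bool little_endian) {
    if (ptr_size != 4 && ptr_size != 8) {
        throw std::invalid_argument("pointer size must be 4 or 8");
    }
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < ptr_size; ++i) {
        // Walk from the most significant stored byte to the least.
        const std::size_t idx = little_endian ? ptr_size - 1 - i : i;
        v = (v << 8) | bytes[idx];
    }
    BlendPointer p;
    p.value = v;
    return p;
}
```

Widening 4-byte pointers into the same 64-bit field means the rest of the reader never has to care which size the file used.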

Interleaving bzip2 and non-bzip2 data

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: Is there a magic number I can pick for the second block type that would be very unlikely (or better, impossible) to find inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.
No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.
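The backwards search for the end marker could be sketched like this. The function names are made up for this example, and it assumes bits are numbered most significant bit first within each byte, as bzip2 writes them. A match is still only a strong hint, since the same bit pattern could in principle occur by chance.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// bzip2 end-of-stream marker (48 bits), as quoted above.
constexpr std::uint64_t kEosMarker = 0x177245385090ULL;

// Read 48 bits starting at bit position pos, where bit 0 is the most
// significant bit of buf[0].
std::uint64_t read48(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 48; ++i) {
        const std::size_t bit = pos + i;
        v = (v << 1) | ((buf[bit / 8] >> (7 - bit % 8)) & 1u);
    }
    return v;
}

// Hypothetical helper: end_byte is the offset where the next stream
// component starts (must be <= buf.size()). Returns the bit position of
// the marker, or -1 if it is not found 80 to 87 bits before end_byte.
long long find_eos_marker(const std::vector<std::uint8_t>& buf,
                          std::size_t end_byte) {
    const std::size_t end_bit = end_byte * 8;
    for (std::size_t back = 80; back <= 87; ++back) {
        if (back > end_bit) {
            break;  // not enough data before end_byte
        }
        const std::size_t pos = end_bit - back;
        if (read48(buf, pos) == kEosMarker) {
            return static_cast<long long>(pos);
        }
    }
    return -1;
}
```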

Writing to a text file, binary vs ascii

So I am having the hardest time trying to understand this concept. I have a program that reads a text file, and writes it to another file and replaces the most common words with unsigned chars. But what I cannot for the life of me understand is how then do I determine the difference between the two.
If I write to the new file the original char I read in or an unsigned char value corresponding to 1-255, how then do I determine the difference when I go back in reverse to the original file contents?
When you write a file as binary, a number such as 1253553 is written as the raw bytes of its integer type (typically 4 bytes for an int, depending on the platform) rather than as seven digit characters. So, in a binary file, you will see a short sequence of bytes representing that number. For chars it makes no difference, as each char is represented by one byte.
Usually, you have to have some well known and obvious way to determine the format of your file.
One way to do this is to create your own file extension. You could naively expect that any file with that extension is in your compressed format, but it's actually quite likely other files out there have the same extension (e.g., ".dat" is probably a bad choice). So, you'll want to take further steps, like having the first few bytes of the file be something that is unlikely to be there in any other file (some "magic numbers"). Let's use two bytes, and let's simply choose 0xAB 0xCD as those two bytes.
So, when your program is presented with a file that has the proper extension, open it and read the first two bytes. If they're 0xAB and 0xCD, you can assume you're reading your special format.
This isn't a very strong way of accomplishing this task, but it is one way of doing it. You could get more extravagant if you like.
For more information, you might want to read the Wikipedia page on the subject. It's a start.
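The two-byte check described above could be sketched as follows. has_magic is a made-up name, and 0xAB 0xCD are just the example bytes chosen in this answer. It takes any std::istream, so it works with a std::ifstream opened in binary mode as well as with an in-memory stream.

```cpp
#include <istream>
#include <sstream>  // only needed for in-memory streams when trying this out
#include <string>

// Returns true if the next two bytes of the stream are 0xAB 0xCD.
// The stream is left positioned just after the bytes that were read.
bool has_magic(std::istream& in) {
    char head[2];
    in.read(head, 2);
    if (in.gcount() != 2) {
        return false;  // the file is shorter than the signature
    }
    return static_cast<unsigned char>(head[0]) == 0xAB &&
           static_cast<unsigned char>(head[1]) == 0xCD;
}
```

The casts to unsigned char matter because plain char may be signed, in which case comparing it directly with 0xAB would never match.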

Most efficient to read file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how to (got it working with std::strings, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records into memory using the fewest I/O function calls, then parsing the data in memory.
For example, reading 5 records with one fread call is more efficient than making 5 fread calls that each read one record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading the file using I/O functions. Profiling will determine which is most efficient.
Fixed length records are always more efficient than variable length records. Variable length records involve either reading until a fixed size is read or reading until a terminal (sentinel) value is found. For example, a text line is a variable record and must be read one byte at a time until the terminating End-Of-Line marker is found. Buffering may help in this case.
What do you mean by binary data? Is it text made of '0' and '1' characters, or "real" binary data? If it is real binary data, just open the file in binary mode. Read the digits of the first integer, then one byte for the '-', then the digits of the next integer, and so on, until you reach the first position of the binary data; then use the file length to read all the rest in one go.
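For the 12-3-125-BINARYDATA layout from the question, a sketch with standard streams might look like this. Record and read_record are names invented here, and it assumes the three numbers are decimal text separated by '-' and that everything after the third '-' is the payload. When reading from disk, open the std::ifstream with std::ios::binary.

```cpp
#include <istream>
#include <iterator>
#include <sstream>  // only needed for in-memory streams when trying this out
#include <string>
#include <vector>

// Hypothetical record type for the layout in the question.
struct Record {
    int a;
    int b;
    int c;
    std::vector<char> payload;
};

// Parses "<int>-<int>-<int>-<raw bytes>" from the stream.
// Returns false if the numeric prefix is malformed.
bool read_record(std::istream& in, Record& rec) {
    char d1 = 0, d2 = 0, d3 = 0;
    if (!(in >> rec.a >> d1 >> rec.b >> d2 >> rec.c >> d3)) {
        return false;
    }
    if (d1 != '-' || d2 != '-' || d3 != '-') {
        return false;
    }
    // Everything that remains is the payload, copied byte for byte
    // without any whitespace skipping.
    rec.payload.assign(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
    return true;
}
```

Reading the payload through std::istreambuf_iterator avoids the formatted extraction that tends to mangle binary data when everything is read into std::string with operator>>.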

Multibyte character constants and bitmap file header type constants

I have some existing code that I've used to write out an image to a bitmap file. One of the lines of code looks like this:
bfh.bfType='MB';
I think I probably copied that from somewhere. One of the other devs says to me "that doesn't look right, isn't it supposed to be 'BM'?" Anyway it does seem to work ok, but on code review it gets refactored to this:
bfh.bfType=*(WORD*)"BM";
A google search indicates that most of the time, the first line seems to be used, while some of the time people will do this:
bfh.bfType=0x4D42;
So what is the difference? How can they all give the correct result? What does the multi-byte character constant mean anyway? Are they the same really?
All three are (probably) equivalent, but for different reasons.
bfh.bfType=0x4D42;
This is the simplest to understand, it just loads bfType with a number that happens to represent ASCII 'M' in bits 8-15 and ASCII 'B' in bits 0-7. If you write this to a stream in little-endian format, then the stream will contain 'B', 'M'.
bfh.bfType='MB';
This is essentially equivalent to the first statement: it's just a different way of expressing an integer constant. The value of a multi-character constant is implementation-defined, so exactly what you get depends on the compiler; common compilers such as GCC and MSVC evaluate 'MB' to 0x4D42. If that is the case and you write the value out on a little-endian machine, you should get 'B', 'M' on the stream.
bfh.bfType=*(WORD*)"BM";
Here, the "BM" causes the compiler to create a block of data that looks like 'B', 'M', '\0' and get a char* pointing to it. This is then cast to WORD* so that when it's dereferenced it will read the memory as a WORD. Hence it reads the 'B', 'M' into bfType in whatever endian-ness the machine has. Writing it out using the same endian-ness will obviously put 'B', 'M' on your stream. So long as you only use bfType to write out to the stream this is the most portable version. However, if you're doing any comparisons/etc with bfType then it's probably best to pick an endian-ness for it and convert as necessary when reading or writing the value.
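To make the 0x4D42 case concrete, here is a small sketch that serialises a 16-bit value in little-endian order explicitly, so the result does not depend on the endianness of the machine running it. The helper name to_little_endian16 is made up for this example.

```cpp
#include <array>
#include <cstdint>

// Split a 16-bit value into two bytes, least significant byte first.
std::array<unsigned char, 2> to_little_endian16(std::uint16_t v) {
    std::array<unsigned char, 2> out{};
    out[0] = static_cast<unsigned char>(v & 0xFF);  // low byte: 0x42 is 'B'
    out[1] = static_cast<unsigned char>(v >> 8);    // high byte: 0x4D is 'M'
    return out;
}
```

Calling to_little_endian16(0x4D42) gives the bytes 'B' then 'M', which is the order the BMP signature needs on disk.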
I did not find the API, but according to http://cboard.cprogramming.com/showthread.php?t=24453, the bfType is a bitmapheader. A value of BM would most likely mean "bitmap".
0x4D42 is a hexadecimal value (0x4D for M and 0x42 for B). Written in little-endian order (least significant byte first), that gives the same bytes as "BM" (not "MB"). If it also works with "MB", then probably some default value is taken.
Addendum to tehvan's post:
From Wikipedia's entry on BMP:
File header
Note that the first two bytes of the BMP file format (thus the BMP header) are stored in big-endian order. This is the magic number 'BM'. All of the other integer values are stored in little-endian format (i.e. least-significant byte first).
So it looks like the refactored code is correct according to the specification.
Have you tried opening the file with 'MB' as the magic number with a few different photo-editors?