Most efficient way to read a file into separate variables using fstream - C++

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how to (got it working with std::strings, but the BINARYDATA part was all messed up).

The most efficient method for reading data is to read many "chunks", or records, into memory using the fewest possible I/O function calls, then parse the data in memory.
For example, reading 5 records with one fread call is more efficient than calling fread 5 times, one record per call. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading the file using I/O functions. Profiling will determine which is the most efficient.
Fixed length records are always more efficient than variable length records. Variable length records involve either reading until a fixed size is read or reading until a terminal (sentinel) value is found. For example, a text line is a variable record and must be read one byte at a time until the terminating End-Of-Line marker is found. Buffering may help in this case.
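For illustration, a rough C++ sketch of the chunked, fixed-length-record approach (the record size, batch size, and file name here are all hypothetical):
#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t kRecordSize = 64;       // hypothetical fixed record length
    const std::size_t kRecordsPerRead = 1024; // many records per I/O call

    std::ifstream in("data.bin", std::ios::binary);
    std::vector<char> buffer(kRecordSize * kRecordsPerRead);

    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           in.gcount() > 0)
    {
        const std::size_t bytes = static_cast<std::size_t>(in.gcount());
        for (std::size_t off = 0; off + kRecordSize <= bytes; off += kRecordSize)
        {
            // parse one record at buffer.data() + off, entirely in memory
        }
    }
}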

What do you mean by binary data? Is it "010101000" stored char by char, or "real" binary data? If it is real binary data, just read the file as a binary file: read 2 bytes for the first int, 1 byte for the '-', 1 byte for the 3, and so on. Once you reach the position where the binary data starts, get the file length and read all of it.
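For instance, a minimal fstream sketch for the format in the question (assuming the header really is ASCII digits separated by single '-' characters, and everything after the third separator is the raw payload; the file name is a placeholder):
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream in("input.dat", std::ios::binary);
    int a = 0, b = 0, c = 0;
    char sep;

    in >> a >> sep >> b >> sep >> c >> sep; // reads "12-3-125-"

    // Everything that remains is the binary payload, byte for byte.
    std::vector<char> data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    std::cout << a << " " << b << " " << c
              << " (" << data.size() << " payload bytes)\n";
}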

Related

Can a file end with a 0x80 char?

I'm implementing my own version of blowfish encoder/decoder. I use a standard padding of 0x80 if necessary.
My question is whether I need to add padding chars even when I don't need them, because in the case of a file that ends naturally with 0x80, the decoding step will remove this character, and in that case it is a wrong action, since the 0x80 is part of the file itself.
This can of course be solved by always adding padding, even if the total number of characters is a multiple of the encoding block (64 bits in this case). I can implement this countermeasure, but first I'd prefer to know if I really need it.
The natural follow-up is to wonder whether this character was chosen because it never occurs in a file (so the wrong situation above never happens), but I'm not sure of that at all.
Thanks! and sorry for the dummy question..
On Linux and other filesystems a file can contain any sequence of bytes, so ideally you cannot depend on any particular byte value to decide where the file ends. (Although EOF is there..!!)
What I am suggesting is what most file formats use:
You can have a specific 4-5 byte magic header for your file format, followed by the size of the rest of the bytes. That way the reader knows exactly where the last byte is.
Edit:
With the above suggestion, the encoder needs to update the size field whenever it appends new data to the file.
If you do not want that, you can split your data into chunks and encode it packet by packet, so your file becomes a sequence of packets. Such schemes are used in NAL units.
Blowfish is a block cipher. It always takes 64 bit input and outputs 64 bit output. If you want to encrypt a stream that is not a multiple of 64 bit long you will need to add some padding bytes. When you decrypt the encrypted stream you always get a multiple of 64 bit. But you have no information if the encrypted stream contained 'real' data or padding bytes. You need to keep track of that yourself. A simple approach would be to store the set of 'data length' and 'encrypted stream'. Another approach would be to prepend the clear text stream with a data length value, for example a 64 bit unsigned integer. Then after decrypting the encrypted stream you will have that length as the first value and then you know how many bytes of the last block are real data and how many are just padding.
And regarding your question about what bytes can be at the end of a file: any. You can have files with arbitrary content. Each byte in the file can be of any value, there is no restriction.
A regular binary file can contain any byte sequence, so a file can end with 0x80, with NUL, or with any other byte.
If you are talking about some specific standard, then it depends. However, I don't think there is any file type that is forbidden from containing some specific character at the end. I do know of file types that ignore as many trailing characters as are not needed (because the header determines the size), so you could do the same, but I have never heard of file data that is illegal in itself (except in cracked files).
So, as mentioned, use a header: reserve for example 8 bytes that determine the size. That is an easy solution.
Also, before asking such a question, you should ask yourself: why should a file end with some special character at all?
The answer is Yes. On every operating system in current use, a file can end with any possible sequence of bytes. In fact, you should generate such a file to test your implementation.
In the general case you cannot recognise trailing padding characters or remove them reliably without knowing the length of the file. Therefore encoding the length of the file must be part of your cryptographic protocol.
Simply put the length of the file at the beginning and encrypt the whole thing, including any padding bytes you like (random is probably best). Once unencrypted you will have the file length to tell you where to truncate.
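A sketch of that scheme (leaving the actual Blowfish step out; the little-endian 64-bit length field and zero padding are just one possible choice):
#include <cstdint>
#include <vector>

// Prepend a 64-bit length, then pad to a multiple of the 8-byte block size.
// Feed the result to the cipher; after decrypting, the first 8 bytes tell you
// how many of the following bytes are real data and how many are padding.
std::vector<unsigned char> frame(const std::vector<unsigned char>& data)
{
    std::vector<unsigned char> out;
    std::uint64_t n = data.size();
    for (int i = 0; i < 8; ++i)              // little-endian length prefix
        out.push_back(static_cast<unsigned char>(n >> (8 * i)));
    out.insert(out.end(), data.begin(), data.end());
    while (out.size() % 8 != 0)              // pad to the 64-bit block size
        out.push_back(0x00);                 // padding value does not matter
    return out;
}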

Writing to a text file, binary vs ascii

So I am having the hardest time trying to understand this concept. I have a program that reads a text file and writes it to another file, replacing the most common words with unsigned chars. But what I cannot for the life of me understand is how I then tell the two apart.
If I write to the new file either the original char I read in or an unsigned char value in the range 1-255, how do I tell the difference when I reverse the process to recover the original file contents?
When you write a file as binary, a number such as 1253553 is written using the bytes of its in-memory representation, typically 4 bytes for an int (the size depends on the platform), rather than as its 7 ASCII digits. So, in a binary file, you will see a sequence of raw bytes representing that number. For chars it makes no difference, as each char is represented in one byte either way.
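To make the difference concrete, here is a small sketch contrasting the two (the raw write assumes a 4-byte int and native byte order; the file names are placeholders):
#include <fstream>

int main()
{
    int n = 1253553;

    std::ofstream txt("number.txt");                   // text: the 7 characters "1253553"
    txt << n;

    std::ofstream bin("number.bin", std::ios::binary); // binary: 4 raw bytes
    bin.write(reinterpret_cast<const char*>(&n), sizeof n);
}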
Usually, you have to have some well known and obvious way to determine the format of your file.
One way to do this is to create your own file extension. You could naively expect that any file with that extension is in your compressed format, but it's actually quite likely other files out there have the same extension (e.g., ".dat" is probably a bad choice). So, you'll want to take further steps, like having the first few bytes of the file be something that is unlikely to be there in any other file (some "magic numbers"). Let's use two bytes, and let's simply choose 0xAB 0xCD as those two bytes.
So, when your program is presented with a file that has the proper extension, open it and read the first two bytes. If they're 0xAB and 0xCD, you can assume you're reading your special format.
This isn't a very strong way of accomplishing this task, but it is one way of doing it. You could get more extravagant if you like.
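A minimal check along those lines, using the 0xAB 0xCD bytes chosen above, might look like:
#include <fstream>

// Returns true if the file starts with the two magic bytes 0xAB 0xCD.
bool hasMagic(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    char m[2];
    if (!in.read(m, 2))
        return false; // unreadable or too short to be our format
    return static_cast<unsigned char>(m[0]) == 0xAB &&
           static_cast<unsigned char>(m[1]) == 0xCD;
}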
For more information, you might want to read the Wikipedia page on the subject. It's a start.

How would I write a class object to a file?

Alright, I have an object:
LivingObject* myPlayer=new LivingObject(...);
And I would like to write it to a file on exit. Here is what I have so far:
std::fstream myWrite;
myWrite.open("Character.dat",std::ios::binary|std::ios::app);
myWrite.write((char*)myPlayer,sizeof(myPlayer));
myWrite.close();
I watched the file when exiting and the size did not increase at all (so I assume it didn't write). What did I do wrong?
This code writes only the first 4 (or 8 on 64-bit) bytes of the object to the file, not the whole object. To write the whole object use:
myWrite.write((char*)myPlayer,sizeof(LivingObject));
As for the size of the file: some operating systems report file size as the space allocated to the file on disk, which is multiple of the physical block size. So as long as the write did not increase beyond the block size, you will not see an increase of the file size.
myPlayer is a pointer to a LivingObject.
In myWrite.write((char*)myPlayer,sizeof(myPlayer)); you're casting the pointer to char*, but sizeof(myPlayer) is the size of the pointer type (usually 4 or 8). So you'd be writing only that many bytes from the start of the object, not the whole object.
So instead, what you'll need to do is to serialize the class, either to a binary packed format, or another format (XML, JSON etc), and write that to the file.
Search the web for "boost serialize". The operation you are performing is called serialization.
If you want to share data between platforms, you will need to either choose a format that is not binary or write down the format; be sure to note which multi-byte quantities are little endian and which are big endian.
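As a rough sketch of writing such a format by hand (the members of LivingObject are invented here, since the real class isn't shown), you would write each field explicitly rather than dumping the raw object:
#include <cstdint>
#include <fstream>
#include <ostream>
#include <string>

// Hypothetical members -- stand-ins for whatever LivingObject really contains.
struct LivingObject
{
    std::int32_t hp;
    std::string  name;
};

void save(const LivingObject& p, std::ostream& out)
{
    // Fixed-width integer; note byte order is still platform-dependent
    out.write(reinterpret_cast<const char*>(&p.hp), sizeof p.hp);
    // Strings get a length prefix followed by their bytes
    std::uint32_t len = static_cast<std::uint32_t>(p.name.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(p.name.data(), len);
}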

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, however I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformated IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records, e.g., length at the end assists with a backspace operation. You can suppress this with the new Stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
I never used sequential access with unformatted output for this exact reason. However it depends on the application and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using keyword argument ACCESS = 'DIRECT', in which case you need to specify record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). Following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal approach in the sense of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method over the padding in sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also beginning and end markers for that record. Both record markers hold the size of that record. This is the reason why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) produces a different file size from writing each real in a separate write (multiple records, each with one begin marker, one real, and one end marker). This is extremely important if you are writing large matrices, as you could balloon the file size if they are written improperly.
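If you ever need to consume such a sequential file from another language, the record structure can be read directly; here is a C++ sketch assuming 4-byte, native-endian record markers (gfortran's default):
#include <cstddef>
#include <cstdint>
#include <istream>
#include <vector>

// Reads one Fortran sequential unformatted record:
// [4-byte length] [payload] [4-byte length]
std::vector<char> readRecord(std::istream& in)
{
    std::int32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);  // leading marker
    std::vector<char> payload(len > 0 ? static_cast<std::size_t>(len) : 0);
    if (!payload.empty())
        in.read(payload.data(), static_cast<std::streamsize>(payload.size()));
    in.read(reinterpret_cast<char*>(&len), sizeof len);  // trailing marker
    return payload;
}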
Regarding Fortran unformatted IO: I am quite familiar with the differing outputs of the Intel and Gnu compilers. Fortunately my vast experience dating back to 1970s IBMs allowed me to decode things. Gnu pads records with 4-byte integer counters giving the record length. Intel uses a 1-byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled with the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the 'old' IBM layout) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects results in a speedup of up to 100 times. Whether you want to bother with detection and conversion depends on your file size. I reduced my program's startup time (which opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add options to convert back again if the user wants to take the file back to an Intel-compiled program. It's all a pain, but there you go.

Binary file write problem in C++

This is my function which creates a binary file
void writefile()
{
    ofstream myfile("data.abc", ios::out | ios::binary);
    streamoff offset = 1;
    if (myfile.is_open())
    {
        char c = 'A';
        myfile.write(&c, offset);
        c = 'B';
        myfile.write(&c, offset);
        c = 'C';
        myfile.write(&c, offset);
        myfile.write(StartAddr, streamoff(16));
        myfile.close();
    }
    else
        cout << "Some error" << endl;
}
The value of StartAddr is 1000, hence the expected output file is:
A B C 1000 NUL NUL NUL
However, strangely my output file appends this: data.abc
So the final outcome is: A B C 1000 NUL NUL NUL data.abc
Please help me out with this. How do I deal with it? Why this strange behavior?
I recommend you quit binary writing and write the data in a textual format instead. You've already encountered some of the problems with writing binary data, and there are more issues ahead with reading it back and with portability. Expect more pain if you continue down this route.
Use textual representations. For simplicity you can put one field per line and use std::getline to read it in. The textual representation allows you to view the data in any text editor, easily. Try using Notepad to view a binary file!
Oh, but binary data is soo much faster and takes up less space in the file. You've already wasted more time and money on this than you would ever gain back by using binary data. The speed of computers and huge memory capacities (disk and RAM) make binary representations a thing of the past (except in extreme cases).
As a learning tool, go ahead and use binary. For ease of development and quick schedules (IOW, finishing early), use textual representations.
Search Stack Overflow for "C++ micro optimization" for the justifications.
There are several issues with this code.
For starters, if you want to write individual characters to a stream, you don't need to use ostream::write. Instead, just use ostream::put, as shown here:
myfile.put('A');
Second, if you want to write out a string into a file stream, just use the stream insertion operator:
myfile << StartAddr;
This is perfectly safe, even in binary mode.
As for the particular problem you're reporting, I think that the issue is that you're trying to write out a string of length four (StartAddr), but you've told the stream to write out sixteen bytes. This means that you're writing out the four bytes for the string contents, then the null terminator, and then nine bytes of whatever happens to be in memory after the buffer. In your case, this is two more null bytes, then the meaningless text that you saw after that. To fix this, either change your code to write fewer bytes or, if StartAddr is a string, then just write it using <<.
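Putting those two fixes together, a corrected version of the function might look like this (assuming StartAddr is a std::string holding "1000", since its declaration isn't shown in the question):
#include <fstream>
#include <iostream>
#include <string>

void writefile()
{
    const std::string StartAddr = "1000"; // assumption: the question doesn't show its type
    std::ofstream myfile("data.abc", std::ios::out | std::ios::binary);
    if (!myfile.is_open())
    {
        std::cout << "Some error" << std::endl;
        return;
    }
    myfile.put('A');     // put() writes exactly one character
    myfile.put('B');
    myfile.put('C');
    myfile << StartAddr; // << writes exactly the string's characters, no stray bytes
}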
With the line myfile.write(StartAddr,streamoff (16) ); you are instructing the myfile object to write 16 bytes to the stream starting at the address StartAddr. Imagine that StartAddr is an array of 16 bytes:
char StartAddr[16] = "1000\0\0\0data.abc";
myfile.write(StartAddr, sizeof(StartAddr));
Would generate the output that you see. Without seeing the declaration / definition of StartAddr I cannot say for certain, but it appears you are writing out a five-byte nul-terminated string "1000" followed by whatever happens to reside in the next 11 bytes after StartAddr. In this case, it appears a couple of nul bytes followed by the constant nul-terminated string "data.abc" (which the compiler must put somewhere in memory) are what follow StartAddr.
Regardless, it is clear that you overread a buffer.
If you are trying to write a 16 bit integer type to a stream you have a couple of options, both based on the fact that there are typically 8 bits in a byte. The 'cleanest' one would be something like:
char x = (StartAddr & 0xFF);
myfile.put(x);
x = (StartAddr >> 8);
myfile.put(x);
This assumes StartAddr is a 16 bit integer type and does not take into account any translation that might occur (such as potential conversion of a value of 10 [a linefeed] into a carriage return / linefeed sequence).
Alternatively, you could write something like:
myfile.write(reinterpret_cast<char*>(&StartAddr), sizeof(StartAddr));