Reading and writing class objects to binary file


I would like to know what happens when I write:
object.write((char*)&class_object, sizeof(class_object));
// or
object.read((char*)&class_object, sizeof(class_object));
From what I read so far, the class_object is converted to a pointer. But I don't know how it manages to convert data carried by the object into binary. What does the binary actually represent?
I am a beginner.
EDIT
Could you please explain what really happens when we write the above piece of code? I mean, what actually happens when we write (char*)*S, say where S is the object of a class that I have declared?

Imagine it this way: the class instance is just a chunk of memory sitting in your RAM. If you convert a pointer to your class instance into a char pointer:
SomeClass someClassInstance;
char* data = reinterpret_cast<char*>(&someClassInstance);
It will point to the same data in your memory but it will be treated as a byte array in your program.
If you convert it back:
SomeClass* instance = reinterpret_cast<SomeClass*>(data);
It will be treated as the class again.
So in order to write your class to a file and later reconstruct it, you can just write the data to some file which will be sizeof(SomeClass) in size and later read the file and convert the raw bytes to the class instance.
However, keep in mind that you can only do this if your class is POD (Plain Old Data)!
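To make the round trip concrete, here is a minimal sketch. It assumes a made-up Point type that contains only fixed-size scalar members; the names Point, write_raw, read_raw and round_trip are mine, not from the question. It uses a stringstream so it can run without touching the disk, but an fstream opened with ios::binary is used the same way.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Hypothetical example type: only fixed-size scalar members, no pointers.
struct Point {
    int x;
    int y;
    double weight;
};

// Stop the build if someone later adds a std::string, a virtual function, etc.
static_assert(std::is_trivially_copyable<Point>::value,
              "Point must be trivially copyable to be dumped as raw bytes");

// Write the object's bytes exactly as they sit in memory.
void write_raw(std::ostream& out, const Point& p) {
    out.write(reinterpret_cast<const char*>(&p), sizeof(Point));
}

// Read sizeof(Point) bytes back into an existing object.
// Returns false if the stream did not contain enough bytes.
bool read_raw(std::istream& in, Point& p) {
    in.read(reinterpret_cast<char*>(&p), sizeof(Point));
    return in.gcount() == static_cast<std::streamsize>(sizeof(Point));
}

// Write a Point into an in-memory stream and read it back.
Point round_trip(const Point& original) {
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    write_raw(buffer, original);
    Point restored{0, 0, 0.0};
    read_raw(buffer, restored);
    return restored;
}
```

The bytes produced this way are only meaningful to a program built with the same compiler, options and architecture.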

In practice, your code won't work and is likely to yield undefined behavior, at least when your class or struct is not a POD (plain old data) and contains pointers or virtual functions (so has some vtable).
The binary file would contain the bit representation of your object, and this is not portable to another computer, or even to another process running the same program (notably because of ASLR) unless your object is a POD.
See also this answer to a very similar question.
You probably want some serialization. Since disks and file accesses are a lot slower than the CPU (often by a factor of tens of thousands), it is often wise to use some more portable data representation. Practically speaking, you should consider some textual representation such as JSON, XML or YAML. Libraries such as jsoncpp are really easy to use; you'll need to code something to transform your object into JSON, and to create an object from JSON.
Remember also that data is often more costly and more precious than code. The point is that you often want some old data (written by a previous version of your program) to be read by a newer version of your program. And that might not be trivial (e.g. if you have added or changed the type of some field in your class).
You could also read about dynamic software updating. It is an interesting research subject. Be aware of databases.
Read also about parsing techniques, notably about recursive descent parsers. They are relevant.

Related

Save objects from vector with pointers to file

I have a vector of pointers to a base class object so I can manage objects derived from that class.
vector <Product*> products;
I am trying to write these objects to a file while iterating through the vector,
but I am not sure if this works correctly.
void Inventory :: saveProductsToFile()
{
ofstream outfile;
outfile.open("inventory.dat",ios::binary);
vector <Product*> :: iterator it;
for(it=products.begin(); it!=products.end(); it++)
outfile.write((char*)*(it),sizeof(Product));
}
The file is created but I have no idea if I'm saving the actual objects themselves or their
addresses. Is this correct or is there another way?
This is how the file looks like:
ˆFG " H*c \Âõ(œ##pFG h*c b'v b#

Your code can't work. You cannot serialize polymorphic objects in
that way. For starters, you're writing the hidden vptr out
to disk; when you reread the data, it will not be valid. And
you're only writing out the data in the base class (Product),
because that's what sizeof(Product) evaluates to. And
finally, just writing a byte image of anything but a char[]
will probably mean that you won't be able to reread the data
some time in the future (after a compiler upgrade, or a machine
upgrade, or whatever).
What you have to do is to define a format (binary or text) for
the file, and write that. For the basic types, you can start
with something existing, like XDR or Protocol buffers, but
neither of these work that well with polymorphic types. For
polymorphic types, you have to start by defining how you
identify the type in question when rereading. This can be
tricky: there's nothing in std::type_info which helps, so you
need some means of establishing a relationship between your
(derived) types and the identifier. Then every derived class
must implement a write function, which first writes its type,
then writes its data out, one element by one. When reading, you
read the type, look up the appropriate read function for that
type in a map, and call that function, which then reads the data
one by one.
Finally, I might point out that all successful serialization
schemes I've seen depend on generated code. You describe your
types in a separate file, or in special markup (in a specially
marked comment in the C++), and have a program which reads that,
and generates the necessary code (and often the actual classes
you use).
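The "write the type first, then look up a read function in a map" scheme described above could be sketched roughly as follows. This is a hand-written illustration, not generated code; the Book and Tool classes, their fields, the text tags and the one-record-per-line layout are all assumptions made for the example.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical hierarchy; the names and fields are illustrative only.
class Product {
public:
    virtual ~Product() = default;
    virtual std::string tag() const = 0;                 // identifies the dynamic type
    virtual void write_fields(std::ostream& out) const = 0;
    virtual void read_fields(std::istream& in) = 0;
};

class Book : public Product {
public:
    int pages = 0;
    std::string tag() const override { return "book"; }
    void write_fields(std::ostream& out) const override { out << pages; }
    void read_fields(std::istream& in) override { in >> pages; }
};

class Tool : public Product {
public:
    double weight_kg = 0.0;
    int warranty_months = 0;
    std::string tag() const override { return "tool"; }
    void write_fields(std::ostream& out) const override {
        out << weight_kg << ' ' << warranty_months;
    }
    void read_fields(std::istream& in) override { in >> weight_kg >> warranty_months; }
};

// Map from type tag to a factory that creates an empty object of that type.
using Factory = std::function<std::unique_ptr<Product>()>;
const std::map<std::string, Factory>& registry() {
    static const std::map<std::string, Factory> r = {
        {"book", [] { return std::unique_ptr<Product>(new Book); }},
        {"tool", [] { return std::unique_ptr<Product>(new Tool); }},
    };
    return r;
}

// One record per line: the tag first, then the fields.
void save(std::ostream& out, const Product& p) {
    out << p.tag() << ' ';
    p.write_fields(out);
    out << '\n';
}

// Returns nullptr on end of input, an unknown tag, or malformed fields.
std::unique_ptr<Product> load(std::istream& in) {
    std::string tag;
    if (!(in >> tag)) return nullptr;
    auto it = registry().find(tag);
    if (it == registry().end()) return nullptr;
    std::unique_ptr<Product> p = it->second();
    p->read_fields(in);
    if (!in) return nullptr;
    return p;
}
```

Every new derived class needs both a tag and a registry entry, which is exactly the kind of repetitive bookkeeping that code generators are usually asked to take over.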

That's not how you "serialize" data. Done like this, the pointers are only valid at runtime, or until you delete them (whichever comes first). You wouldn't be able to restore your data, because once the program has stopped, everything that was in its memory becomes invalid. You have to store the actual values from your class.

Are classes guaranteed to have the same organization in memory between program runs?

I'm attempting to implement a Save/Load feature into my small game. To accomplish this I have a central class that stores all the important variables of the game such as position, etc. I then save this class as binary data to a file. Then simply load it back for the loading function. This seems to work MOST of the time, but if I change certain things then try to do a save/load the program will crash with memory access violations. So, are classes guaranteed to have the same structure in memory on every run of the program or can the data be arranged at random like a struct?
Response to Jesus - I mean the data inside the class, so that if I save the class to disk, when I load it back, will everything fit nicely back.
Save
fout.write((char*) &game, sizeof(Game));
Load
fin.read((char*) &game, sizeof(Game));

Your approach is extremely fragile. With many restrictions, it can work. These restrictions are not worth subjecting your users (or yourself!) to in typical cases.
Some Restrictions:
Never refer to external memory (e.g. a pointer or reference)
Forbid ABI changes/differences. Common case: memory layout and natural alignment on 32 vs 64 will vary. The user will need a new 'game' for each ABI.
Not endian compatible.
Altering your type's layouts will break your game. Changing your compiler options can do this.
You're basically limited to POD data.
Use offsets instead of pointers to refer to internal data (This reference would be in contiguous memory).
Therefore, you can safely use this approach in extremely limited situations -- that typically applies only to components of a system, rather than the entire state of the game.
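As an illustration of the "offsets instead of pointers" restriction, here is a small sketch. The Node type and the helper names are invented for the example; memcpy into a byte buffer stands in for the raw file write and read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical linked-list node that refers to its successor by index
// into the same array, not by address. -1 marks the end of the list.
struct Node {
    std::int32_t value;
    std::int32_t next;
};
static_assert(std::is_trivially_copyable<Node>::value, "Node must be raw-copyable");

// Copy the nodes into a byte buffer, as a raw file write would.
std::vector<char> to_bytes(const std::vector<Node>& nodes) {
    std::vector<char> bytes(nodes.size() * sizeof(Node));
    if (!bytes.empty()) std::memcpy(bytes.data(), nodes.data(), bytes.size());
    return bytes;
}

// Rebuild the nodes from the bytes, as a raw file read would.
std::vector<Node> from_bytes(const std::vector<char>& bytes) {
    std::vector<Node> nodes(bytes.size() / sizeof(Node));
    if (!nodes.empty()) std::memcpy(nodes.data(), bytes.data(), nodes.size() * sizeof(Node));
    return nodes;
}

// Follow the index links starting at node `head` and sum the values.
// Assumes the links do not form a cycle.
int sum_list(const std::vector<Node>& nodes, std::int32_t head) {
    int total = 0;
    for (std::int32_t i = head;
         i >= 0 && static_cast<std::size_t>(i) < nodes.size();
         i = nodes[i].next) {
        total += nodes[i].value;
    }
    return total;
}
```

Because the links are indices, they stay valid wherever the array ends up in memory after loading. The ABI and endianness restrictions listed above still apply.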
Since this is tagged C++, "boost - Serialization" would be a good starting point. It's well tested and abstracts many of the complexities for you.

Even if this would work, just don't do it. Define a file format at the byte-level and write sensible 'convert to file format' and 'convert from file format' functions. You'll actually know the format of the file. You'll be able to extend it. Newer versions of the program will be able to read files from older versions. And you'll be able to update your platform, build tools, and classes without fear of causing your program to crash.
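A byte-level format with explicit "convert to" and "convert from" functions might look roughly like this. The GameState fields, the version number and the choice of little-endian 32-bit fields are assumptions made for the sketch, not anything taken from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical game state; the fields are illustrative only.
struct GameState {
    std::uint32_t level;
    std::int32_t player_x;
    std::int32_t player_y;
};

const std::uint32_t kFormatVersion = 1;  // assumed version number

// Append a 32-bit value as four bytes, least significant byte first,
// regardless of the host CPU's endianness.
void put_u32(std::string& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Read four bytes starting at pos; on success pos is advanced past them.
bool get_u32(const std::string& in, std::size_t& pos, std::uint32_t& v) {
    if (in.size() < 4 || pos > in.size() - 4) return false;
    v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    pos += 4;
    return true;
}

// File layout: version, level, player_x, player_y (4 bytes each).
std::string to_file_format(const GameState& g) {
    std::string out;
    put_u32(out, kFormatVersion);
    put_u32(out, g.level);
    put_u32(out, static_cast<std::uint32_t>(g.player_x));
    put_u32(out, static_cast<std::uint32_t>(g.player_y));
    return out;
}

// Returns false on truncated input or an unknown version; g is left untouched then.
bool from_file_format(const std::string& in, GameState& g) {
    std::size_t pos = 0;
    std::uint32_t version = 0, level = 0, x = 0, y = 0;
    if (!get_u32(in, pos, version) || version != kFormatVersion) return false;
    if (!get_u32(in, pos, level) || !get_u32(in, pos, x) || !get_u32(in, pos, y)) return false;
    g.level = level;
    g.player_x = static_cast<std::int32_t>(x);  // assumes two's complement
    g.player_y = static_cast<std::int32_t>(y);
    return true;
}
```

The resulting string can be written to a file opened in binary mode. A later version of the program can branch on the version field instead of rejecting older files.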

Yes, classes and structures will have the same layout in memory every time your program runs, although I can't say if the standard enforces this. The machine code generated by C++ compilers uses "hard-coded" offsets to access type fields, so they are fixed. Realistically, the layout will only change if you modify the C++ class definition (field sizes, order, virtual methods, etc.), compile with a different compiler or change compiler options.
As long as the type is POD and without pointer fields, it should be safe to simply dump it to a file and read it back with the exact same program. However, because of the above-mentioned concerns, this approach is quite inflexible with regard to versioning and interoperability.
[edit]
To respond to your own edit, do not do this with your "Game" object! It almost certainly has pointers to other objects, and those objects will no longer exist in memory, or will be somewhere else, when you reload your file.

You might want to take a look at this.
Classes are not guaranteed to have the same structure in memory as pointers can point to different locations in memory each time a class is created.
However, without posting code it is difficult to say with certainty where the problem is.

Is it possible to write a truly generic disk-backed B+Tree implementation?

I wrote a generic in-memory B+Tree implementation in C++ some time ago, and I'm thinking about making it persistent on disk (which is what B+Trees were designed for in the first place).
My first thought was to use mmap (I'm on Linux) so that I can manipulate the file as normal memory, overload operator new for my node classes so that it returns pointers into the mapped region, and create a smart pointer that converts RAM addresses to file offsets to link my nodes to each other.
But I want my implementation to be generic, so the user can store an int, a std::string, or whatever custom class they want in the B+Tree.
That's where the problem occurs: for primitive types or aggregate types that do not contain pointers it all works, but as soon as the object contains a pointer or reference to a heap-allocated object, this approach no longer works.
So my question is: is there some known way to overcome this difficulty? My own searches on the topic were unsuccessful, but maybe I missed something.

As far as I know, there are three (somewhat) easy ways to solve this.
Approach 1: write a std::streambuf that points to some pre-allocated memory.
This approach allows you to use operator<< and use whatever existing code already exists to get a string representation of what you want.
Pro: re-use loads of existing code.
Con: no control over how operator<< spits out content.
Con: text-based representations only.
Approach 2: write your own (many times overloaded) output function.
Pro: can come up with binary representation.
Pro: exact control over every single output format.
Con: re-write so many output functions... writing overloads for new types is a pain for clients because they shouldn't write functions that fall in your library's namespace... unless you resort to Koenig (argument-dependent) lookup!
Approach 3: write a btree_traits<> template.
Pro: can come up with binary representation.
Pro: exact control over every single output format.
Pro: more control over output and format than a function; may contain metadata and all.
Con: still requires you / your library's users to write lots of custom overloads.
Pro: have the btree_traits<> default to use operator<< unless someone overrides the traits?
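Approach 3 with the operator<< default could be sketched like this. The member names write and read and the length-prefixed string layout are my own assumptions; the answer only names the btree_traits<> template itself.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical traits template; a B+Tree would call these to store and load items.
// The primary template falls back to the type's stream operators.
template <typename T>
struct btree_traits {
    static void write(std::ostream& out, const T& value) { out << value << ' '; }
    static bool read(std::istream& in, T& value) { return static_cast<bool>(in >> value); }
};

// Specialization for std::string: a length prefix, then the raw characters,
// so that strings containing spaces survive the round trip.
template <>
struct btree_traits<std::string> {
    static void write(std::ostream& out, const std::string& value) {
        out << value.size() << ' ';
        out.write(value.data(), static_cast<std::streamsize>(value.size()));
    }
    static bool read(std::istream& in, std::string& value) {
        std::size_t length = 0;
        if (!(in >> length)) return false;
        in.get();  // skip the single separator after the length
        value.assign(length, '\0');  // note: trusts the stored length
        in.read(&value[0], static_cast<std::streamsize>(length));
        return in.gcount() == static_cast<std::streamsize>(length);
    }
};
```

A user with a custom class would either provide operator<< and operator>> for it or specialize btree_traits for that class, without touching the tree code.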

You cannot write a truly generic and transparent version, because if a pointer inside a non-trivial item was allocated with malloc (or new / new[]), the data it points to already lives on the heap, outside the mapped file.
A non-transparent solution is to serialize the class, and this can be done relatively easily. Before you store the class you'd have to call the serialization function, and after pulling it back out you'd call the deserialization function. Boost has good serialization features that you could make work with your B+Tree.

Handling pointers and references in a generic way means you will need to inspect the type of the structure you're trying to store, and its fields. C++ is a language not known for its reflectiveness.
But even in a language with powerful reflection, a generic solution to this problem is difficult. You might be able to get it to work for a subset of types in higher level languages like Python, Ruby, etc. A related and more powerful paradigm is the persistent programming language.
The function you want is usually implemented by delegating responsibility for writing the data block to the target type itself. It's called serialization. It simply means writing an interface with a method to dump data, and a method to load data. Any class that wants to be persisted in your B-tree then simply implements this interface.

Can you write a polymorphic class to disk and survive?

Firstly, I know that writing a class to disk is bad, but you should see some of our other code. D:
My question is: can I write a polymorphic class to disk and then read it in later and not get undefined behaviour? I am going to guess not because of vtables (I think these are generated at runtime and unique to the object?)
I.e.
class A {
public:
virtual ~A() {}
virtual void foo() = 0;
};
class B : public A {
public:
virtual ~B() {}
virtual void foo() {}
};
A * a = new B;
fwrite( a, 1, sizeof( B ), fp );
delete a;
a = new B;
fread( a, 1, sizeof( B ), fp );
a->foo();
delete a;
Thank-you!

I'll suggest you to take a look at Boost Serialization.
we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context. Depending on the context, this might used implement object persistence, remote parameter passing or other facility.

You might be able to get away with it if such objects are always read back during the same execution of the program that wrote them (though I really don't recommend it). But if the data in the file must persist between different executions of the program, then using the raw bytes of the in-memory objects will almost certainly lead to significant problems.
Each vtable itself is generated at compile time and stored somewhere in the resulting executable. What each object instance contains is just a pointer to the appropriate vtable, and that pointer does not change for the lifetime of any given object. (Multiple inheritance can be a little more complicated, but for this discussion those details aren't relevant. The pointers are still constant.)
So if an object has a vtable pointer and you write the raw bytes of that object to disk, then the vtable pointer is written to disk as well. If you then read back those bytes during the same execution of the program and push them into an appropriate object, it may work since the vtable will still be in the same location and thus the vtable pointer will still be correct.
(However note that everything I just explained there is an implementation detail. While many compilers typically implement virtual functions in that manner, I don't think any of the exact details are guaranteed by the C++ standard. So there could be additional potential problems.)
Now, if this might be possible, why not store such objects for longer durations? Because you have no guarantee that any particular virtual table will be in the same memory location.
Some operating systems may change the memory layout for each execution of the same program. I don't know whether or not this actually affects virtual table locations, but that's certainly a serious risk.
Furthermore, if you ever compile a new version of the program, the location of each virtual table is completely up to the whims of the compiler. Changes to seemingly unrelated parts of the code may cause the compiler to place the relevant virtual tables in different locations. Obviously, that happening would completely break this scheme. And you have no way to prevent it from happening.
(And beyond the vtables, what if new data members need to be added to those objects in subsequent versions of the program? You might have to deal with reading past versions of raw objects' bytes into new versions that have new members or a different layout of members. That can get complicated and ugly as well as error prone.)
Now, even if you only intend to store the objects temporarily for each execution of the program, I still don't think it's a good idea. You are highly restricted as to what kinds of variables these objects can contain. No smart objects (std::string, std::vector, etc). No pointers to memory allocated per each object. Any strings must therefore be stored in raw character arrays. Other dynamic allocation would have to be turned into fixed members or member arrays. That means you lose a lot of C++'s benefits everywhere these objects are used.
Furthermore, these objects and the scheme of writing this directly to disk would need to be accompanied by comments and documentation warning of all the dangers I've described. Otherwise, some future programmer might unknowingly decide to add the wrong kind of data member. Or even worse, they might decide to try storing such objects longer than the execution of the program, opening them up to serious crashes and failures that might not happen until much later in the future (and probably at the worst possible time).
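Besides comments, part of that warning can be enforced by the compiler. The sketch below uses std::is_trivially_copyable as a guard; the type and function names are invented for the example. Note the limits of this check: it rejects std::string members and virtual functions, but it does not reject plain raw pointers, so it is a safety net and no proof of correctness.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>

// Generic raw copy that only compiles for types whose bytes can be copied safely.
template <typename T>
void copy_raw_bytes(const T& object, char* destination) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T has virtual functions or non-trivial members; "
                  "serialize it field by field instead");
    std::memcpy(destination, &object, sizeof(T));
}

// Hypothetical record that satisfies the restrictions: a fixed-size
// character array instead of std::string, no pointers, no virtual functions.
struct PlainRecord {
    int id;
    char name[16];
};

struct WithString { int id; std::string name; };           // owns heap memory
struct WithVirtual { virtual ~WithVirtual() {} int id; };  // carries a vtable pointer

static_assert(std::is_trivially_copyable<PlainRecord>::value, "fine to dump");
static_assert(!std::is_trivially_copyable<WithString>::value, "would be rejected");
static_assert(!std::is_trivially_copyable<WithVirtual>::value, "would be rejected");
```

With this in place, a later programmer who adds a std::string member to PlainRecord gets a compile error at the call to copy_raw_bytes instead of a crash at some later time.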
In the end, I strongly suggest using a scheme that stores the data in a format specifically intended for the file. As someone else already mentioned, Boost Serialization is a good option. If not that, there may be other usable serialization libraries. Or else depending on your needs, you may be able to roll your own mechanism without too much trouble.

The problem is not the vtable. It is stored per class type, not per instance, so you won't write it to file. Basically your code should work (haven't tried it).
However, you should keep in mind that reading pointers/handles from file does not work.

C++ Storing objects in a file

I have a list of objects that I would like to store in a file as small as possible for later retrieval. I have been carefully reading this tutorial, and am beginning (I think) to understand, but have several questions. Here is the snippet I am working with:
static bool writeHistory(string fileName)
{
fstream historyFile;
historyFile.open(fileName.c_str(), ios::binary);
if (historyFile.good())
{
list<Referral>::iterator i;
for(i = AllReferrals.begin();
i != AllReferrals.end();
i++)
{
historyFile.write((char*)&(*i),sizeof(Referral));
}
return true;
} else return false;
}
Now, this is adapted from the snippet
file.write((char*)&object,sizeof(className));
taken from the tutorial. Now what I believe it is doing is converting the object to a pointer, taking the value and size and writing that to the file. But if it is doing this, why bother doing the conversions at all? Why not take the value from the beginning? And why does it need the size? Furthermore, from my understanding then, why does
historyFile.write((char*)i,sizeof(Referral));
not compile? i is an iterator (and isn't an iterator a pointer?). or simply
historyFile.write(i,sizeof(Referral));
Why do i need to be messing around with addresses anyway? Aren't I storing the data in the file? If the addresses/values are persisting on their own, why can't i just store the addresses deliminated in plain text and than take their values later?
And should I still be using the .txt extension? < edit> what should I use instead then? I tried .dtb and was not able to create the file. < /edit> I actually can't even seem to get file to open without errors with the ios::binary flag. I'm also having trouble passing the filename (as a string class string, converted back by c_str(), it compiles but gives an error).
Sorry for so many little questions, but it all basically sums up to how to efficiently store objects in a file?

What you are trying to do is called serialization. Boost has a very good library for doing this.
What you are trying to do can work, in some cases, with some very important conditions. It will only work for POD types. It is only guaranteed to work for code compiled with the same version of the compiler, and with the same arguments.
(char*)&(*i)
says to take the iterator i, dereference it to get your object, take the address of it and treat it as an array of characters. This is the start of what is being written to the file. sizeof(Referral) is the number of bytes that will be written out.
And no, an iterator is not necessarily a pointer, although pointers meet all the requirements for an iterator.

Question #1 why does ... not compile?
Answer: Because i is not a Referral*; it's a list<Referral>::iterator. An iterator is an abstraction over a pointer, but it's not a pointer.
Question #2 should I still be using the .txt extension?
Answer: probably not. Many systems associate .txt with the MIME type text/plain.
Unasked Question: does this work?
Answer: if a Referral has any pointers in it, NO. When you try to read the Referrals from the file, the pointers will be pointing to the location in memory where something used to live, but there is no guarantee that there is anything valid there anymore, least of all the thing that the pointers were pointing to originally. Be careful.

isn't an iterator a pointer?
An iterator is something that acts like a pointer from the outside. In most (perhaps all) cases, it is actually some form of object instead of a bare pointer. An iterator might contain a pointer as an internal member variable that it uses to perform its job, but it just as well might contain something else or additional variables if necessary.
Furthermore, even if an iterator has a simple pointer inside of it, it might not point directly at the object you're interested in. It might point to some kind of bookkeeping component used by the container class which it can then use to get the actual object of interest. Fortunately, we don't need to care what those internal details actually are.
So with that in mind, here's what's going on in (char*)&(*i).
*i returns a reference to the object stored in the list.
& takes the address of that object, thus yielding a pointer to the object.
(char*) casts that object pointer into a char pointer.
That snippet of code would be the short form of doing something like this:
Referral& r = *i;
Referral* pr = &r;
char* pc = (char*)pr;
Why do i need to be messing around
with addresses anyway?
And why does it need the size?
fstream::write is designed to write a series of bytes to a file. It doesn't know anything about what those bytes mean. You give it an address so that it can write the bytes that exist starting wherever that address points to. You give it a size so that it knows how many bytes to write.
So if I do:
MyClass ExampleObject;
file.write((char*)&ExampleObject, sizeof(ExampleObject));
Then it writes all the bytes that exist directly within ExampleObject to the file.
Note: As others have mentioned, if the object you want to write has members that dynamically allocate memory or otherwise make use of pointers, then the pointed to memory will not be written by a single simple fstream::write call.
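A small experiment makes that note visible. In this sketch (the Label type and the helper name are made up) an ostringstream stands in for the file, so the written bytes can be inspected directly.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical type holding a pointer to text that lives elsewhere in memory.
struct Label {
    int id;
    const char* text;
};

// Dump the object's own bytes into a string, the way fstream::write would
// dump them into a file.
std::string raw_bytes_of(const Label& label) {
    std::ostringstream out(std::ios::binary);
    out.write(reinterpret_cast<const char*>(&label), sizeof(Label));
    return out.str();
}
```

Calling raw_bytes_of on a Label whose text points at a long string yields exactly sizeof(Label) bytes: the id, the pointer value, and any padding. The characters of the string itself are not in the output.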
will serialization give a significant boost in storage efficiency?
In theory, binary data can often be both smaller than plain-text and faster to read and write. In practice, unless you're dealing with very large amounts of data, you'll probably never notice the difference. Hard drives are large and processors are fast these days.
And efficiency isn't the only thing to consider:
Binary data is harder to examine, debug, and modify if necessary. At least without additional tools, but even then plain-text is still usually easier.
If your data files are going to persist between different versions of your program, then what happens if you need to change the layout of your objects? It can be irritating to write code so that a version 2 program can read objects in a version 1 file. Furthermore, unless you take action ahead of time (like by writing a version number in to the file) then a version 1 program reading a version 2 file is likely to have serious problems.
Will you ever need to validate the data? For instance, against corruption or against malicious changes. In a binary scheme like this, you'd need to write extra code. Whereas when using plain-text the conversion routines can often help fill the role of validation.
Of course, a good serialization library can help out with some of these issues. And so could a good plain-text format library (for instance, a library for XML). If you're still learning, then I'd suggest trying out both ways to get a feel for how they work and what might do best for your purposes.
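For comparison, a plain-text version with a version number and some validation might look roughly like this. The real Referral class is not shown in the question, so the fields, the "referral-file" header and the version number here are all assumptions; the sketch also assumes that names contain no newline.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical record; stands in for the Referral class from the question.
struct Referral {
    std::string name;
    int visits = 0;
};

const int kVersion = 2;  // assumed current version of the text format

// First line: format name and version. Then one field per line.
void save_text(std::ostream& out, const Referral& r) {
    out << "referral-file " << kVersion << '\n'
        << r.name << '\n'
        << r.visits << '\n';
}

// Returns false on an unknown header, a newer version, or invalid field values.
// r is only modified when the whole record was read successfully.
bool load_text(std::istream& in, Referral& r) {
    std::string magic;
    int version = 0;
    if (!(in >> magic >> version) || magic != "referral-file") return false;
    if (version < 1 || version > kVersion) return false;  // e.g. file from a newer program
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // rest of header line
    Referral parsed;
    if (!std::getline(in, parsed.name) || parsed.name.empty()) return false;
    if (!(in >> parsed.visits) || parsed.visits < 0) return false;
    r = parsed;
    return true;
}
```

The file can be opened in any text editor, and the parsing code doubles as a basic check against corrupted or hand-edited input.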

What you are trying to do (reading and writing raw memory to/from file) will invoke undefined behaviour, will break for anything that isn't a plain-old-data type, and the files that are generated will be platform dependent, compiler dependent and probably even dependent on compiler settings.
C++ doesn't have any built-in way of serializing complex data. However, there are libraries that you might find useful. For example:
http://www.boost.org/doc/libs/1_40_0/libs/serialization/doc/index.html

Have you already had a look at boost::serialization? It is robust, has good documentation, supports versioning, and makes it easier to switch to an XML format instead of a binary one if you ever want to.

fstream::write simply writes raw data to a file. The first parameter is a pointer to the starting address of the data. The second parameter is the length (in bytes) of the object, so write knows how many bytes to write.
file.write((char*)&object,sizeof(className));
                  ^
This line is converting the address of object to a char pointer.
historyFile.write((char*)i,sizeof(Referral));
                         ^
This line is trying to convert an object (i) into a char pointer (not valid)
historyFile.write(i,sizeof(Referral));
                  ^
This line is passing write an object, when it expects a char pointer.
