Why Serialization when a class object in memory is already binary (C/C++)?

My guess is that the data is scattered in physical memory (even though the data of a class object is contiguous in virtual memory), so in order to send it correctly it needs to be reassembled, and to send it over the network, one additional step is the transformation of host byte order to network byte order. Is that correct?

Proper serialization can be used to send data to arbitrary systems, that might not work under the same architecture as the source host.
Even an object that consists only of native types can be troublesome to share between two systems, because of the extra padding that might exist between and after members, among other things. Sharing raw memory dumps of objects between programs compiled for the same architecture, but with different compiler versions, can also turn into a big hassle. There is no guarantee of how a variable of type T is actually stored in memory.
If you are not working with pointers (references included), and the data is meant to be read back by the same binary that dumped it, it's usually safe just to dump a raw struct to disk. But when sending data to another host... drum roll... serialization is the way to go.
I've heard developers talk about ntohl / htonl / ntohs / htons as methods of serializing/deserializing integers, and when you think about it, saying that isn't far from the truth.
The word "serialization" is often used to describe this "complicated method of storing data in a generic way", but then again; your first programming assignment where you were asked to save information about Dogs to file (hopefully*) made use of serialization, in some way or another.
* "hopefully" meaning that you didn't dump the raw memory representation of your Dog object to disk

Pointers!
If you've allocated memory on the heap, you'll just end up with a serialized pointer pointing to an arbitrary area of memory. If you just have a few ints and chars, then yes, you can write them out directly to a file, but that becomes platform dependent because of the byte ordering you mentioned.
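For the byte-ordering part, a common pattern is to convert each integer to network order and copy its bytes explicitly; a small sketch, assuming a POSIX system for htonl/ntohl:

#include <arpa/inet.h>  // htonl/ntohl (POSIX; on Windows use winsock2.h)
#include <cstdint>
#include <cstring>

// Illustrative helpers, not a full protocol.
void write_u32(char* out, std::uint32_t value) {
    std::uint32_t be = htonl(value);    // host order -> network (big-endian) order
    std::memcpy(out, &be, sizeof(be));  // copy the bytes, not the object
}

std::uint32_t read_u32(const char* in) {
    std::uint32_t be;
    std::memcpy(&be, in, sizeof(be));
    return ntohl(be);                   // network order -> host order
}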

Pointers and data packing (alignment)
If you memcpy your object's memory, there is a danger of copying a wild pointer value instead of its data. There is another risk: if the sender and receiver use different data packing (alignment) rules, you will get garbage after decoding.
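To make the wild-pointer risk concrete, here is an illustrative sketch (the struct is invented for the example):

#include <cstring>
#include <string>

struct Session {
    int id;
    std::string* name;  // heap pointer: meaningless in another process
};

// This copies the *pointer value*, not the pointed-to string. Decoded on
// another host (or after the string is freed), 'name' is a wild pointer.
void unsafe_dump(char* buffer, const Session& s) {
    std::memcpy(buffer, &s, sizeof(Session));
}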

Binary representations may differ between architectures, compilers, and even different versions of the same compiler. There's no guarantee that what system A sees as a signed integer will be seen as the same on system B. Byte ordering, word lengths, struct padding, etc. will become hard-to-debug problems if you don't properly define the protocol or file format for exchanging the data.

A class (when we speak of C++) may also include virtual method table pointers, and those must be reconstructed on the receiving end.
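You can observe the hidden pointer by comparing sizes; a minimal sketch (the exact numbers are implementation-specific):

#include <cstdio>

struct Plain   { int x; };
struct Virtual { int x; virtual ~Virtual() {} };  // adds a hidden vtable pointer

int main() {
    // Typical 64-bit output: 4 and 16 (int + vtable pointer + padding).
    // The vtable pointer is process-specific and cannot be shipped as-is.
    std::printf("%zu %zu\n", sizeof(Plain), sizeof(Virtual));
}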

Related

How can one safely serialize std::basic_istream<char>::pos_type?

In one of my projects I have to cache positional information about certain data chunks found in large files. I've already implemented a small API built around std::basic_istream<char>::pos_type values stored in maps.
Now I need to serialize these descriptors into a byte stream and write them to disk for later use (on other *nix machines as well). I have read that this type is platform-dependent, while still being more or less a POD type. So my questions are:
Would it be better to save something besides just the offsets, e.g. an std::fpos<std::mbstate_t> keeping the state of the read structure?
How can I safely obtain and restore the offset data from std::basic_istream<char>::pos_type (and other info, if it is needed)?
Thank you in advance.
The structure of std::fpos<mbstate_t> is unspecified, and there may be non-trivial state in the mbstate_t. You certainly can't portably serialize these objects. You can, however, obtain a value of the offset type (std::streamoff), which is an integer type whose value can be serialized.
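In practice that means converting the position to std::streamoff and serializing that integer; a minimal sketch, assuming a fixed-width 64-bit encoding on the wire (the function names are made up):

#include <cstdint>
#include <ios>
#include <istream>

// Persist a stream position as a plain integer.
std::int64_t save_pos(std::istream& in) {
    std::streamoff off = static_cast<std::streamoff>(in.tellg());
    return static_cast<std::int64_t>(off);  // feed this to your integer codec
}

// Restore it later (possibly on another machine).
void restore_pos(std::istream& in, std::int64_t stored) {
    in.seekg(static_cast<std::streamoff>(stored), std::ios::beg);
}

Note that this loses any mbstate_t conversion state, which is fine for plain byte streams but not for stateful multibyte encodings.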

Is it sufficient that a class is POD to read it from binary?

For a client/server application I need to send and receive C++ objects. I don't need the corresponding classes to do anything fancy, but I want maximal performance (regarding network traffic and computation). So I thought of simply transferring them as binary strings. Basically, I want to be able to do the following:
//Create original object
MyClass oldObj;  // note: "MyClass oldObj();" would declare a function, not an object
//Save to char array
char* save = new char[sizeof(MyClass)];
memcpy(save, &oldObj, sizeof(MyClass));
//Somewhere of course there would be the transfer to the client/server
//Read back from char array
MyClass newObj;
memcpy(&newObj, save, sizeof(MyClass));
delete[] save;
My question: what does my class need to fulfill in order for this to work?
Naturally, pointers as members won't work when transferring to another application. But is it sufficient that my class is considered POD (in C++03 and/or C++11) and does not have any pointers or equivalents (like STL containers) as members?
Both machines need to:
Have the same endianness (for int)
Have the same floating-point representation (for double)
Have the same size for all types
Use the same compiler
Use the same flags to build the application
Pointers don't transfer well.
BUT the network is going to be the slowest part here.
The cost of serializing most objects is going to be irrelevant compared to the cost of transfer. Of course, the bigger your object, the higher the cost, but it takes a while before it's significant enough to make a dent.
The higher maintenance cost is also something you should factor in.
What does my class need to fulfill in order for this to work?
It must not have pointer members; you already mentioned that.
It must not have members whose size is implementation-defined, like int.
It must not have integer members, due to different endianness.
It must not have floating point members, due to different representations.
...and probably more!
Basically, you cannot do that except for very particularly constrained scenarios. You will have to pick a protocol and make your data conform to it to send it through the network safely.
It's not a big deal, since performance will be bounded by network speed and latency, not by the operations needed to make your values conform to the protocol.
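As a sketch of what "conforming to a protocol" can look like: fixed-width fields written byte by byte in a defined order, independent of the host's padding, endianness, or type sizes (the Message layout here is invented for illustration):

#include <cstdint>
#include <vector>

void put_u32_be(std::vector<unsigned char>& out, std::uint32_t v) {
    out.push_back(static_cast<unsigned char>(v >> 24));  // most significant byte first
    out.push_back(static_cast<unsigned char>(v >> 16));
    out.push_back(static_cast<unsigned char>(v >> 8));
    out.push_back(static_cast<unsigned char>(v));
}

struct Message { std::uint32_t id; std::uint32_t length; };

std::vector<unsigned char> encode(const Message& m) {
    std::vector<unsigned char> wire;
    put_u32_be(wire, m.id);      // field order is fixed by the protocol,
    put_u32_be(wire, m.length);  // not by the compiler's struct layout
    return wire;
}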
How much control do you have over the hardware/OS that this runs on? Are you writing code that is super-portable, or will it ONLY run on 32- and 64-bit x86 Windows [for example]?
To be fully "super-portable", as explained above, you can't have any form of "implementation-defined" sized objects (such as int, which can be 16, 18, 32, 36 or 64 bits, for example). Such items need to be stored as a defined number of bytes in a defined order, to make sure nothing gets cut off or re-ordered in transfer. Floating point can be even worse...
A lot of "super-portable" applications store their data as text. It's a little slower, but it makes the data trivially portable, since text is just a stream of bytes whatever architecture you run it on, and it's ordered the same way whichever machine you use (as long as you stick to 0-9, A-Za-z, !?<>,.()*& and a few other characters; and beware of EBCDIC-encoded machines, though they tend to handle "ascii-to-ebcdic" conversion). The other end just needs to convert the text back to strings/integers/floats/doubles, whatever you need.
A conversion from integer to a string of digits takes one divide per digit (using hex or base-36 makes that a bit better, but makes it much less human-readable; sometimes a good thing, sometimes a bad thing). This is clearly slower than storing 4 bytes. The other drawback is that, depending on the values used, a number is often longer to store as text than as binary, so your network packets will be a little larger. This will have a greater impact than the conversion, as processors can do a lot of math in the time it takes to send 1KB with a 10Gbit network card.
And of course, you need a few extra bytes (spaces, commas, newlines or whatever it may be) so that you can tell the difference between one number 123456 and three numbers 12, 34, 56. [Of course, no need to use ", " between each]. And you need some code to parse the whole thing at the other end once it has arrived.
If you know that your system(s) always have 32-bit integers and IEEE-754 floating point numbers [these are extremely common!], then you may well get away with just worrying about byte order. And if you know that it's always going to be on "x86" or some such, you don't have to worry about byte order either. But you may then have to modify your code when you decide that "running my code on an iPhone would be a good idea". Of course, you could leave that to the iPhone side of things, to conform to whatever the rest requires.
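For completeness, the text approach described above can be as simple as streaming values with delimiters; a hedged sketch with made-up fields:

#include <sstream>
#include <string>

std::string encode(int id, double ratio) {
    std::ostringstream out;
    out << id << ' ' << ratio << '\n';  // the space keeps two numbers from running together
    return out.str();
}

bool decode(const std::string& line, int& id, double& ratio) {
    std::istringstream in(line);
    return static_cast<bool>(in >> id >> ratio);  // false if the text doesn't parse
}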
Other answers have mentioned how it is possible to use a class for this purpose. Personally, I prefer to use a struct instead. In C++, a struct can have member methods/operators, a constructor/destructor, supports inheritance, etc., just like a class does. However, a struct has a well-defined and predictable memory layout, and can have that layout explicitly aligned via #pragma statements to add/remove the compiler's implicit padding (I have never tried aligning a class before, but I think it is supported). I always use an 8-bit-aligned struct for data that has to be exchanged outside of the app's process. For all intents and purposes, in modern compilers, a struct is basically identical to a class; just the default visibility of its members is public instead of private. But I like to keep struct and class separated for different purposes: a struct is just a raw container of data that you can freely manipulate, overwrite in memory, etc., whereas a class is an object whose memory layout and padding are compiler-defined and should not be messed with.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is D actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste. If I'm just wrapping a few value types, then the class significantly bloats the program, specifically if I am using a lot of them. A char becomes 8 times larger, while an int becomes 3 times as large.
A struct's minimum size is 1 byte.
In D, objects have a header containing 2 pointers (so it may be 8 bytes or 16, depending on your architecture).
The first pointer is the virtual method table. This is an array generated by the compiler, filled with function pointers, so that virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not certain that this field will stay here forever, because D emphasizes local storage and immutability, which make synchronization on many objects useless. As this field is older than those features, it is still here and can be used; however, it may disappear in the future.
Such a header on objects is very common; you'll find the same in Java or C#, for instance. You can look here for more information: http://dlang.org/abi.html
D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").
I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
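A rough sketch of that idea in C++ terms (the answer discusses D, but the technique is language-agnostic; all names here are invented): handles are plain integers, so they carry no per-object header.

#include <cstdint>
#include <vector>

struct Particle { float x, y, vx, vy; };

class ParticlePool {
    std::vector<Particle> items_;
    std::vector<std::uint32_t> free_;  // indices handed back by release()
public:
    std::uint32_t allocate() {
        if (!free_.empty()) {
            std::uint32_t i = free_.back();  // reuse a released slot
            free_.pop_back();
            return i;
        }
        items_.push_back(Particle{});
        return static_cast<std::uint32_t>(items_.size() - 1);
    }
    // Caller must release each index exactly once and never use it afterwards.
    void release(std::uint32_t i) { free_.push_back(i); }
    Particle& operator[](std::uint32_t i) { return items_[i]; }
};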
If you would not be able to readily determine when the last reference to an object is going to go out of scope, eight bytes would be a very reasonable level of overhead. I would expect that most frameworks would force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte would push the size to nine rather than twelve). If a system is going to have a garbage collector that works better than a Commodore 64's(*), it will need an absolute minimum of a bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will need every object to either include space for a supplemental-information pointer, or include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.
You should look into the storage requirements for various types. Every instruction and storage allocation (i.e. variable/object, etc.) uses up a specific amount of space. In C#, an Int32-type integer object stores integer information to the tune of 4 bytes (32 bits). It might have other information, too, because it is an object, but your character data type probably only requires 1 byte of information. If you have constructs like for or while in your class, those will take up space, too, because each of them tells your class to do something. The class itself requires a number of instructions to be created in memory, which would account for the 8 initial bytes.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

Memory padding issue

I am working on a sample application in which I am serializing some data. In the client application I read the serialized data back. While doing this I observed some strange behavior.
In the sample application the size of the object is different from the size of the data in the client. I think this is because of memory padding. My problem is that I am trying to write a BRUSHOBJ to a file. This structure is defined by Microsoft; I can't change its declaration. Please let me know how to solve this problem.
Please let me know how to apply memory padding to a standard data type.
It sounds like you're trying to just cast the address of a struct to char*, and use ostream::write on it. This simply doesn't work. There's padding, but there's also the size of different types (which varies from one platform to the next), byte order, and on some more exotic platforms (including most mainframes) the data representation itself. Generally, you need a specification of what the output data should look like, byte by byte, and you then have to write each byte with the required value.
And this is just for simple types. A quick glance at BRUSHOBJ shows that it contains a pointer, which you'll probably have to follow; you'll certainly have to do something with it, since the receiving end won't be able to do anything with a pointer into your data. (I suspect, given the description, that you'll have to convert it into some sort of identifier, and also transmit a dictionary mapping such identifiers to objects. But I don't know enough about how this structure is used to be sure.)
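A minimal sketch of that byte-by-byte approach, with the pointer replaced by an agreed-upon identifier ('rbrush_id' and the field order are assumptions for illustration, not a documented BRUSHOBJ wire format):

#include <cstdint>
#include <ostream>

// Write a 32-bit value most-significant byte first, per the (assumed) spec.
void write_u32_be(std::ostream& out, std::uint32_t v) {
    out.put(static_cast<char>(v >> 24));
    out.put(static_cast<char>(v >> 16));
    out.put(static_cast<char>(v >> 8));
    out.put(static_cast<char>(v));
}

void write_brush(std::ostream& out, std::uint32_t solid_color,
                 std::uint32_t rbrush_id, std::uint32_t color_type) {
    write_u32_be(out, solid_color);  // iSolidColor
    write_u32_be(out, rbrush_id);    // stands in for the pvRbrush pointer
    write_u32_be(out, color_type);   // flColorType
}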
You have two options:
serialize the data
modify the memory padding via #pragma pack
Serializing data has no relation to memory padding; you are just defining a way to write memory to, and read it back from, a memory location (the memory stream).
I see that the _BRUSHOBJ struct has the following definition:
typedef struct _BRUSHOBJ {
    ULONG iSolidColor;
    PVOID pvRbrush;
    FLONG flColorType;
} BRUSHOBJ;
Please note that sending a pointer across processes is nonsense; serializing a pointer should be done by writing the size of the memory and then the memory itself. Anyway, if you then pass this BRUSHOBJ to a Windows function you can get undefined behavior. It's not a supported/documented way of passing a BRUSHOBJ across processes.
Memory padding can be applied like this:
#pragma pack(push)
#pragma pack(4)
struct myStruct
{
    char Char1;
    int Int1;
};
#pragma pack(pop)
If you want to modify padding, you should do it for a structure that is written by you.

How to interpret binary data in C++?

I am sending and receiving binary data to/from a device in packets (64 byte). The data has a specific format, parts of which vary with different request / response.
Now I am designing an interpreter for the received data. Simply reading the data by position is OK, but it doesn't look that cool when I have a dozen different response formats. I am currently thinking about creating a few structs for that purpose, but I don't know how that will go with padding.
Maybe there's a better way?
Related:
Safe, efficient way to access unaligned data in a network packet from C
You need to use structs and/or unions. You'll need to make sure your data is properly packed on both sides of the connection, and you may want to translate to and from network byte order on each end if there is any chance that either side of the connection could be running with a different endianness.
As an example:
#pragma pack(push) /* push current alignment to stack */
#pragma pack(1)    /* set alignment to 1 byte boundary */
typedef struct {
    unsigned int packetID;    // identifies packet in one direction
    unsigned int data_length;
    char receipt_flag;        // indicates whether to ack the packet or keep sending it till acked
    char data[];              // typically ASCII string data with \n-terminated fields, but could also be binary
} tPacketBuffer;
#pragma pack(pop)  /* restore original alignment from stack */
and then when assigning:
packetBuffer.packetID = htonl(123456);
and then when receiving:
packetBuffer.packetID = ntohl(packetBuffer.packetID);
Here are some discussions of Endianness and Alignment and Structure Packing
If you don't pack the structure, it'll end up aligned to word boundaries, and the internal layout of the structure and its size will be incorrect.
I've done this innumerable times before: it's a very common scenario. There's a number of things which I virtually always do.
Don't worry too much about making it the most efficient thing available.
If we do wind up spending a lot of time packing and unpacking packets, then we can always change it to be more efficient. Whilst I've not encountered a case where I've had to as yet, I've not been implementing network routers!
Whilst using structs/unions is the most efficient approach in terms of runtime, it comes with a number of complications: convincing your compiler to pack the structs/unions to match the octet structure of the packets you need, working to avoid alignment and endianness issues, and a lack of safety, since there is little or no opportunity to do sanity checks on debug builds.
I often wind up with an architecture including the following kinds of things:
A packet base class. Any common data fields are accessible (but not modifiable). If the data isn't stored in a packed format, then there's a virtual function which will produce a packed packet.
A number of presentation classes for specific packet types, derived from common packet type. If we're using a packing function, then each presentation class must implement it.
Anything which can be inferred from the specific type of the presentation class (i.e. a packet type id from a common data field), is dealt with as part of initialisation and is otherwise unmodifiable.
Each presentation class can be constructed from an unpacked packet, or will gracefully fail if the packet data is invalid for that type. This can then be wrapped up in a factory for convenience.
If we don't have RTTI available, we can get "poor-man's RTTI" using the packet id to determine which specific presentation class an object really is.
In all of this, it's possible (even if just for debug builds) to verify that each modifiable field is being set to a sane value. Whilst it might seem like a lot of work, it makes it very difficult to produce an invalidly formatted packet, and a pre-packed packet's contents can easily be checked by eye in a debugger (since it's all in normal platform-native variables).
If we do have to implement a more efficient storage scheme, that too can be wrapped in this abstraction with little additional performance cost.
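A bare-bones sketch of that architecture (the names, packet layout, and type-id convention are all invented for illustration):

#include <cstdint>
#include <stdexcept>
#include <vector>

class Packet {
public:
    virtual ~Packet() {}
    std::uint8_t type() const { return type_; }          // common field, read-only
    virtual std::vector<std::uint8_t> pack() const = 0;  // produce the wire format
protected:
    explicit Packet(std::uint8_t type) : type_(type) {}
private:
    std::uint8_t type_;
};

class HeartbeatPacket : public Packet {
public:
    // Construct from unpacked bytes; fail gracefully on invalid data.
    explicit HeartbeatPacket(const std::vector<std::uint8_t>& raw) : Packet(0x01) {
        if (raw.empty() || raw[0] != 0x01)
            throw std::runtime_error("not a heartbeat packet");
    }
    std::vector<std::uint8_t> pack() const override {
        return std::vector<std::uint8_t>{type()};
    }
};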
It's hard to say what the best solution is without knowing the exact format(s) of the data. Have you considered using unions?
I agree with Wuggy. You can also use code generation to do this. Use a simple data-definition file to define all your packet types, then run a Python script over it to generate prototype structures and serialization/deserialization functions for each one.
This is an "out-of-the-box" solution, but I'd suggest to take a look at the Python construct library.
Construct is a python library for parsing and building of data structures (binary or textual). It is based on the concept of defining data structures in a declarative manner, rather than procedural code: more complex constructs are composed of a hierarchy of simpler ones. It's the first library that makes parsing fun, instead of the usual headache it is today.
construct is very robust and powerful, and just reading the tutorial will help you understand the problem better. The author also has plans for auto-generating C code from the definitions, so it's definitely worth the effort to read about.