Command pattern serialization in C++

I want to do the following in C++:
create a command object
serialize it
(send it to another computer)
deserialize
execute
Two cases:
sender and receiver are both Windows 7 computers
sender is *nix and receiver is Windows 7
I found a tutorial for serialization: http://www.functionx.com/cpp/articles/serialization.htm. Is this the way to go? In Python I could do:
import io
import pickle

def setAndPackCommand(self, obj):
    outFile = io.BytesIO()
    pickle.dump(obj, outFile)          # this packs obj into outFile
    stringToSend = outFile.getvalue()  # the pickled bytes to send
    return stringToSend

def unpackAndExecute(self, stringToReceive):
    inFile = io.BytesIO(stringToReceive)
    receivedC = pickle.load(inFile)    # rebuilds the command object
    receivedC.execute()
In this code the main points are pickle.dump and pickle.load. What are the C++ counterparts? Wikipedia says that C++ does not support serialization. What is the link above, then?
What does binary serialization mean? Does it mean that memory is dumped to disk and that deserialization needs exactly the same kind of computer (no cross-platform transfers)?
br,
Juha

I would also recommend using a stable library like boost.serialization for serializing data.
If you are new to serialization: it means transforming objects into a data representation suitable for transmission or storage, and rebuilding them from that representation. This is not very difficult for so-called PODs (Plain Old Data objects): you can transmit the buffer as data and cast it back after the transfer, as long as you take care of data alignment and byte ordering (endianness). It becomes more complicated when objects reference other objects, and then it really makes sense to use a well-designed library. Boost's serialization also supports versioning, so you can update your format and keep readers and writers upward and backward compatible (with some effort, of course).
Here is a good introduction.

To briefly answer your questions, Wikipedia is right - C++ doesn't natively support serialisation. That doesn't mean that you can't roll your own solution though as demonstrated in the article you linked.
Binary serialisation refers to writing an object to a binary file format. Contrast to (for example), XML serialisation where the object is written to an XML-based format: in the former, you get a binary file where (for example) an int consists of 4 bytes of raw binary data. In the latter, you might get an int tag with a name attribute and its contents being the text value of the integer, such as <int name="myInt">12345</int>.
The big advantage of binary serialisation is (in most cases) that it is very compact, and very simple to convert to/from the object on the target platform. The downside is that, as you've suggested, it tends to be very machine-specific and so not very useful in your situation. Issues such as byte ordering and field alignment tend to vary from platform to platform, and so a binary format where those are not taken into account will most likely not be portable. That said, you can add code to take into account these differences, but it does increase the complexity of the solution.
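To make the alignment and byte-order point concrete, here is a minimal sketch. The Sample struct and the encodeSample/decodeSample helpers are made up for this illustration. Instead of copying the struct's memory, it writes each field explicitly in a fixed (little-endian) order, so the output does not depend on the compiler's padding or on the CPU's byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record used only for this illustration.
struct Sample {
    std::uint8_t  kind;   // 1 byte
    std::uint32_t value;  // 4 bytes; compilers usually insert padding before it
};

// Append a 32-bit value as 4 bytes, least significant byte first.
// The shifts operate on the value, not on memory, so the result is the
// same on big- and little-endian machines.
void appendU32LE(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFFu));
    }
}

// Encode field by field: always exactly 5 bytes, whatever sizeof(Sample) is.
std::vector<std::uint8_t> encodeSample(const Sample& s) {
    std::vector<std::uint8_t> out;
    out.push_back(s.kind);
    appendU32LE(out, s.value);
    return out;
}

// Decode the 5-byte layout produced by encodeSample.
Sample decodeSample(const std::vector<std::uint8_t>& in) {
    assert(in.size() == 5);
    Sample s;
    s.kind = in[0];
    s.value = 0;
    for (int i = 0; i < 4; ++i) {
        s.value |= static_cast<std::uint32_t>(in[1 + i]) << (8 * i);
    }
    return s;
}
```

On common compilers sizeof(Sample) is typically 8 rather than 5. The extra bytes are the padding that a raw memory dump would carry along and that another platform might lay out differently.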
The alternatives (text-based serialisation, XML serialisation, etc) have the advantage of generally being more cross platform and easier to edit by hand, but are generally less compact than the binary approach.
For the reasons outlined above, I'd avoid binary serialisation in your case (if possible), and go with a text based approach. An article I like is this one, which describes (among other things) an approach to serialisation that allows you to easily specify any kind of serialisation method you prefer to use.
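As a rough sketch of what a text-based approach could look like for the command pattern in the question: each command writes a type tag followed by its state, and the receiver uses the tag to decide which class to rebuild. All names here (Command, AddCommand, EchoCommand, loadCommand) and the "tag arg arg" wire format are invented for the example. This is not the method from the linked article or from any library.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical command interface for this sketch.
class Command {
public:
    virtual ~Command() = default;
    // Run the command, writing its result to `out`.
    virtual void execute(std::ostream& out) const = 0;
    // Write a type tag followed by the command's state as plain text.
    virtual void save(std::ostream& os) const = 0;
};

class AddCommand : public Command {
public:
    AddCommand(int a, int b) : a_(a), b_(b) {}
    void execute(std::ostream& out) const override { out << (a_ + b_); }
    void save(std::ostream& os) const override { os << "add " << a_ << ' ' << b_; }
private:
    int a_;
    int b_;
};

class EchoCommand : public Command {
public:
    explicit EchoCommand(std::string word) : word_(std::move(word)) {}
    void execute(std::ostream& out) const override { out << word_; }
    void save(std::ostream& os) const override { os << "echo " << word_; }
private:
    std::string word_;  // a single word without whitespace in this sketch
};

// Rebuild a command from its text form.
// Returns nullptr for an unknown tag or malformed arguments.
std::unique_ptr<Command> loadCommand(const std::string& text) {
    std::istringstream is(text);
    std::string tag;
    is >> tag;
    if (tag == "add") {
        int a = 0, b = 0;
        if (is >> a >> b) return std::unique_ptr<Command>(new AddCommand(a, b));
    } else if (tag == "echo") {
        std::string word;
        if (is >> word) return std::unique_ptr<Command>(new EchoCommand(word));
    }
    return nullptr;
}
```

The sender would call save() into a std::ostringstream and transmit the resulting string. The receiver passes that string to loadCommand() and calls execute() on the result. Because the wire format is plain text, it reads the same on Windows and *nix.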

Of course a C++ program can do serialization, just not out of the box. Check out the Boost.Serialization library or Google's protocol buffers. The latter implements fast and portable cross-platform (binary) serialization, but it requires the use of a code generator.
(The tutorial you link to demonstrates a very simplistic, unportable approach to serialization. It also demonstrates very well how not to handle strings in C++.)

Related

Is the main purpose of object serialization in C++ for faster object loading?

I am reading code for a project written by others. The main task of the project is to read contents from a large structured text file (.txt) with 8 columns into a KnowledgeBase object, which has a number of methods and variables. The KnowledgeBase object is then output into a binary file. For example, the KnowledgeBase class has at least these two variables:
map<string, pair<string, string>> key_info
vector<ObjectInfo> objects
...
These variables are easy to understand when I track the code with gdb. Then, it seems it is converting such vectors and maps into binary forms. And the two variables above have their corresponding binary forms:
BinaryKeyInfo *bkeys
BinaryObjectInfo *bObjects
Later on when outputting to binary file, it has such code:
fwrite((char*)(&wcount),sizeof(int32_t),1,output);
fwrite((char*)bkeys,sizeof(KeyInfo_t),wcount,output);
The converting code from the original KnowledgeBase to binary is complicated. My question is, what's the main purpose of this conversion? Is it for faster loading of binary file into memory than plain text file? The plain text file is large. I learnt that object serialization is primarily for transmitting objects over the net, but I don't think the purpose here is for that. It is more like for speeding up data loading and memory saving. Could that be part of object serialization in C++?
Is the main purpose of object serialization in C++ for faster object loading?
No. The most important purpose of serialisation is to transform the state of the program into a format that can be stored on the filesystem, or that can be communicated across a network, and that can be de-serialised back. Often, the purpose of either is for another program to do the de-serialisation. Sometimes the de-serialiser is another instance of the same program.
The speed of de-serialisation is one metric that can be used to gauge whether one particular serialisation format is a good one. The ability to quickly undo what you have done is not the reason why you do it in the first place.
what's the benefit of converting them into binary vectors or maps?
As I mention above, the benefit of serialisation is the ability to store the serialised data on the filesystem, or to send it over a network.
what's the benefit of plain text files vs. binary files?
Pros of text serialisation format:
Humans are able to read and write plain text. Humans are generally not able to read or write binary files.
It's generally easier to implement a plain text format de-/serialiser in a way that works across differing computers than it is to implement a binary format de-/serialiser that achieves the same.
Pros of binary serialisation format:
Typically faster and uses less storage and bandwidth.
Can be easier to implement if there is no need for communication between differing systems. This is typically only true in very simple cases. (Furthermore, there usually is a need for cross-system compatibility, even if the need hasn't been realised yet.)
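For reference, the layout produced by the two fwrite calls in the question (a 32-bit count followed by a contiguous array of fixed-size records) can be sketched as below. KeyRecord is a made-up stand-in for KeyInfo_t, whose real definition is not shown in the question, and the function names are hypothetical. Because the records are written in their in-memory layout, the file is only guaranteed to be readable by a program built for the same platform and compiler settings, which is exactly the portability trade-off described above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-size record standing in for the question's KeyInfo_t.
struct KeyRecord {
    char key[16];        // NUL-padded; longer keys are truncated to 15 chars
    std::int32_t value;
};

// Flatten one (string, int) entry into a fixed-size record.
KeyRecord makeRecord(const std::string& key, std::int32_t value) {
    KeyRecord r;
    std::memset(&r, 0, sizeof r);
    std::strncpy(r.key, key.c_str(), sizeof r.key - 1);
    r.value = value;
    return r;
}

// Write a 32-bit count followed by all records as one contiguous block.
bool writeRecords(const char* path, const std::vector<KeyRecord>& records) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const std::int32_t count = static_cast<std::int32_t>(records.size());
    bool ok = std::fwrite(&count, sizeof count, 1, f) == 1;
    if (ok && count > 0) {
        ok = std::fwrite(records.data(), sizeof(KeyRecord), records.size(), f)
             == records.size();
    }
    return std::fclose(f) == 0 && ok;
}

// Read the count, size the vector once, then read all records in one call.
bool readRecords(const char* path, std::vector<KeyRecord>& records) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::int32_t count = 0;
    bool ok = std::fread(&count, sizeof count, 1, f) == 1 && count >= 0;
    if (ok) {
        records.resize(static_cast<std::size_t>(count));
        if (count > 0) {
            ok = std::fread(records.data(), sizeof(KeyRecord), records.size(), f)
                 == records.size();
        }
    }
    std::fclose(f);
    return ok;
}
```

Loading is a single bulk read with no parsing, which is the usual reason such a flat binary form is faster to load than a large text file.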

Most common way to deal with endianness and files in C++

I started out just reading/writing 8-bit integers to files using chars. It was not very long before I realized that I needed to be able to work with more than just 256 possible values. I did some research on how to read/write 16-bit integers to files and became aware of the concept of big and little endian. I did even more research and found a few different ways to deal with endianness and I also learned some ways to write endianness-independent code. My overall conclusion was that I have to first check if the system I am using is using big or little endian, change the endianness depending on what type the system is using, and then work with the values.
The one thing I have not been able to find is the best/most common way to deal with endianness when reading/writing to files in C++ (no networking). So how should I go about doing this? To help clarify, I am asking for the best way to read/write 16/32-bit integers to files between big and little endian systems. Because I am concerned about the endianness between different systems, I would also like a cross-platform solution.
The most common way is simply to pass your in-memory values through htons() or htonl() before writing them to the file, and also pass the read data through ntohs() or ntohl() after reading it back from the file. (htons()/ntohs() handle 16-bit values, htonl()/ntohl() handle 32-bit values)
When compiled for a big-endian CPU, these functions are no-ops (they just return the value you passed in to them verbatim), so the values will get written to the file in big-endian format. When compiled for a little-endian CPU, these functions endian-swap the passed-in value and return the swapped version, so again the values will get written to the file in big-endian format.
That way the values in the file are always stored in big-endian format, and they always get converted to/from the appropriate (CPU-native) format when being transferred to/from memory. This is the simplest way to do it (since you don't have to write or debug any conditional logic), and the most common (these functions are available on just about all platforms: they are declared in <arpa/inet.h> on POSIX systems and in <winsock2.h> on Windows).
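A minimal sketch of the same "always big-endian in the file" convention follows. Note that it does not call htons()/htonl() itself: it uses shifts on the value, which yield the same big-endian byte order without needing a platform-specific header. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write a 16-bit value, most significant byte first.
void writeU16BE(std::ostream& os, std::uint16_t v) {
    const unsigned char b[2] = {
        static_cast<unsigned char>(v >> 8),
        static_cast<unsigned char>(v)
    };
    os.write(reinterpret_cast<const char*>(b), sizeof b);
}

// Write a 32-bit value, most significant byte first.
void writeU32BE(std::ostream& os, std::uint32_t v) {
    const unsigned char b[4] = {
        static_cast<unsigned char>(v >> 24),
        static_cast<unsigned char>(v >> 16),
        static_cast<unsigned char>(v >> 8),
        static_cast<unsigned char>(v)
    };
    os.write(reinterpret_cast<const char*>(b), sizeof b);
}

// Read a 16-bit big-endian value; returns false if the stream runs short.
bool readU16BE(std::istream& is, std::uint16_t& v) {
    unsigned char b[2];
    if (!is.read(reinterpret_cast<char*>(b), sizeof b)) return false;
    v = static_cast<std::uint16_t>((b[0] << 8) | b[1]);
    return true;
}

// Read a 32-bit big-endian value; returns false if the stream runs short.
bool readU32BE(std::istream& is, std::uint32_t& v) {
    unsigned char b[4];
    if (!is.read(reinterpret_cast<char*>(b), sizeof b)) return false;
    v = (static_cast<std::uint32_t>(b[0]) << 24) |
        (static_cast<std::uint32_t>(b[1]) << 16) |
        (static_cast<std::uint32_t>(b[2]) << 8)  |
         static_cast<std::uint32_t>(b[3]);
    return true;
}
```

The functions take std::ostream/std::istream, so they work with a std::ofstream or std::ifstream opened with std::ios::binary, or with a std::stringstream when testing. No check of the host's endianness is needed, because the code never looks at how the integer is laid out in memory.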
In practice, a good habit is to avoid binary data when exchanging data between computers and to prefer text files and textual protocols. You could use textual formats like JSON, YAML, XML, etc. (or sometimes invent your own). There are many C++ libraries for them, e.g. jsoncpp.
Textual data is indeed more verbose (takes more disk space) and slightly slower to parse (but the disk I/O is often the bottleneck, not the CPU time "wasted" in parsing or encoding formats like JSON) but is much easier to work on.
Read also about serialization. You'll find lots of libraries doing that (using some "common" well defined data format such as XDR or ASN1). Many file formats contain some header describing the concrete encoding. The elf(5) format is a good example of that.
Be aware that most of the time the data is more valuable (economically) than the software working on it. So it is very important to document very well how your data is organized in files.
Consider also using databases. Sometimes simply using sqlite with tables containing JSON is very effective.
PS. Without an actual real world case, your question is too broad, and has no meaningful universal answer. There is no single best way!
Basile, I agree that there is no universal answer.
In my world, embedded real-time systems, using a text representation is blasphemy. Textual representations such as JSON are at least 2 orders of magnitude slower than binary representations. That may be fine for the web, but it makes a difference when you have to process several kilobytes of data per second (to handle voice, for instance) across DSPs and GPPs.
For a more in-depth discussion of this topic, check out chapter 7 of the ZeroMQ book.

Strings vs binary for storing variables inside the file format

We aim at using HDF5 for our data format. HDF5 has been selected because it is a hierarchical filesystem-like cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which are not made up by large amounts of data), considering also file versioning issues and the efforts to build the library. Parameters inside the HDF5 could be stored as either (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have a variable named Polygon holding the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
We have some idea, but it would be great to have inputs from people who have already worked with something similar. More precisely, could you please list pro/cons of A and B and also say under what circumstances which would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an hdf5 library, I assume both serializing and parsing are equivalent human-effort.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and maintenance costs. You'll be able to sed, grep, and even edit the data in excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128 bit little endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This will either mean engineering effort, or having the same libraries available on all platforms. With plain text this is easier.
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you're crossing that threshold, start with plain text, and write the interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
If you expect to edit the file by hand often (as with XML or JSON), then go with a human-readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?
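To illustrate the point that a converter between the two forms is cheap to write, here is a sketch that parses the question's string form of a polygon, (1, 2); (3, 4); (4, 1), into an array of vertices and formats it back. It does not touch the HDF5 API. The Vertex struct and both function names are invented for this example, and the accepted grammar is an assumption based on the single example string in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory ("binary") form of one polygon vertex.
struct Vertex {
    double x;
    double y;
};

// Parse text such as "(1, 2); (3, 4); (4, 1)" into vertices.
// Returns false and leaves `out` untouched if the text is malformed.
bool parsePolygon(const std::string& text, std::vector<Vertex>& out) {
    std::istringstream is(text);
    std::vector<Vertex> vertices;
    for (;;) {
        char open = 0, comma = 0, close = 0;
        Vertex v = {0.0, 0.0};
        if (!(is >> open >> v.x >> comma >> v.y >> close)) return false;
        if (open != '(' || comma != ',' || close != ')') return false;
        vertices.push_back(v);
        char sep = 0;
        if (!(is >> sep)) break;       // end of input after a complete vertex
        if (sep != ';') return false;  // anything else between vertices is an error
    }
    out.swap(vertices);
    return true;
}

// Format vertices back into the same human-readable form.
std::string formatPolygon(const std::vector<Vertex>& vertices) {
    std::ostringstream os;
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        if (i != 0) os << "; ";
        os << '(' << vertices[i].x << ", " << vertices[i].y << ')';
    }
    return os.str();
}
```

The parser needs error handling that the numeric form would not: with the string form, every reader has to agree on this small grammar, whereas a numeric matrix is described by its type and shape.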

Streaming Real and Debug Data To Disk in C++

What is a flexible way to stream data to disk in a c++ program in Windows?
I am looking to create a flexible stream of data that may contain arbitrary data (say time, average, a flag if reset, etc) to disk for later analysis. Data may come in at non-uniform, irregular intervals. Ideally this stream would have minimal overhead and be easily readable in something like MATLAB so I could easily analyze events and data.
I'm thinking of a binary file with a header file describing types of packets followed by a wild dump of data tagged with . I'm considering a lean, custom format but would also be interested in something like HDF5.
It is probably better to use an existing file format rather than a custom one. First you don't reinvent the wheel, second you will benefit from a well tested and optimized library.
HDF5 seems like a good bet. It is fast and reliable, and easy to read from MATLAB. It has some overhead, but that overhead buys great flexibility and compatibility.
This requirement sounds suspiciously like a "database".
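For comparison, the lean custom format considered in the question could look roughly like the sketch below: every packet carries a one-byte type tag and a two-byte length, followed by the raw payload. The layout, the type codes, and the names Packet, writePacket and readPacket are all assumptions made up for this illustration. Payloads are copied in the writer's native representation, so this is only a starting point, not a portable format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical packet: a type tag plus an opaque payload.
struct Packet {
    std::uint8_t type;
    std::vector<unsigned char> payload;
};

// Write one packet: [type:1][length:2, little-endian][payload:length].
void writePacket(std::ostream& os, std::uint8_t type,
                 const void* data, std::uint16_t size) {
    const unsigned char header[3] = {
        type,
        static_cast<unsigned char>(size & 0xFF),
        static_cast<unsigned char>(size >> 8)
    };
    os.write(reinterpret_cast<const char*>(header), sizeof header);
    os.write(static_cast<const char*>(data), size);
}

// Read the next packet; returns false at end of stream or on a short read.
bool readPacket(std::istream& is, Packet& p) {
    unsigned char header[3];
    if (!is.read(reinterpret_cast<char*>(header), sizeof header)) return false;
    p.type = header[0];
    const std::size_t size =
        header[1] | (static_cast<std::size_t>(header[2]) << 8);
    p.payload.resize(size);
    if (size > 0 &&
        !is.read(reinterpret_cast<char*>(p.payload.data()),
                 static_cast<std::streamsize>(size))) {
        return false;
    }
    return true;
}
```

Because each packet states its own length, packets can arrive at irregular intervals and a reader can skip types it does not know. An existing format such as HDF5 provides this kind of self-description without custom code, which is the argument made above for not reinventing it.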

Binary parser or serialization?

I want to store a graph of different objects for a game, their classes may or may not be related, they may or may not contain vectors of simple structures.
I want parsing operation to be fast, data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean, making objects serialize themselves, which is effective, but I will need to write different serialization methods for different objects for that.
By binary parsing/composing I mean, creating a new tree of parsers/composers that holds and reads data for these objects, and passing this around to have my objects push/pull their data.
I can also use JSON, but it can be pretty slow to read, and it is not very space-efficient when it comes to pretty big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one for binary. Personally, I'd go with text for everything except images (and other data which is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out Protocol Buffers from Google or Thrift from Apache. Although billed as ways to write wire protocols easily, they are basically object serialization mechanisms that can create bindings in a dozen languages, with an efficient binary representation, easy versioning, and fast performance, and they are well supported.
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.