C++ way to serialize a message? - c++

In my current project I have a few different interfaces that require me to serialize messages into byte buffers. I feel like I'm probably not doing it in a way that would make a true C++ programmer happy (and I'd like to).
I would typically do something like this:
struct MyStruct {
uint32_t x;
uint64_t y;
uint8_t z[80];
};
uint8_t* serialize(const MyStruct& s) {
uint8_t* buffer = new uint8_t[sizeof(s)];
uint8_t* temp = buffer;
memcpy(temp, &s.x, sizeof(s.x));
temp += sizeof(s.x);
//would also have put in network byte order...
... etc ...
return buffer;
}
Excuse any typos, that was just an example off the top of my head. Obviously it can get more complex if the structure I'm serializing has internal pointers.
So, I have two questions that are closely related:
Is there any problem in the specific scenario above with serializing by casting the struct directly to a char buffer assuming I know that the destination systems are in the same endianness?
Main question: Is there a better... erm... C++? way to do this aside from the addition of smart pointers? I feel like this is such a common problem that the STL probably handles it - and if it doesn't, I'm sure there's a better way to do it anyway using C++ mechanisms.
EDIT Bonus points if you can do a clean example of serializing this structure in a better way using standard C++/STL without added libraries.

You might want to take a look at Google Protocol Buffers (also known as protobuf). You define your data in a language neutral IDL and then run it through a generator to generate your C++ classes. It'll take care of byte ordering issues and can provide a very compact binary form.
By using this you'll not only be able to save your C++ data but it'll be useable in other languages (C#, Java, Python etc) as there are protobuf implementation available for them.

You should probably use either Boost::serialization or streams directly. More information in either of the links to the right.
Is it possible to serialize and deserialize a class in C++?

just voted AzPs reply as the answer, checking Boost first is the way to go.
in addition about your code sample:
1 - changing the signature of your serialization function to a method taking a file:
void MyStruct::serialize(FILE* file) // or stream
{
int size = sizeof(*this); // sizeof(this) would be the size of the pointer, not of the struct
fwrite(&size, sizeof(int), 1, file); // write size
fwrite(this, 1, size, file); // write raw bytes of struct
}
reduces the necessity to copy the struct.
2 - yes, your code makes the serialized bytes dependent on your platform, compiler and compiler settings. this is neither good nor bad: if the same binary writes and reads the serialized bytes, it might even be beneficial because of simplicity and performance. But it is not only endianness; packing and struct layout also affect compatibility. For instance, a 32bit and a 64bit build of your app may well lay out your struct differently. Finally, serializing the raw footprint also serializes padding bytes (the bytes the compiler might put between struct fields), an overhead undesirable for high traffic network streams (see google protocol buffers, which hunt for every bit they can save).
EDIT:
i see you added "embedded". yes, then such simple serialize / deserialize (mirror implementation of the above serialize) methods might be a good and simple choice.
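For the "standard C++ without added libraries" part of the question, here is a minimal sketch of field-by-field serialization of MyStruct into a std::vector. The helper names put_be, get_be and deserialize are made up for illustration, and big-endian (network) byte order is an assumption; the point is only that writing each field explicitly avoids both padding and host endianness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

struct MyStruct {
    std::uint32_t x;
    std::uint64_t y;
    std::uint8_t z[80];
};

// Append an unsigned integer to the buffer, most significant byte first.
template <typename T>
void put_be(std::vector<std::uint8_t>& out, T value) {
    for (std::size_t i = sizeof(T); i-- > 0;) {
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    }
}

// Read an unsigned integer stored most significant byte first; advances pos.
template <typename T>
T get_be(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + sizeof(T) <= in.size());  // debug-only bounds check
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        value = static_cast<T>((value << 8) | in[pos++]);
    }
    return value;
}

// The vector owns the memory, so there is no new[]/delete[] to get wrong.
std::vector<std::uint8_t> serialize(const MyStruct& s) {
    std::vector<std::uint8_t> out;
    out.reserve(sizeof s.x + sizeof s.y + sizeof s.z);
    put_be(out, s.x);
    put_be(out, s.y);
    out.insert(out.end(), std::begin(s.z), std::end(s.z));  // raw bytes, no order issue
    return out;
}

MyStruct deserialize(const std::vector<std::uint8_t>& in) {
    MyStruct s{};
    std::size_t pos = 0;
    s.x = get_be<std::uint32_t>(in, pos);
    s.y = get_be<std::uint64_t>(in, pos);
    assert(pos + sizeof s.z <= in.size());
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    std::copy(first, first + static_cast<std::ptrdiff_t>(sizeof s.z), s.z);
    return s;
}
```

The result is always 92 bytes (4 + 8 + 80), whatever sizeof(MyStruct) happens to be on the compiling platform. Real code would report a short buffer as an error instead of asserting.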

Related

Serialization doubts

I have to create an application layer protocol for a C++ application, but I have some doubts about how I can do it, especially about the serialization:
My idea is to create a class for describing the header, something like this:
class Header {
int type;
int length;
char[] message;
}
Now, in order to serialize it and to pass it through a socket, I'm thinking about using Boost Serialization. But my question is: is it "cross-platform"? In the sense that if I want to receive the data into a Python/Ruby/any-other-language server with its own class, can I do it or not (since I've serialized a C++ class)?
If not, is it useful to serialize the class data into a JSON/XML file and transmit it?
If I want to serialize an object into a string, do I have to pay attention to big/little-endian and/or the string encoding and/or other details?
Since not all machines use the same number of bytes for, for example, the primitive data types, is it necessary to use something like uint32_t data types to force the system to use a certain number of bytes?
Thank you very much!
I think you should check this answer first.
Serializing into a C-style string is fine as far as big/little endian woes go, but is not as good performance-wise.
Using C-style string serialization will mostly solve this problem as well. On read side you should make sure that when parsing numbers they don't exceed local data size.
If this is a serious project then maybe consider using JSON or XML.
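To make the point about checking the size on the read side concrete, here is a small sketch of text serialization for a 32-bit field. The function names encode_u32 and decode_u32 are hypothetical; the decoder rejects anything that is not a plain decimal number or that does not fit into 32 bits.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Write a number as decimal text; byte order never enters the picture.
std::string encode_u32(std::uint32_t value) {
    return std::to_string(value);
}

// Parse decimal text into a 32-bit value.
// Returns false for empty input, non-digits, or values above 4294967295.
bool decode_u32(const std::string& text, std::uint32_t& out) {
    if (text.empty() || text.size() > 10) return false;  // 10 digits is the maximum
    std::uint64_t value = 0;                             // wide enough for 10 digits
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        value = value * 10 + static_cast<std::uint64_t>(c - '0');
    }
    if (value > std::numeric_limits<std::uint32_t>::max()) return false;
    out = static_cast<std::uint32_t>(value);
    return true;
}
```

The sender can use whatever integer width it likes; the receiver decides what it is able to hold and refuses the rest.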

Is it possible to override std::istream read method?

I want to use std::istream to read data from a given class which only provides 2 methods:
// Returns a byte from the stream (consuming it)
uint8_t getChar(OwnIOStream stream);
// Makes the passed pointer point to the data in the stream
bool getCharBlockPtr(OwnIOStream stream, uint8_t** buffer, uint32_t maxSize, uint32_t* size);
I first thought of inheriting from streambuf and implementing the underflow method using the getChar() method. However, I would like to use getCharBlockPtr() instead to avoid copies of data (I assume calling underflow for each read byte will decrease the performance). The problem is that I need to know in advance the quantity of bytes I want to read each time. This is why I was wondering whether it would be possible to override the read method of istream.
Using boost is not an option.
Deriving from streambuf is a perfectly valid option.
Whether you need to do much (if any) more than override underflow depends. If OwnIOStream is already buffering data, then you might get adequate performance from just doing that. I'd start by doing that, and seeing how things go.
The next obvious step would be to use getCharBlockPtr. A streambuf has a setg that you use to set the pointers to the beginning, current position, and end of the buffer. At least if I'm interpreting things correctly, you'd call getCharBlockPtr and get the beginning and end. You'd then call setg to set the beginning and current positions to the beginning, and the end to the end. From there, the streambuf should be able to read data directly from the buffer. When it runs out, it'll call underflow, and you'll need to get more data again.
One note: it doesn't look like OwnIOStream supports a putback area, which streams normally do expect to support. Getting this to work correctly (especially putting back a character when at the beginning of a buffer) may be somewhat non-trivial to support correctly.
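A rough sketch of the underflow/setg approach described above. OwnIOStream is not available here, so FakeSource and this version of getCharBlockPtr (taking the source by pointer) are stand-ins invented for the example; only the BlockStreamBuf part shows the technique.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for OwnIOStream: hands out pointers into its own storage.
struct FakeSource {
    std::string data;
    std::size_t pos = 0;
};

// Same shape as the question's getCharBlockPtr, but for the stand-in type.
bool getCharBlockPtr(FakeSource* src, std::uint8_t** buffer,
                     std::uint32_t maxSize, std::uint32_t* size) {
    const std::size_t left = src->data.size() - src->pos;
    if (left == 0) return false;
    const std::uint32_t n =
        left < maxSize ? static_cast<std::uint32_t>(left) : maxSize;
    *buffer = reinterpret_cast<std::uint8_t*>(&src->data[src->pos]);
    *size = n;
    src->pos += n;
    return true;
}

class BlockStreamBuf : public std::streambuf {
public:
    BlockStreamBuf(FakeSource* src, std::uint32_t blockSize)
        : src_(src), blockSize_(blockSize) {}

protected:
    // Called by the stream whenever the current get area is exhausted.
    int_type underflow() override {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        std::uint8_t* block = nullptr;
        std::uint32_t size = 0;
        if (!getCharBlockPtr(src_, &block, blockSize_, &size) || size == 0)
            return traits_type::eof();
        char* begin = reinterpret_cast<char*>(block);
        setg(begin, begin, begin + size);  // no copy: point at the source's memory
        return traits_type::to_int_type(*gptr());
    }

private:
    FakeSource* src_;
    std::uint32_t blockSize_;
};
```

Wrapping it is then just `std::istream in(&buf);` followed by the usual `in >> value;`. As noted above, putback at the start of a block is not handled in this sketch; the default pbackfail simply reports failure.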
The general rule is that most classes from the standard C++ library are not meant to be derived. The exception here is the streambuf class which is intended to be the low level interface between an IO capable object and a C++ stream.
So the standard way here would be to build a custom streambuf calling getChar or better getCharBlockPtr that would allow buffering. Jerry's answer explains how to do it.
But it looks like you have special requirement here:
you know at a higher level the number of bytes you want to read
you are in a performance critical operation and want to avoid buffer copies if possible - but this last point should be reviewed after profiling...
Then, asking for the full power of a C++ stream may not be the way to go. Building a custom class that implements only the istream methods you need to call, plus a few extractors, would still give nice code at the high level and could give more optimized code at the low level.
But you and only you can know:
whether your higher level code requires a C++ stream - then a custom class is not an option and you will have to use a custom streambuf
whether you could directly give the expected size to the streambuf from the calling code as a hint. More or less:
class OwnStreamBuf: public std::streambuf {
...
void sizehint(size_t hint) { /* size for next read access */
this->hint = hint;
}
int_type underflow() {
...
cr = getCharBlockPtr(stream, buffer, hint, size); /* use hint as maxSize */
...
}
};
...
OwnStreamBuf buf(...);
class OwnIStream: public istream ...
OwnIStream is(buf);
...
buf.sizehint(n);
is >> special_obj;
...
whether you only need a handful of extractors and a custom class will be simpler to build from scratch.

Best way of serializing/deserializing a simple protocol in C++

I want to build a simple application protocol using Berkeley sockets in C++ on Linux. The transport layer should be UDP, and the protocol will contain the following two parts:
The first part:
It is a fixed part which is the protocol Header with the following fields:
1. int HeaderType
2. int TransactionID
3. unsigned char Source[4]
4. unsigned char Destination[4]
5. int numberoftlvs
The second part
It will contain variable number of TLVs, each TLV will contain the following fields:
1. int type
2. int length
3. unsigned char *data "Variable length"
My question is about preparing the message to be sent over the wire: what's the best way to do serialization and deserialization so that it is portable across all systems, little-endian and big-endian alike?
Should I prepare a big buffer of "unsigned char", and start copying the fields one by one to it? And after that, just call send?
If I am going to follow the previous way, how can I keep track of the pointer to where to copy my fields? My guess would be to build for each datatype a function which will know how many bytes to move the pointer, correct?
If someone can provide me with a well explained example, it will be much appreciated.
some ideas... in no particular order... and probably not making sense all together
You can have a Buffer class. This class contains a raw memory pointer to where you are composing your message, and it can contain counters or pointers to keep track of how much you have written, where you're writing and how far you can go.
Probably you would like to have one instance of the Buffer class for each thread reading/writing. No more, because you don't want to have expensive buffers like this around. Bound to a specific thread because you don't want to share them without locking (and locking is expensive)
Probably you would like to reuse a Buffer from one message to the next, avoiding the cost of creating and destroying it.
You might want to explore the idea of a Decorator: a class that inherits from or contains each of your data classes. In this case the idea for this decorator is to contain the methods to serialize and deserialize each of your data types.
One option for this is to make the Decorator a template and use class template specialization to provide the different formats.
Combining the Decorator methods and Buffer methods you should have all the control you need.
You can have in the Buffer class magical templated methods that take an arbitrary object as a parameter, automatically create a Decorator for it and serialize it.
Conversely, deserializing should get you a Decorator that should be made convertible to the decorated type.
I'm sorry I don't have the time right now to give you a full blown example, but I hope the above ideas can get you started.
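As a starting point, the Buffer idea alone (without the Decorator layer) might look roughly like this. Everything here is an assumption made for the sketch: the names Buffer, Tlv, Header and writeMessage, the use of fixed-width uint32_t in place of the question's int fields, and big-endian order on the wire.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-capacity buffer with a write cursor; meant to be reused per thread.
class Buffer {
public:
    explicit Buffer(std::size_t capacity) : bytes_(capacity), used_(0) {}

    void reset() { used_ = 0; }  // start the next message in the same memory
    const std::uint8_t* data() const { return bytes_.data(); }
    std::size_t size() const { return used_; }

    // 32-bit value in network (big-endian) byte order.
    void putU32(std::uint32_t v) {
        assert(used_ + 4 <= bytes_.size());  // debug-only bounds check
        for (int shift = 24; shift >= 0; shift -= 8)
            bytes_[used_++] = static_cast<std::uint8_t>(v >> shift);
    }

    // Raw bytes are copied as-is; the cursor moves by n.
    void putBytes(const std::uint8_t* src, std::size_t n) {
        assert(used_ + n <= bytes_.size());
        if (n != 0) std::memcpy(bytes_.data() + used_, src, n);
        used_ += n;
    }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t used_;
};

struct Tlv {
    std::uint32_t type;
    std::vector<std::uint8_t> data;  // length is data.size()
};

struct Header {
    std::uint32_t headerType;
    std::uint32_t transactionId;
    std::uint8_t source[4];
    std::uint8_t destination[4];
};

// Header fields first, then the TLV count, then each TLV in turn.
void writeMessage(Buffer& buf, const Header& h, const std::vector<Tlv>& tlvs) {
    buf.reset();
    buf.putU32(h.headerType);
    buf.putU32(h.transactionId);
    buf.putBytes(h.source, 4);
    buf.putBytes(h.destination, 4);
    buf.putU32(static_cast<std::uint32_t>(tlvs.size()));
    for (const Tlv& t : tlvs) {
        buf.putU32(t.type);
        buf.putU32(static_cast<std::uint32_t>(t.data.size()));
        buf.putBytes(t.data.data(), t.data.size());
    }
}
```

After writeMessage, buf.data() and buf.size() are what you would hand to send or sendto. Each put method knows how far to move the cursor, which answers the "how do I track the pointer" part of the question.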
As an example I shamelessly plug my own (de)serialization library that's packing into msgpackv5 format:
flurry

Datastructure Storage in Filesystem

I am trying to write a persistent datastructure in C++ , however I feel that I should be able to make it binary compatible with various other implementations of my datastructure readers, and hence, my current idea is to declare datastructure in the native memory without any abstraction.
For example, I would specify a linear block of memory as a datastructure (using new keyword) and then describe what the first byte means, what the second byte means and so on. I know I can do this using struct but then, the datastructure would be bound to one language and other languages will have to then use this structure. Also, the implementation might then change from compiler to compiler. I would instead like it as a memory standard.
Is what I am trying to do somewhat sensible? Or I am trying to over-simplify things and should really proceed with a struct data structure? Now onto the C++ part, if you believe that I should be using a struct data structure, then what are the disadvantages of using a full-fledged class?
(I am using a class anyway to wrap around the memory structure and provide functions to it since the datastructure is anyway persistent.)
EDIT
As Justin has suggested, I do not need any such advanced interface wrapper around the memory structure, so my last point about a class wrapper was not stated properly. What I mean is that I would like to have a class interface for the memory representation; it does not necessarily have to be a wrapper.
Several file formats I have read/worked with do exactly that -- define a memory standard or layout, then typically back it up with a demonstration in C-like pseudo-structure. Sometimes they will provide struct or class representations, and some are completely abstracted by a library. Of course, these formats go on to document all fields, their sizes, the endianness of the data and so on.
I figure endian related issues, padding, complexity (e.g. introduced by variations in the data structures) and proper versioning are the biggest sources of errors. Another issue I find is the use of data structures of yesteryear and inconsistency of data structures used to represent similar functionalities -- You may receive a spec, and realize it contains several different string representations -- all of which are archaic, and somebody has to go on to support all of these (bidirectionally).
Proceeding that route:
You should not commit to a binary representation (or compilable program) if you don't want to support it (and attempts at long-lived formats fail/stumble along the way, as platforms and toolsets change). Just commit to a formal memory standard at first, then build on top of that with tests and example input files to verify the representation is serialized and deserialized correctly. A very basic test suite will help ensure your model is portable on all the systems you need, and can point out potential pitfalls or platform specific considerations you may not have been aware of.
If you really want to provide a compilable representation, I'd stick with a very compliant struct representation -- clients can take that (in memory) representation and turn it into any C++ abstraction/representation they like. That is to say, a serialized representation should probably not reflect that of a representation in memory, apart from trivially simple representations and the intermediate storage of such a representation (flattened and packed structs).
One of the important parts is that you should have tests which confirm that the in-memory object graph you create with these structs is forward and backwards serializable and de-serializable, and supports proper versioning -- so it often takes a bit of work to make a complex serialized representation compatible. So you see this approach just introduces one abstraction layer on top of another. In this regard, you may want to give the C++ abstraction the ability to create itself from the packed in-memory representation, and to ensure that that representation can also correctly populate the packed structure without data loss.
Beyond that, is there any need to have a more advanced interface? If there is, then you may want to provide that information.
So yes, the memory standard is the part that you must get correct and stable, and which all implementations should refer to and test against -- regardless of platform/architecture differences. IOW, you're on the right track ;)
In C++ there's no practical difference between struct and class (besides the default accessibility being public in struct). Traditionally, struct is used when a type only has (public) member variables and no member functions but this is only a convention, not a rule enforced by the compiler.
I'd certainly use a struct/class to describe the data. If someone wants to write a reader of your data structure, they can either import your header file or implement the data structure in their language of choice - in most programming languages this should be pretty simple.
I recommend you start your structure something like this:
typedef struct
{
int Version; // struct layout version
int ByteSize; // byte size of structure for validation
...
} MYDATA;
This way when your data structure is being passed around, your code can verify that the allocated structure size matches with how many bytes you'd expect for a given version of your structure. You could then easily introduce new versions of your structure by simply updating the version field and checking for the new size.
When you save your data to disk, make sure that you write it out field-by-field, rather than through a single write (using a pointer and sizeof()), to ensure that other languages won't have to deal with potential padding that your C++ compiler may decide to put in. It's possible to manually lay out fields in the structure so that there's no padding, but you have to be very, very careful while doing that and it's easy to make mistakes.
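A sketch of what writing field-by-field could look like with standard streams. The flags field, the constants kVersion and kSerializedSize, the helper names and the choice of little-endian on disk are all assumptions for the example; the version/size check mirrors the validation idea above.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct MyData {
    std::uint32_t version;   // struct layout version
    std::uint32_t byteSize;  // serialized size in bytes, for validation
    std::uint16_t flags;     // hypothetical payload field
};

const std::uint32_t kVersion = 1;
const std::uint32_t kSerializedSize = 4 + 4 + 2;  // on disk there is no padding

// Write an unsigned value one byte at a time, least significant byte first.
template <typename T>
void writeLe(std::ostream& out, T value) {
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out.put(static_cast<char>((value >> (8 * i)) & 0xFF));
}

// Read it back the same way; returns false if the stream runs out.
template <typename T>
bool readLe(std::istream& in, T& value) {
    value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        char c;
        if (!in.get(c)) return false;
        const T byte = static_cast<T>(static_cast<unsigned char>(c));
        value = static_cast<T>(value | (byte << (8 * i)));
    }
    return true;
}

void writeRecord(std::ostream& out, const MyData& d) {
    writeLe(out, d.version);
    writeLe(out, d.byteSize);
    writeLe(out, d.flags);
}

// Rejects records whose version or declared size is not the one we understand.
bool readRecord(std::istream& in, MyData& d) {
    if (!readLe(in, d.version) || !readLe(in, d.byteSize)) return false;
    if (d.version != kVersion || d.byteSize != kSerializedSize) return false;
    return readLe(in, d.flags);
}

// Convenience wrapper: serialize into an in-memory byte string.
std::string toBytes(const MyData& d) {
    std::ostringstream out;
    writeRecord(out, d);
    return out.str();
}
```

On a typical compiler sizeof(MyData) is 12 because of trailing padding, while the record written here is always 10 bytes. For real files, open the stream with std::ios::binary.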

Passing raw data in C++

Up until now, whenever I wanted to pass some raw data to a function (like a function that loads an image from a buffer), I would do something like this:
void Image::load(const char* buffer, std::size_t size);
Today I took a look at the Boost libraries, more specifically at the property_tree/xml_parser.hpp header, and I noticed this function signature:
template<typename Ptree>
void read_xml(std::basic_istream<typename Ptree::key_type::value_type>&,
Ptree &, int = 0);
This actually made me curious: is this the correct way to pass around raw data in C++, by using streams? Or am I misinterpreting what the function is supposed to be used for?
If it's the former, could you please point me to some resource where I can learn how to use streams for this? I haven't found much myself (mainly API references), and I haven't been able to find the Boost source code for the XML parser either.
Edit: Some extra details
Seems there's been some confusion as to what I want. Given a data buffer, how can I convert it to a stream such that it is compatible with the read_xml function I posted above? Here's my specific use case:
I'm using the SevenZip C library to read an XML file from an archive. The library will provide me with a buffer and its size, and I want to put that in stream format such that it is compatible with read_xml. How can I do that?
Well, streams are widely used in C++ because of their conveniences:
- error handling
- they abstract away the data source, so whether you are reading from a file, an audio source, a camera, they are all treated as input streams
- and probably more advantages I don't know of
Here is an overview of the IOstream library, perhaps that might better help you understand what's going on with streams:
http://www.cplusplus.com/reference/iostream/
Understanding what they are exactly will help you understand how and when to use them.
There's no single correct way to pass around data buffers. A combination of pointer and length is the most basic way; it's C-friendly. Passing a stream might allow for sequential/chunked processing, i.e. not storing the whole file in memory at the same time. If you want to pass a mutable buffer (that might potentially grow), a vector<char>& would be a good choice.
Specifically on Windows, a HGLOBAL or a section object handle might be used.
The C++ philosophy explicitly allows for many different styles, depending on context and environment. Get used to it.
Buffers of raw memory in C++ can either be of type unsigned char*, or you can create a std::vector<unsigned char>. You typically don't want to use just a char* for your buffer since whether plain char is signed or unsigned is implementation-defined (i.e., it varies by platform/compiler), which makes byte values above 127 awkward to handle. That being said, streams have some excellent uses as well, considering that you can use a stream to read bytes from a file or some other input, etc., and from there, store that data in a buffer.
Seems there's been some confusion as to what I want. Given a data buffer, how can I convert it to a stream such that it is compatible with the read_xml function I posted above?
Easily (I hope Ptree::key_type::value_type would be something like char):
std::istringstream stream(std::string(data, len));
read_xml(stream, ...);
More on string streams here.
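Here is a self-contained sketch of two ways to turn a (pointer, length) buffer into a std::istream. firstLine is only a stand-in for read_xml (any function taking a std::istream& would do), and MemoryBuf is a name made up for this example. The first variant copies the buffer, as the snippet above does; the second wraps the existing memory in a small read-only streambuf.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Stand-in for read_xml: any function that consumes a std::istream.
std::string firstLine(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}

// Option 1: copy the buffer into a string stream.
std::string viaStringStream(const char* data, std::size_t len) {
    const std::string copy(data, len);
    std::istringstream stream(copy);
    return firstLine(stream);
}

// Option 2: read-only streambuf over the existing buffer, no copy.
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(const char* data, std::size_t size) {
        char* p = const_cast<char*>(data);  // setg wants char*; nothing is written
        setg(p, p, p + size);               // begin, current, end of the get area
    }
};

std::string viaMemoryBuf(const char* data, std::size_t len) {
    MemoryBuf buf(data, len);
    std::istream stream(&buf);
    return firstLine(stream);
}
```

The copying version is simpler and owns its data. The MemoryBuf version requires the buffer to outlive the stream and, as written, does not support seeking.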
This is essentially using a reference to pass the stream. Behind the scenes it's rather similar to what you did so far, just with a different notation. Simplified, the reference hides the pointer aspect, so in your Boost example you're essentially working with a pointer to the stream.
References have the advantage of avoiding all the explicit referencing/dereferencing and are therefore easier to handle in most situations. However, they don't allow multiple levels of (de-)referencing.
The following two example functions do essentially the same:
void change_a(int &var, myclass &cls)
{
var = cls.convert();
}
void change_b(int *var, myclass *cls)
{
*var = cls->convert();
}
Talking about the passed data itself: It really depends on what you're trying to achieve and what's more efficient. If you'd like to modify a string, using a std::string object might be more convenient than using a classic pointer to a buffer (char *). Streams have the advantage that they can represent several different things (e.g. a data stream on the network, a compressed stream, or simply a file or memory stream). This way you can write a single function or method that accepts a stream as input and works without worrying about the actual stream source. Doing this with classic buffers can be more complicated. On the other hand, you shouldn't forget that all objects add some overhead, so depending on the job to be done a simple pointer to a character string might be perfectly fine (and the most efficient solution). There's no "the one way to do it".