Need library for binary stream serialization, C++

What I'm looking for is similar to the serialization library built into RakNet (which I cannot use on my current project). I need to be able to save/load binary streams in a custom format locally, and also send them over a network. The networking part is solved, but I really don't want to write my own methods for serializing all of my different types of data into binary, especially since it would be inefficient without any compression.
Here's some pseudocode similar to how RakNet's bitstreams work, this is along the lines of what I'm looking for:
class Foo
{
public:
    void save(BitStream& out)
    {
        out.write(m_someInt);
        out.write(m_someBool);
        m_someBar.save(out);

        // Alternative syntax
        out.write<int>(m_someInt);

        // Also, you can define custom serialization for custom types so you can do this...
        out.write<Bar>(m_someBar);
        // Or this...
        out.write(m_someBar);
    }

    void load(BitStream& in)
    {
        in.read(m_someInt);
        in.read(m_someBool);
        in.read(m_someBar);
    }

private:
    int m_someInt;
    bool m_someBool;
    Bar m_someBar;
};
Are there any free C++ libraries out there that allow for something like this? I basically just want something to pack my data into binary and compress it for serialization, and then decompress it back into binary that I can feed back into my objects.
EDIT, adding more information:
Unfortunately, neither Google Protocol Buffers nor Boost Serialization will work for my needs. Both expect to serialize object members; I need to simply serialize data. For example, let's say I have a std::vector<Person>, and the class Person has a std::string for a name, plus other data, but I only want to serialize and deserialize the names. Google Protocol Buffers expects me to hand it the Person object as a whole for serialization. I can achieve this with Boost Serialization, but if I then have another scenario where the entire Person needs to be serialized, there is no way to do both: you either serialize all of it or none of it. Basically, I need quite a bit of flexibility to craft the binary stream however I see fit; I just want a library to help me manage reading and writing binary data to/from the stream, and compressing/decompressing it.

- Google Protocol Buffers
- Boost Serialization
UPDATE
Looking at the updated question, I think it might be easiest to write a small custom library that does exactly what is required. I have a similar one, and it is only a few hundred lines of code (without compression). It is extremely easy to write unit tests for this kind of code, so it can be reliable from day one.
To serialize custom types, I have a Persistent base class that has save and load methods:
class Storage {
public:
    void writeInt( int i );
    void writeString( string s );
    int readInt();
    string readString();
};

class Persistent {
public:
    virtual void save( Storage & storage ) = 0;
    virtual void load( Storage & storage ) = 0;
};

class Person : public Persistent {
private:
    int height;
    string name;

public:
    void save( Storage & storage ) {
        storage.writeInt( height );
        storage.writeString( name );
    }

    void load( Storage & storage ) {
        height = storage.readInt();
        name = storage.readString();
    }
};
And then there's a simple layer on top of that that stores some type information when saving and uses a Factory to create new objects when loading.
This could be further simplified by using C++'s streams (which I don't like very much, hence the Storage class), or by copying Boost's approach of using the & operator to merge load and save into a single method; a sketch of that operator-based approach follows.
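For illustration, here is a minimal sketch of that &-operator idea, reusing the Storage and Person names from above. The split into separate reading and writing archive types is my own assumption, loosely modeled on how Boost.Serialization does it:

// One archive type per direction; both expose operator& so that a single
// serialize() method can describe the member layout exactly once.
class OutArchive {
public:
    explicit OutArchive( Storage & s ) : storage( s ) {}
    OutArchive & operator&( int & i )    { storage.writeInt( i ); return *this; }
    OutArchive & operator&( string & s ) { storage.writeString( s ); return *this; }
private:
    Storage & storage;
};

class InArchive {
public:
    explicit InArchive( Storage & s ) : storage( s ) {}
    InArchive & operator&( int & i )    { i = storage.readInt(); return *this; }
    InArchive & operator&( string & s ) { s = storage.readString(); return *this; }
private:
    Storage & storage;
};

class Person {
public:
    // Instantiated with either archive type; saving and loading share one method.
    template <typename Archive>
    void serialize( Archive & ar ) {
        ar & height & name;
    }
private:
    int height;
    string name;
};

The nice property is that the field list exists in only one place, so save and load can never drift out of sync.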

Related

Which design pattern should I use in this case?

I have a class named DS which can (1) read data from a file and build a data structure from scratch accordingly, or (2) read a pre-built data structure from a file. I originally wrote:
class DS
{
    DS(std::string file_name, bool type);
};
where file_name is the file to read and type specifies what we are reading: data or a pre-built data structure. This method is not very elegant, as far as I am concerned. I also tried the following:
class DS
{
    DS(std::string file_name);
    void CreateFromData();
    void ReadExisting();
};
But because modification is not allowed once built, I do not want the user to first call CreateFromData and then ReadExisting.
Are there some design patterns to address this issue?
Use static factory functions if the constructor signature isn't semantic enough. No need to get fancy with it.
class DS {
private:
    enum class Source { FromExisting, FromData };

    DS(const std::string& path, Source type);

public:
    static DS ReadExisting(const std::string& path) {
        return DS(path, Source::FromExisting);
    }

    static DS CreateFromData(const std::string& path) {
        return DS(path, Source::FromData);
    }
};
/* ... */
DS myData = DS::ReadExisting("...");
Here's how I'd do it:
Create two subclasses of a new DataFetch class, CreateFromData and ReadExisting, all three having a getData method. Create another "DataManager" class which holds an instance of DataFetch; it is the DataManager's responsibility to create the appropriate object based on user input (you could have two constructors for that). Now your DS constructor takes the DataManager object created in the previous step and asks it to fill the current DS object via getData.
This allows your design to add more types of data fetching later on, while removing any coupling between DS and the data fetching; a sketch follows.
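A minimal sketch of that arrangement. The class names come from the answer's description, but the exact interfaces (Data, the unique_ptr wiring, the getData signatures) are my own assumptions:

#include <memory>
#include <string>

struct Data { /* whatever DS is built from */ };

// Abstract fetching strategy.
class DataFetch {
public:
    virtual ~DataFetch() = default;
    virtual Data getData() const = 0;
};

class CreateFromData : public DataFetch {
public:
    explicit CreateFromData(std::string file) : file_(std::move(file)) {}
    Data getData() const override { /* parse raw data, build structure */ return Data{}; }
private:
    std::string file_;
};

class ReadExisting : public DataFetch {
public:
    explicit ReadExisting(std::string file) : file_(std::move(file)) {}
    Data getData() const override { /* load pre-built structure */ return Data{}; }
private:
    std::string file_;
};

// Owns the chosen strategy; DS never sees which one was picked.
class DataManager {
public:
    explicit DataManager(std::unique_ptr<DataFetch> fetch) : fetch_(std::move(fetch)) {}
    Data getData() const { return fetch_->getData(); }
private:
    std::unique_ptr<DataFetch> fetch_;
};

class DS {
public:
    explicit DS(const DataManager& manager) : data_(manager.getData()) {}
private:
    Data data_;
};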
Essentially, as the user of DS you input a file path and expect to get back a data structure corresponding to the file’s content. You should not have to worry about the data format in the file at all. That’s an implementation detail of the storage format and should be a part of the loading logic.
So, this is what I would do:
Put a format ID at the beginning of each data file, identifying which storage format it uses. Or maybe even different file extensions are sufficient.
When reading the file the format ID determines which concrete loading logic is used.
Then the user of DS only has to provide the file name. The storage format is transparent.
Ideally you simplify the API and get rid of DS. All your caller sees and needs is a simple function:
// in the simplest case
OutputData load_data_from_file(const std::string& filepath);
// for polymorphic data structures
std::unique_ptr<IOutputData> load_data_from_file(const std::string& filepath);
That fits the use-case exactly: “I have a path to a data file. Give me the data from that file.”. Don’t make me deal with file loader classes or similar stuff. That’s an implementation detail. I don’t care. I just want that OutputData. ;)
If you only have the two current storage formats and that’s unlikely to change, don’t overcomplicate the logic. A simple if or switch is perfectly fine, for instance:
OutputData load_data_from_file(const std::string& filepath)
{
    const auto format_id = /* load ID from the file */;
    if (format_id == raw) {
        return /* call loading logic for the raw format */;
    }
    else if (format_id == prebuilt) {
        return /* call loading logic for the prebuilt format */;
    }
    throw InvalidFormatId();
}
Should things become more complicated later you can add all the needed polymorphic file loader class hierarchies, factories or template magic then.
Option 1: Enumeration Type
You essentially have two different modes for reading the data, which you differentiate via the parameter bool type. This is bad form for a number of reasons, not the least of which is that it's unclear what the two modes are, or even which mode true refers to vs. false.
The simplest way to remedy this is to introduce an enumeration type, which contains a named value for all possible types. This would be a minimalistic change:
class DS
{
public:
    enum class mode
    {
        build, read
    };

    DS(const std::string &file_name, mode m);
};
So then we could use it as:
DS obj1("something.dat", DS::mode::build); // build from scratch
DS obj2("another.dat", DS::mode::read); // read pre-built
This is the method that I would use, as it's very flexible and extensible if you ever want to support other modes. But the real benefit is clarity at the call site as to what's happening. true and false are often obscure when used as function arguments.
Option 2: Tagged Constructors
Another option to differentiate these functions which is common enough to mention is the notion of tagged constructors. This effectively amounts to adding a unique type for each mode you want to support and using it to overload the constructors.
class DS
{
public:
    static inline struct build_t {} build;
    static inline struct read_t {} read;

    DS(const std::string &file_name, build_t); // build from scratch
    DS(const std::string &file_name, read_t);  // read pre-built
};
So then we could use it as:
DS obj1("something.dat", DS::build); // build from scratch
DS obj2("another.dat", DS::read); // read pre-built
As you can see, the types build_t and read_t are introduced to overload the constructor. Indeed, when this technique is used we don't even name the parameter because it's purely a means of overload resolution. For a normal method we'd typically just make the function names different, but we can't do that for constructors, which is why this technique exists.
A convenience I added was defining static instances of these two tag types: build and read, respectively. If these were not defined we would have to type:
DS obj1("something.dat", DS::build_t{}); // build from scratch
DS obj2("another.dat", DS::read_t{}); // read pre-built
Which is less aesthetically pleasing. The use of inline is a C++17 feature that makes it so that we don't have to separately declare and define the static variables. If you're not using C++17 remove inline and define the variables in your implementation file as usual for a static member.
Of course, this method relies on overload resolution and is thus resolved at compile time. This makes it less flexible than the enumeration method, because the mode cannot be chosen at runtime, which your project could conceivably need later down the road.

How to use streams in abstract classes

This might seem like an opinion question, but I'm really looking for some good ways to go about doing this. So what is "this"? I basically want to have an abstract class named, let's say, Repo. This class is going to define the abstraction for what a Repo should be capable of doing. In this case, I just want to be able to save something: you provide a name and data, and it's supposed to store it for you. Then we can have a FileRepository that saves to disk, an S3Repository that stores things in AWS S3, or even a MemoryRepository where we just keep them in memory.
Great, but how do I abstract this out? Obviously I could get the bytes, and each derived class would use its own stream to save them, but what if the data is large and we don't want to load it all into memory? Say we want to save a 5 GB file; we don't want to load that into memory.
I looked at the AWS SDK for C++, and it seems they take a lambda with an ostream in it so you can write content. I tried to mimic something like that here, so you can either just pass your istream, or give a lambda that takes an ostream and does whatever its heart desires.
Just wondering if there is a better way to do this? It's often difficult to find good practices in C++ since there are a billion ways to do the same thing and many people do things very differently. I'd just love some insight here. I'm fairly new to C++, so a good explanation would be highly appreciated.
class Repo {
public:
    virtual ~Repo() = default;

    virtual void add_with_ostream(const string& name, const std::function<void (ostream&)>& f) = 0;

    template<typename T>
    void add(const string& name, const T& data) {
        this->add_with_ostream(name, [&data](ostream& output_stream) {
            output_stream << data;
        });
    }

    // Note: the source stream is taken by non-const reference, since reading
    // from it consumes its contents.
    virtual void add_with_istream(const string& name, istream& input_stream) {
        this->add_with_ostream(name, [&input_stream](ostream& output_stream) {
            output_stream << input_stream.rdbuf();
        });
    }
};
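For what it's worth, here is a minimal sketch of how a concrete backend might look under this interface, just to check that the abstraction holds up. The FileRepository name, the root-directory handling, and the error strategy are all my own assumptions:

#include <filesystem>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>

using std::string; using std::ostream;

class FileRepository : public Repo {
public:
    explicit FileRepository(std::filesystem::path root) : m_root(std::move(root)) {}

    void add_with_ostream(const string& name, const std::function<void (ostream&)>& f) override {
        // The callback streams straight into the file, so a 5 GB payload is
        // never buffered in memory as a whole.
        std::ofstream output(m_root / name, std::ios::binary);
        if (!output) throw std::runtime_error("cannot open " + name);
        f(output);
    }

private:
    std::filesystem::path m_root;
};

The inherited add and add_with_istream then work unchanged: both funnel into add_with_ostream, which is the only method each backend has to implement.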

Prevent misuse of structures designed solely for transport

I work on a product which has multiple components, most of them are written in C++ but one is written in C. We often have scenarios where a piece of information flows through each of the components via IPC.
We define these messages using structs so we can pack them into messages and send them over a message queue. These structs are designed for 'transport' purposes only and are written in a way that serves only that purpose. The problem I'm running into is this: programmers are holding onto the struct and using it as a long-term container for the information.
In my eyes this is a problem because:
1) If we change the transport structure, all of their code is broken. There should be encapsulation here so that we don't run into this scenario.
2) The message structs are very awkward and are designed only to transport the information... It seems highly unlikely that this struct would also happen to be the most convenient form for accessing this data (long term) in these other components.
My question is: How can I programmatically prevent this mis-usage? I'd like to enforce that these structures are only able to be used for transport.
EDIT: I'll try to provide an example here the best I can:
struct smallElement {
    int id;
    int basicFoo;
};

struct mediumElement {
    int id;
    int basicBar;
    int numSmallElements;
    struct smallElement smallElements[MAX_NUM_SMALL];
};

struct largeElement {
    int id;
    int basicBaz;
    int numMediumElements;
    struct mediumElement mediumElements[MAX_NUM_MEDIUM];
};
The effect is that people just hold on to 'largeElement' rather than extracting the data they need from largeElement and putting it into a class which meets their needs.
When I define message structures (in C++; this is not valid in C) I make sure that:
the message object is copyable
the message object can be built only once
the message object can't be changed after construction
I'm not sure if the messages will still be PODs, but I guess it's equivalent from the memory point of view.
The things to do to achieve this:
have one unique constructor that sets up every member
have all members private
have const member accessors
For example you could have this :
struct Message
{
    int id;
    long data;
    Info info;
};
Then you should have this :
class Message // struct or whatever, just make sure public/private are correctly set
{
public:
    Message( int id, long data, Info info ) : m_id( id ), m_data( data ), m_info( info ) {}

    int id() const { return m_id; }
    long data() const { return m_data; }
    Info info() const { return m_info; }

private:
    int m_id;
    long m_data;
    Info m_info;
};
Now users will be able to build the message and read from it, but not change it along the way, making it unusable as a long-term data container. They could still store a message, but since they can't modify it later, it's only useful as a snapshot.
OR.... You could use a "black box".
Separate the message layer into a library, if it isn't already.
The client code shouldn't be exposed to the message struct definitions at all. So don't provide the headers, or hide them somehow.
Now, provide functions to send the messages, which will (internally) build the messages and send them. That will even make the code easier to read.
When receiving messages, provide a way to notify the client code. But don't provide the messages directly! Keep them somewhere (maybe temporarily, or using a lifetime rule or something) inside your library, maybe in a kind of manager, whatever, but do keep them INSIDE THE BLACK BOX. Just provide a kind of message identifier.
Provide functions to get information from the messages without exposing the struct. There are several ways to achieve this. In this case, I would provide functions gathered in a namespace; those functions would take the message identifier as their first parameter and return one piece of data from the message (which could be a full object if necessary). See the sketch below.
That way, users simply can't use the structs as data containers, because they don't have their definitions; they can only access the data.
There are two problems with this: the obvious performance cost, and it's clearly heavier to write and change. Maybe using a code generator would be better; Google Protobuf is full of good ideas in this domain.
But the best way would be to make them understand why their way of doing things will break sooner or later.
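For illustration, a minimal sketch of the black-box API described above. All the names here are hypothetical; the point is that client code only ever sees an opaque identifier, never the transport struct:

#include <cstdint>
#include <string>

// Opaque handle handed to client code; reveals nothing about the wire format.
using MessageId = std::uint64_t;

namespace messaging {
    // Sending: the library builds and sends the struct internally.
    void sendPersonUpdate(int height, const std::string& name);

    // Receiving: the library notifies the client with a MessageId only.
    using Callback = void (*)(MessageId id);
    void onMessageReceived(Callback cb);

    // Field accessors: the only way to read a received message.
    int getHeight(MessageId id);
    std::string getName(MessageId id);
}   // namespace messaging

Since the struct definitions never leave the library, changing the transport layout only requires recompiling the library, not the clients.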
The programmers are doing this because it's the path of least resistance to getting the functionality they want. It may be easier for them to access the data if it were in a class with proper accessors, but then they'd have to write that class and write conversion functions.
Take advantage of their laziness and make the easiest path for them the right thing. For each message struct you create, create a corresponding class for storing and accessing the data, with a nice interface and conversion methods that make it a one-liner to put the message into the class. Since the class has nicer accessor methods, it will be easier for them to use it than to do the wrong thing, e.g.:
msg_struct inputStruct = receiveMsg();
MsgClass msg(inputStruct);
msg.doSomething();
...
msg_struct outputStruct = msg.toStruct();
Rather than finding ways to force them not to take the easy way out, make the way you want them to use the code the easiest way. The fact that multiple programmers are using this antipattern makes me think there is a piece missing from the library that should be provided by the library to accommodate this. You are pushing the creation of this necessary component back onto the users of the code, and then not liking the solutions they come up with.
You could implement them in terms of const references, so that the server side constructs the transport struct but client code is only allowed to hold const references to it and can't actually instantiate or construct one. That enforces the usage you want.
Unfortunately, without code snippets of your messages, packaging, correct usage, and incorrect usage, I can't really provide more detail on how to implement this in your situation, but we use something similar in our data model to prevent improper usage. I also export and provide template storage classes to ease population from the message, for when clients do want to store the retrieved data.
It's usually a bad idea to define transport messages as structures. It's better to define a "normal" (programmer-friendly) struct plus a serializer/deserializer for it. To automate writing the serializer/deserializer, it's possible to define the structure with macros in a separate file and generate the typedef struct and the serializer/deserializer automatically (the Boost Preprocessor library may also help); a sketch of this idea is below.
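For illustration, a minimal sketch of that macro-driven generation using the classic X-macro technique. The field list, the Buffer type, and the write_int primitive are all hypothetical stand-ins for whatever the transport library provides:

/* Hypothetical primitives provided elsewhere by the transport library. */
typedef struct Buffer Buffer;
void write_int(Buffer *out, int value);

/* The single source of truth for the struct layout. */
#define PERSON_FIELDS \
    FIELD(int, height) \
    FIELD(int, age)

/* Generate the struct itself... */
typedef struct Person {
#define FIELD(type, name) type name;
    PERSON_FIELDS
#undef FIELD
} Person;

/* ...and a serializer that can never drift out of sync with it. */
void serialize_person(const Person *p, Buffer *out)
{
#define FIELD(type, name) write_##type(out, p->name);
    PERSON_FIELDS
#undef FIELD
}

Adding a field to PERSON_FIELDS automatically updates both the struct and the serializer, which is exactly the property hand-written transport structs lack.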
I can't say it more simply than this: use Protocol Buffers. It will make your life so much easier in the long run.

Is this a good concept for sending serialized objects over Network?

I have a client and a server
I want to send objects from the client to the server
The objects must be sent bundled together in a "big packet" containing many objects
The objects could be in a random order
The number of objects is not fixed
There could be objects in the packet which are unknown to the server (so it needs to drop them)
I haven't got much experience with serialization.
I would prefer Boost's serialization framework (if that's possible with it)
I thought of the following concept
(incomplete pseudocode, inspired by C++, no specific Boost::Serialization code):
class SerializableObject
{
public:
    virtual int getIdentifier() = 0;
    virtual Archive serialize() = 0;
};

class SubclassA : public SerializableObject
{
public:
    int getIdentifier() { return 1; }
    Archive serialize() { ... }
    ...
};

class SubclassB : public SerializableObject
{
public:
    int getIdentifier() { return 2; }
    Archive serialize() { ... }
    ...
};
Now on Clientside:
void packAndSendData()
{
    archive << objectA.getIdentifier();
    archive << objectA;
    archive << objectB.getIdentifier();
    archive << objectB;

    myNetworkObject.sendData(archive);
}
On Serverside:
void receiveAndUnpackData()
{
    archive = myNetworkObject.receiveData();
    while(archive.containsObjects()) //possible?
    {
        int type = archive.deserializeNextObject();
        if(type == 1)
            SubclassA objectA = archive.deserializeNextObject();
        else if(type == 2)
            SubclassB objectB = archive.deserializeNextObject();
        else
            archive.dropNextObject(); //How to do this? Possible?
    }
}
So the questions are:
- Is this a good concept, or are there other possibilities?
- Is such a concept possible with Boost::Serialization?
- If not: are there other libs which could help implement the concept?
I've tried to compress the problem as much as possible and to give as much info as I could. Hope it is understandable what I meant and what I try to achieve.
If anyone has a better title for this question please fix the existing one, I had no idea of how describing this question with all of its aspects.
The approach you describe is a start, but have you thought about how you'd serialise references between objects, i.e. serialising an object graph? Also, you may need to think about data format versioning if your client and server can change out of sync with each other. It's not necessarily a simple problem.
Are there other libs which could help implementing the concept?
You could look at Google's Protocol Buffers project. It probably does what you want, and is language neutral.
Boost serialization is definitely a start; however, there are other considerations, such as:
Fragmentation - i.e. what happens if the "collection" of your objects cannot fit into one packet? How do you know at the other end that you've received all the data that represents the "collection"?
Platforms - if you have a homogeneous environment (say all Linux), this is not an issue, but if you have a mixture, this could be a problem (I don't believe Boost serialization's binary archives work across different byte orderings - I could be wrong here, need to check the docs). This applies to languages too (a point raised by Andy Johnson above) - with Boost serialization, you're tied to C++.
I would recommend that you look at one of the open-source messaging products, and specifically, if you want to send structures, something like OpenDDS. All the serialization etc. is handled nicely for you, and you get lots more features and functionality. Of course it's quite heavyweight, and slightly less performant than asio + serialization, but you have to do less work.
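On the Boost::Serialization question specifically: the polymorphic part of the concept is supported out of the box if you serialize through base-class pointers and export the derived types. A minimal sketch (class and member names are mine; note that Boost has no built-in way to skip unknown types, so the "drop unknown objects" requirement would still need custom framing):

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/base_object.hpp>
#include <boost/serialization/export.hpp>
#include <sstream>

class SerializableObject {
    friend class boost::serialization::access;
    template<class Archive>
    void serialize(Archive&, unsigned) {}
public:
    virtual ~SerializableObject() {}
};

class SubclassA : public SerializableObject {
    friend class boost::serialization::access;
    int m_payload = 0;
    template<class Archive>
    void serialize(Archive& ar, unsigned version) {
        ar & boost::serialization::base_object<SerializableObject>(*this);
        ar & m_payload;
    }
};

BOOST_CLASS_EXPORT(SubclassA) // registers the dynamic type with the archive

int main() {
    std::ostringstream buffer;
    {
        boost::archive::text_oarchive out(buffer);
        const SerializableObject* obj = new SubclassA;
        out << obj;                 // writes type info + payload
        delete obj;
    }
    std::istringstream input(buffer.str());
    boost::archive::text_iarchive in(input);
    SerializableObject* restored = 0;
    in >> restored;                 // reconstructs a SubclassA
    delete restored;
}

The explicit getIdentifier()/type-switch machinery from the pseudocode becomes unnecessary; the archive records and restores the dynamic type itself.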

Using xml to load objects. Which is the best approach?

TinyXML
I have an XML file that keeps a bunch of data that is loaded into objects. Right now, I have one giant method that parses the XML file and creates the appropriate objects depending on the contents of the file. This function is very large and imports lots of class definitions.
Would it be better for each class type to do its own loading from XML? That way the XML code is dispersed throughout my files and not in one location. The problem is that I need to pass each class the exact node inside the XML file that it should read from. Is this feasible? I'm using TinyXML, so I imagine each class can be passed the XML stream (an array containing the XML data, actually) and also the root element for that object, \images\fractal\traversal\, so it knows what it should be reading.
Then the saving would work the same way.
Which approach is best and more widely used?
I don't know anything about TinyXML, but I have been using that kind of class design with libxml2 for several years now and it has been working fine for me.
Serialization functions should be friends of the classes they serialize. If you want to serialize and deserialize to XML, you should write friend functions that perform this. You could even write custom ostream operator<<() functions that do this, but that becomes problematic if you want to aggregate objects. A better strategy is to define a mechanism that turns individual objects into Nodes in a DOM document, as sketched below.
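With TinyXML, that mechanism could look roughly like this (a minimal sketch; the Person class and its fields are hypothetical):

#include <string>
#include "tinyxml.h"

class Person {
public:
    Person(std::string name, int age) : name_(std::move(name)), age_(age) {}

    // Turn this object into a DOM node; the caller decides where to hang it.
    TiXmlElement* toXml() const {
        TiXmlElement* node = new TiXmlElement("person");
        node->SetAttribute("name", name_.c_str());
        node->SetAttribute("age", age_);
        return node;
    }

private:
    std::string name_;
    int age_;
};

// Usage: each object only builds its own subtree, and the document
// takes ownership of the node via LinkEndChild.
// TiXmlDocument doc;
// doc.LinkEndChild(person.toXml());
// doc.SaveFile("people.xml");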
I can think of an approach based on a factory that serves up objects based on a tag.
The difficulty here is not really how to decouple the deserialization of each object's content, but rather how to decouple the association between a tag and an object.
For example, let's say you have the following XML
<my_xml>
<bird> ... </bird>
</my_xml>
How do you know that you should build a Bird object from the content of the <bird> tag?
There are two approaches there:
1-to-1 mapping, e.g.: <my_xml> represents a single object and thus knows how to deserialize itself.
Collection: <my_xml> is nothing more than a loose collection of objects.
The first is quite obvious: you know what to expect and can use a regular constructor.
The problem in C++ is that you have static typing, which makes the second case more difficult, since you need virtual construction there.
Virtual construction can be achieved using prototypes though.
// Base class
class Serializable
{
public:
    virtual ~Serializable() {}
    virtual std::auto_ptr<XmlNode> serialize() const = 0;
    virtual std::auto_ptr<Serializable> deserialize(const XmlNode&) const = 0;
};

// Collection of prototypes
class Deserializer
{
public:
    static void Register(Tag tag, const Serializable* item)
    {
        GetMap()[tag] = item;
    }

    static std::auto_ptr<Serializable> Create(const XmlNode& node)
    {
        return GetMap()[node.tag()]->deserialize(node);
    }

private:
    typedef std::map<Tag, const Serializable*> prototypes_t;

    // Note: map::operator[] needs a non-const map, hence no const accessor.
    static prototypes_t& GetMap()
    {
        static prototypes_t _Map; // initialized on first use
        return _Map;
    }
};

// Example
class Bird : public Serializable
{
public:
    // auto_ptr return types are not covariant, so the base signature is kept.
    virtual std::auto_ptr<Serializable> deserialize(const XmlNode& node) const;
};

// In some cpp (bird.cpp is indicated); Register must be called from code,
// e.g. via a dummy static initializer:
const Bird myBirdPrototype;
const bool birdRegistered = (Deserializer::Register("bird", &myBirdPrototype), true);
Deserialization is always a bit messy in C++; dynamic typing really helps there :)
Note: this also works with streaming, but it is a bit more complicated to put in place safely. The problem with streaming is that you have to make sure not to read past your data, and to read all of your data, so that the stream is in a 'good' state for the next object :)