Generate operator== using Boost Serialization? - c++

Problem: I have a set of classes for which I have already implemented Boost serialization methods. Now I want to add an operator== to one class that contains many of the other classes as its members. This comparison should be straightforward: a deep, member-wise comparison.
Idea: Since the existing serialization methods already tell the compiler everything it needs to know, I wonder if this can be used to generate efficient comparison operators.
Approach 1: The simplest thing would be to compare strings containing serializations of the objects to be compared. This approach would probably be much slower at runtime than a handcrafted operator== implementation.
Approach 2: Implement a specialized boost serialization archive for comparisons. However, implementing this is much more complicated and time consuming than implementing either handcrafted operators or approach 1.

I did a similar thing recently for hashing of serializable types:
Hash an arbitrary precision value (boost::multiprecision::cpp_int)
Here I "abused" Boost Serialization to get a hash function for any Boost Multiprecision number type (that has a serializable backend).
The approach given in the linked answer is strictly more lightweight and much easier to implement than writing a custom archive type. Instead, I wrote a custom Boost Iostreams sink to digest the raw data.
#include <boost/functional/hash.hpp>
#include <boost/iostreams/categories.hpp>

namespace io = boost::iostreams;

struct hash_sink {
    hash_sink(size_t& seed_ref) : _ptr(&seed_ref) {}

    typedef char char_type;
    typedef io::sink_tag category;

    std::streamsize write(const char* s, std::streamsize n) {
        boost::hash_combine(*_ptr, boost::hash_range(s, s + n));
        return n;
    }

private:
    size_t* _ptr;
};
This is highly efficient because it doesn't store the archive anywhere in the process. (So it sidesteps the dreadful inefficiency of Approach 1)
Applying to Equality Comparison
Sadly, you can't easily do equality comparison in streaming mode (unless you can be sure that both streams can be consumed in tandem, which is a somewhat finicky assumption).
Instead you would probably use something like boost::iostreams::device::array_sink or boost::iostreams::device::back_insert_device to receive the raw data.
If memory usage is a concern you might want to compress it in memory (Boost Iostreams comes with the required filters for zlib/bzip2 too). But I guess this is not your worry - you would likely not be trying to avoid duplicating code in that case; you could just implement the comparison directly.
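To illustrate the core idea of Approach 1 without pulling in Boost, here is a minimal stdlib-only sketch: serialize both objects into byte buffers and compare the buffers. The byte_sink and Point types are hypothetical stand-ins for a real archive and a real serializable class, not Boost APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy "archive": appends the raw bytes of each member to a buffer.
struct byte_sink {
    std::vector<unsigned char> bytes;

    void write(const void* p, std::size_t n) {
        const unsigned char* c = static_cast<const unsigned char*>(p);
        bytes.insert(bytes.end(), c, c + n);
    }
};

struct Point {
    std::int32_t x;
    std::int32_t y;
    std::string  label;

    // Stand-in for a Boost-style serialize() member.
    void serialize(byte_sink& sink) const {
        sink.write(&x, sizeof x);
        sink.write(&y, sizeof y);
        std::size_t len = label.size();
        sink.write(&len, sizeof len);  // length prefix
        sink.write(label.data(), len); // string payload
    }
};

// Deep equality via the serialized representations.
bool equal_by_serialization(const Point& a, const Point& b) {
    byte_sink sa, sb;
    a.serialize(sa);
    b.serialize(sb);
    return sa.bytes == sb.bytes;
}
```

The cost of this sketch is exactly the "dreadful inefficiency" mentioned above: both objects are fully buffered before a single byte is compared.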

Related

C++ serialization of data-structures

I'm studying serializations in C++. What's the advantage/difference of boost::serialization if compared to something like:
ifstream_obj.read(reinterpret_cast<char *>(&obj), sizeof(obj)); // read
// or
ofstream_obj.write(reinterpret_cast<const char *>(&obj), sizeof(obj)); // write
// ?
and, which one is better to use?
The big advantages of Boost Serialization are:
it actually works for non-trivial, non-POD data types (C++ is not C)
it allows you to decouple serialization code from archive backend, thereby giving you text, xml, binary serialization
If you use the proper archive you can even have portability (try that with your sample). This means you can send on one machine/OS/version and receive on another without problems.
Lastly, it adds (a) layer(s) of abstraction which make things a lot less error prone. Granted, you could have done the same for your suggested serialization approach without much issue.
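As a quick stdlib-only illustration of the first point: raw read/write of sizeof(obj) bytes cannot work for non-POD types, because the object's footprint doesn't contain its data. The function below just demonstrates the size relationship; it is not part of any library API.

```cpp
#include <string>

// Writing sizeof(obj) raw bytes of a std::string would capture its
// internal bookkeeping (pointer/size/capacity), not the characters
// living on the heap.
bool raw_write_would_miss_heap_data() {
    std::string short_str = "hi";
    std::string long_str(1000, 'x');
    // The object footprint is identical regardless of content,
    // and the content doesn't even fit into that footprint:
    return sizeof(short_str) == sizeof(long_str)
        && long_str.size() > sizeof(long_str);
}
```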
Here's an answer that does the kind of serialization you suggest but safely:
How to pass class template argument to boost::variant?
Note that Boost Serialization is fully aware of bitwise serializable types and you can tell it about your own too:
Boost serialization bitwise serializability

How to perform flexible serialization of a polymorphic inheritance hierarchy?

I have tried to read carefully all the advice given in the C++FAQ on this subject. I have implemented my system according to item 36.8 and now after few months (with a lot of data serialized), I want to make changes in both public interface of some of the classes and the inheritance structure itself.
class Base
{
public:
    Vector field1() const;
    Vector field2() const;
    Vector field3() const;
    virtual std::string name() const { return "Base"; }
};

class Derived : public Base
{
public:
    virtual std::string name() const { return "Derived"; }
};
I would like to know how to make changes such as:
Split Derived into Derived1 and Derived2, while mapping the original Derived into Derived1 for existing data.
Split Base::field1() into Base::field1a() and Base::field1b() while mapping field1 to field1a and having field1b empty for existing data.
I will have to
deserialize all the gigabytes of my old data
convert them to the new inheritance structure
reserialize them in a new and more flexible way.
I would like to know how to make the serialization more flexible, so that when I decide to make some change in the future, I would not be facing conversion hell like now.
I thought of making a system that would use numbers instead of names to serialize my objects. That is for example Base = 1, Derived1 = 2, ... and a separate number-to-name system that would convert numbers to names, so that when I want to change the name of some class, I would do it only in this separate number-to-name system, without changing the data.
The problems with this approach are:
The system would be brittle. That is, changing anything in the number-to-name system would possibly change the meaning of gigabytes of data.
The serialized data would lose some of its human readability, since in the serialized data, there would be numbers instead of names.
I am sorry for putting so many issues into one question, but I am inexperienced at programming and the problem I am facing seems so overwhelming that I just do not know where to start.
Any general materials, tutorials, idioms or literature on flexible serialization is most welcomed!
It's probably a bit late for that now, but whenever designing
a serialization format, you should provide for versioning.
This can be mangled into the type information in the stream, or
treated as a separate (integer) field. When writing the class
out, you always write the latest version. When reading, you
have to read both the type and the version before you can
construct; if you're using the static map suggested in the FAQ,
then the key would be:
struct DeserializeKey
{
    std::string type;
    int version;

    // needed so the key can be used in the static map
    bool operator<(const DeserializeKey& rhs) const
    {
        return type < rhs.type
            || (type == rhs.type && version < rhs.version);
    }
};
Given the situation you are in now, the solution is probably to
mangle the version into the type name in a clearly recognizable
way, say something along the lines of
type_name__version; if the
type_name isn't followed by two underscores,
then use version 0. This isn't the most efficient method, but it's
usually acceptable, and will solve the problem with backwards
compatibility, while providing for evolution in the future.
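The mangling scheme above can be sketched in a few lines; this assumes the version suffix is numeric and that the last "__" in the name is the separator (a hypothetical helper, not part of any library):

```cpp
#include <string>
#include <utility>

// Parse "type_name__version"; names with no "__" suffix are treated
// as legacy, unversioned names and map to version 0.
std::pair<std::string, int> parse_mangled_type(const std::string& mangled) {
    std::string::size_type pos = mangled.rfind("__");
    if (pos == std::string::npos)
        return std::make_pair(mangled, 0);
    return std::make_pair(mangled.substr(0, pos),
                          std::stoi(mangled.substr(pos + 2)));
}
```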
For your precise questions:
In this case, Derived is just a previous version of
Derived1. You can insert the necessary factory function into
the map under the appropriate key.
This is just classical versioning. Version 0 of Base has
a field1 attribute; when you deserialize it, you use it to
initialize field1a, and you initialize field1b empty.
Version 1 of Base has both.
If you mangle the version into the type name, as I suggest
above, you shouldn't have to convert any existing data. Long
term, of course, either some of the older versions simply
disappear from your data sets, so that you can remove the
support for them, or your program keeps getting bigger, with
support for lots of older versions. In practice, I've usually
seen the latter.
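A minimal sketch of such a versioned factory map might look like the following. The Base/Derived1 types and the make_registry helper are hypothetical; the point is that the legacy name "Derived" (version 0) and the new name "Derived1" both resolve to the same concrete type.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Base {
    virtual ~Base() {}
    virtual std::string name() const = 0;
};

struct Derived1 : Base {
    std::string name() const { return "Derived1"; }
};

typedef std::unique_ptr<Base> (*Factory)();
typedef std::map<std::pair<std::string, int>, Factory> Registry;

std::unique_ptr<Base> make_derived1() {
    return std::unique_ptr<Base>(new Derived1);
}

// Old data ("Derived", version 0) and new data ("Derived1", version 1)
// deserialize through the same factory.
Registry make_registry() {
    Registry r;
    r[std::make_pair(std::string("Derived"), 0)]  = &make_derived1;
    r[std::make_pair(std::string("Derived1"), 1)] = &make_derived1;
    return r;
}
```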
Maybe Apache Thrift can help you do this.

How to aggregate values in C++ using keys?

In C++ how can I aggregate values for a struct based on three keys?
In Perl I would do this using a hash of hashes (e.g. something like $hash{$key1}{$key2}{$key3}{'call_duration'} += 25);
As I'm a complete newbie to C++ could you please suggest a suitable approach?
I've had a look at topics on SO discussing nested hash equivalents in C++ using std::map, however it states that this is slow performance-wise and as I need to process records for a telecom operator, performance is critical.
It's not necessary that I follow an approach that uses the template library or anything that should resemble Perl in syntax and mindset, but if you have had to do something similar, could you please share a fast and suitable way to implement it?
I'm mostly limited to the C++ 98 standard (the technical lead has allowed the use of newer features provided that they are supported by the compiler and they are of critical benefit).
Apologies if the description is muddled and thanks in advance!
edit: The compiler version is GCC 4.1.2, importing tr1/functional as a library isn't frowned upon by it.
edit: Thanks very much to everyone that joined, in particular to Bartek and Rost for putting up with my stupid questions. I decided to choose Rost's answer as it's what I was actually able to get to work! :)
A plain std::map should be suitable; its performance is usually not a problem for most cases. A hash provides constant-time access to elements and a tree-based map logarithmic time, but in reality the constant may exceed the logarithm - it depends on the specific implementation and the specific data. If you fill the container once and then only update data without inserting/deleting/changing keys, you could use a sorted std::vector or Loki::AssocVector.
First try std::map (or std::set if the key is actually part of the data), and only then decide whether it is too slow for you. Example:
// Composite key definition
struct CompositeKey
{
    int key1;
    std::string key2;
    AnotherType key3;

    CompositeKey(int i_key1, const std::string& i_key2, AnotherType i_key3)
        : key1(i_key1), key2(i_key2), key3(i_key3)
    {}

    // You must define your own less-than operator for ordering keys, e.g.:
    bool operator<(const CompositeKey& i_rhs) const
    {
        if (key1 != i_rhs.key1) return key1 < i_rhs.key1;
        if (key2 != i_rhs.key2) return key2 < i_rhs.key2;
        return key3 < i_rhs.key3;
    }
};

// Usage
std::map<CompositeKey, Data> aggrData;
aggrData[CompositeKey(0, "KeyString", AnotherType())] = Data();
if (aggrData.find(CompositeKey(0, "KeyString", AnotherType())) != aggrData.end())
{
    // Process found data
}
For further performance research you could try:
hash_map/hash_set (named stdext::hash_map in MSVC++ and __gnu_cxx::hash_map in GCC)
boost::unordered_map/boost::unordered_set
Loki::AssocVector
std::unordered_map/std::unordered_set - for C++11 only
All these containers have similar interface so it will not be difficult to encapsulate it and easily switch implementation if required.
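The Perl-style += aggregation from the question then falls out naturally from operator[], which default-constructs the value on first access. A minimal sketch, with a hypothetical CallStats record standing in for the real per-call data:

```cpp
#include <map>
#include <string>

struct CallStats {
    long call_duration;
    long call_count;
    CallStats() : call_duration(0), call_count(0) {}
};

// Three keys folded into one composite map key - the equivalent of
// Perl's $hash{$k1}{$k2}{$k3}{'call_duration'} += 25.
struct Key3 {
    std::string k1, k2, k3;
    Key3(const std::string& a, const std::string& b, const std::string& c)
        : k1(a), k2(b), k3(c) {}
    bool operator<(const Key3& rhs) const {
        if (k1 != rhs.k1) return k1 < rhs.k1;
        if (k2 != rhs.k2) return k2 < rhs.k2;
        return k3 < rhs.k3;
    }
};

typedef std::map<Key3, CallStats> Aggregate;

void add_call(Aggregate& agg, const Key3& key, long duration) {
    CallStats& s = agg[key]; // inserted zero-initialized if absent
    s.call_duration += duration;
    s.call_count += 1;
}
```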
The simple solution is to use struct aggregating the 3 keys, and use it as key.
struct Key
{
    Type1 Key1;
    Type2 Key2;
    Type3 Key3;
    // I forgot about the comparator - you have to provide it explicitly
};
Because you are somewhat limited in language version, check whether your compiler supports hash_map (a pre-standard extension: __gnu_cxx::hash_map on GCC, stdext::hash_map on MSVC):
hash_map<Key, TValue> Data;
If not, you could always use boost::unordered_map.
If someone else stumbles on the same problem, the "proper" C++11 solution, however, is this:
std::unordered_map<std::tuple<Type1, Type2, Type3>, TValue>;
(note that std::hash has no specialization for std::tuple, so you must still supply a hash functor for the tuple key yourself, e.g. boost::hash).
EDIT: Sample usage
struct Key
{
    int Int;
    float Float;
    string String;
    // add ctor and operator<
};
std::hash_map<Key, int> Data;
Data[Key(5, 3.5f, "x")] = 10;

A collection of custom structures (a wrapper) with a single member (also a custom structure), to a collection of the single members

The problem is specific but the solution open ended. I'm a lone coder looking to bat some ideas around with some fellow programmers.
I have a wrapper for a maths library. The wrapper provides the system with a consistent interface, while allowing me to switch in/out math libraries for different platforms. The wrapper contains a single member, so say for my Matrix4x4 wrapper class there is an api_matrix_4x4 structure as the only member to the wrapper.
My current target platform has a nifty little optimised library, with a few of those nifty functions requiring a C-style array of the wrapper's embedded member, while my wrapper functions for those math API functions don't want to expose that member type to the rest of the system. So we have a collection of wrappers (by reference/pointer) going into the function, and the members of the wrappers being needed in a collection inside the function, so they can be passed to the math API.
I'm predominantly using C++, including C++11 features, & can also go C-style. Ideally I want a no-exception solution, & to avoid as many, if not all dynamic allocations. My wrapper functions can use standard library arrays or vectors, or C-style pointers to arrays as parameters, & whatever is necessary internally, just no dynamic casting (Run-Time Type Information).
1) Can I cast a custom struct/class containing a single custom struct, to the custom struct? If so, what about if it was a standard library collection of them. I'm thinking about type slicing here.
2) Would you perhaps use a template to mask the type passed to the function, although the implementation can only act on a single type (based on the math API used), or is such usage of templates considered bad?
3) Can you think of a nifty solution, perhaps involving swaps/move semantics/emplacement? If so, please help by telling me about it.
4) Or am I resigned to the obvious, iterate through one collection, taking the member out into another, then using that for the API function?
Example of what I am doing by the wrapper struct & wrapper function signature, & example of what I am trying to avoid doing is given by the function implementation:
struct Vector3dWrapper
{
    API_Specific_Vector_3d m_api_vector_3d;

    inline void operation_needing_vector_3d_wrappers(std::vector<Vector3dWrapper>& vectors)
    {
        // Now need a collection of API_Specific_Vector_3ds
        try
        {
            std::vector<API_Specific_Vector_3d> api_vectors;
            api_vectors.reserve(vectors.size());
            for (auto vectors_itr = vectors.begin(); vectors_itr != vectors.end(); ++vectors_itr)
            {
                // fill each wrapper's m_api_vector_3d into api_vectors
                api_vectors.push_back(vectors_itr->m_api_vector_3d);
            }
            // Signature is API_Multiply_Vectors_With_Matrix_And_Project(API_Specific_Vector_3d* vectors, size_t vector_count)
            API_Multiply_Vectors_With_Matrix_And_Project(api_vectors.data(), api_vectors.size());
        }
        catch (std::bad_alloc& e)
        {
            // handle... though in reality, try/catch is done elsewhere in the system.
        }
    }
};
You can cast a standard-layout struct (such as a struct compatible with C) to its first member, but what's the point? Just access the first member and apply &.
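A minimal sketch of that first-member cast, with hypothetical stand-ins for the API type and the wrapper (the static_assert documents the assumption that the wrapper adds no footprint, which holds for a standard-layout struct with a single data member on common implementations):

```cpp
// Hypothetical stand-ins for the math API type and its wrapper.
struct ApiVector3 { float x, y, z; };

struct Vector3Wrapper {
    ApiVector3 m; // the wrapper's only data member
};

// Standard layout + single member: the wrapper's address is the
// member's address, so the cast below is well-defined.
static_assert(sizeof(Vector3Wrapper) == sizeof(ApiVector3),
              "wrapper adds no footprint");

ApiVector3* as_api(Vector3Wrapper* w) {
    return reinterpret_cast<ApiVector3*>(w);
}
```

Because the sizes match, this also extends in practice to whole arrays of wrappers, which is exactly what the C-style-array API wants.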
Templates usually allow uniform parameterization over a set of types. You can write a template that's only instantiated once, but again that seems pointless. What you really want is a different interface library for each platform. Perhaps templates could help define common code shared between them. Or you could do the same in plain C by setting typedefs before #include.
Solution to what? The default copy and move semantics should work for flat, C-style structs containing numbers. As for deep copies, if the underlying libraries have pointer-based structures, you need to be careful and implement all the semantics you'll need. Safe… simple… default… "nifty" sounds dirty.
Not sure I understand what you're doing with collections. You mean that every function requires its parameters to be first inserted into a generic container object? Constructing containers sounds expensive. Your functions should parallel the functions in the underlying libraries as well as possible.

C++ way to serialize a message?

In my current project I have a few different interfaces that require me to serialize messages into byte buffers. I feel like I'm probably not doing it in a way that would make a true C++ programmer happy (and I'd like to).
I would typically do something like this:
struct MyStruct {
    uint32_t x;
    uint64_t y;
    uint8_t z[80];
};
uint8_t* serialize(const MyStruct& s) {
    uint8_t* buffer = new uint8_t[sizeof(s)];
    uint8_t* temp = buffer;
    memcpy(temp, &s.x, sizeof(s.x));
    temp += sizeof(s.x);
    // would also have put in network byte order...
    ... etc ...
    return buffer;
}
Excuse any typos, that was just an example off the top of my head. Obviously it can get more complex if the structure I'm serializing has internal pointers.
So, I have two questions that are closely related:
Is there any problem in the specific scenario above with serializing by casting the struct directly to a char buffer assuming I know that the destination systems are in the same endianness?
Main question: Is there a better... erm... C++? way to do this aside from the addition of smart pointers? I feel like this is such a common problem that the STL probably handles it - and if it doesn't, I'm sure there's a better way to do it anyway using C++ mechanisms.
EDIT Bonus points if you can do a clean example of serializing this structure in a better way using standard C++/STL without added libraries.
You might want to take a look at Google Protocol Buffers (also known as protobuf). You define your data in a language neutral IDL and then run it through a generator to generate your C++ classes. It'll take care of byte ordering issues and can provide a very compact binary form.
By using this you'll not only be able to save your C++ data but it'll be useable in other languages (C#, Java, Python etc) as there are protobuf implementation available for them.
You should probably use either Boost.Serialization or streams directly. More information in either of the links below.
Is it possible to serialize and deserialize a class in C++?
I just voted AzP's reply as the answer; checking Boost first is the way to go.
In addition, about your code sample:
1 - changing the signature of your serialization function to a method taking a file:
void MyStruct::serialize(FILE* file) // or stream
{
    int size = sizeof(*this); // sizeof(this) would be the pointer size, not the struct size
    fwrite(&size, sizeof(int), 1, file); // write size
    fwrite(this, 1, size, file);         // write raw bytes of struct
}
reduces the necessity to copy the struct.
2 - yes, your code makes the serialized bytes dependent on your platform, compiler and compiler settings. This is neither good nor bad: if the same binary writes and reads the serialized bytes, it might be beneficial because of simplicity and performance. But it is not only endianness; packing and struct layout also affect compatibility. For instance, a 32-bit and a 64-bit build of your app will almost certainly lay out your struct differently. Finally, serializing the raw footprint also serializes padding bytes - the bytes the compiler might put between struct fields - an overhead undesirable for high-traffic network streams (see Google Protocol Buffers, which hunt down every bit they can save).
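As a stdlib-only illustration of taking explicit control of one of those issues - byte order - fixed-width integers can be written field by field in network (big-endian) order, independent of host endianness and padding. These helpers are a sketch, not a library API:

```cpp
#include <cstdint>
#include <vector>

// Append a uint32_t to the buffer in big-endian (network) byte order.
void put_u32_be(std::vector<uint8_t>& buf, uint32_t v) {
    buf.push_back(static_cast<uint8_t>(v >> 24));
    buf.push_back(static_cast<uint8_t>(v >> 16));
    buf.push_back(static_cast<uint8_t>(v >> 8));
    buf.push_back(static_cast<uint8_t>(v));
}

// Read a big-endian uint32_t back, regardless of host endianness.
uint32_t get_u32_be(const uint8_t* p) {
    return (static_cast<uint32_t>(p[0]) << 24) |
           (static_cast<uint32_t>(p[1]) << 16) |
           (static_cast<uint32_t>(p[2]) << 8)  |
            static_cast<uint32_t>(p[3]);
}
```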
EDIT:
I see you added "embedded". Yes, in that case such simple serialize/deserialize methods (the deserialize mirroring the serialize above) might be a good and simple choice.