C++ network serialization [closed]

Closed. This question does not meet Stack Overflow guidelines and is not accepting answers. Closed 5 years ago.
I'm looking for a solution for serializing C++ packets to a network stream.
I have seen many posts here referring people to:
ACE
Google Protocol Buffers
Boost::Serialization
Qt::QDataStream
My Requirements/ Constraints:
The solution must be agnostic to little-endian/big-endian byte order, machine architecture (x86/x64), and platform.
The footprint (RAM & ROM) of the first three solutions is too big for my platform, and the fourth conflicts with the next requirement.
The solution must not require a lot of boilerplate code (there will be 200+ packet types to serialize).
Thanks,
Koby Meir

If you find that Google Protocol Buffers is too heavy (I can agree with that, because the compiled library can take more than 1 MB), you could try the lite version of protobuf, which is a few times smaller.
It can be enabled in *.proto files by inserting the following line:
option optimize_for = LITE_RUNTIME;
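A minimal sketch of where that option goes, in a hypothetical .proto file (with the lite runtime you link against libprotobuf-lite and give up reflection/descriptors):

// packet.proto (hypothetical file)
option optimize_for = LITE_RUNTIME;   // generate code for libprotobuf-lite

message Packet {
  optional int32 id = 1;
  optional bytes payload = 2;
}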
But if you need a protobuf solution with minimal overhead, I would go with protobuf-c,
a C implementation of protobuf.
It will be a little harder to use, but the binary code size overhead should be minimal (30-50 KB). I know this C implementation is used, for example, by umurmur, a voice server that runs very well on embedded Linux ARM and MIPS routers.

Another thought: Serialize as text, parsing back on the other side.
We do this a lot (TCP, UDP, serial, other protocols). There is tremendous precedent for this approach, for example in robotic control systems, Lab Information Management Systems, and any other place where "hook-up" among many vendors is helpful: everybody just sends plain text (ASCII or UTF-8), and it's easy to read, easy to debug, easy to reverse-engineer, and easy to repair/hook into. (If you want it opaque, you can encrypt your payload, for example with public/private keys.)
This fits your endianness-agnostic, cross-platform requirement. We've used XML, which works fine and is fairly "standard", with some reference ontology for what you want as (possibly nested) "Key=Value" values, but we tend to prefer an INI-style format, using "named sections" as it gets more complicated. If you "nest" lots of stuff, you might look at JSON-like implementations (it's really easy).
Strong vote for ASCII, IMHO.
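For illustration only, a rough sketch of the key=value style described above (the field names and the blank-line record separator are made up here, not any standard):

#include <sstream>
#include <string>

// Hypothetical packet: one "key=value" pair per line, blank line ends the record.
struct Reading { int id; double temperature; std::string unit; };

std::string to_text(const Reading& r) {
    std::ostringstream out;
    out << "id=" << r.id << '\n'
        << "temperature=" << r.temperature << '\n'
        << "unit=" << r.unit << '\n'
        << '\n';   // record separator
    return out.str();
}

The receiving side splits each line on '=' and parses the values back, so there is no endianness to worry about.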

Wow, ACE, Boost serialization ... these frameworks have serialization built in, but looking at them for serialization alone is like buying a car because you need a CD player. SUN/DEC RPC uses a format called XDR, described very well in Stevens' "UNIX Network Programming"; essentially it is a ~1000-LOC header/C file that you can use on its own. CORBA uses a similar format (CDR) underneath. Just look for "xdr.h" on Google Code; there are plenty of OSS implementations.
If you still need something more sophisticated, I would find ASN.1 to be the most comprehensive solution. Granted, it is a little more complicated than most applications need, but ASN.1 compilers generate compact code. It is used mostly by telecom stacks, in cell phones, GSM messaging, etc. ASN.1 is also used to encode RSA keys.
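To give a feel for the XDR API, here is a minimal in-memory encode/decode sketch (assuming a Unix-style <rpc/xdr.h>; the header location and linking details vary by platform):

#include <rpc/xdr.h>   /* classic Sun RPC XDR routines */

int xdr_roundtrip(void) {
    char buf[64];
    int out = 12345, in = 0;

    XDR enc;
    xdrmem_create(&enc, buf, sizeof(buf), XDR_ENCODE);
    xdr_int(&enc, &out);    /* writes a 4-byte big-endian value */
    xdr_destroy(&enc);

    XDR dec;
    xdrmem_create(&dec, buf, sizeof(buf), XDR_DECODE);
    xdr_int(&dec, &in);     /* reads it back on any architecture */
    xdr_destroy(&dec);

    return in;              /* 12345 regardless of host endianness */
}

The same xdr_* filter functions handle both directions, which is what keeps the per-type boilerplate small.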

I think you'll have to roll your own final solution, but there are a few building blocks you can use:
Boost::Spirit (Karma to serialize, Qi to deserialize)
that might still be too big for you, in which case you'll have to roll your own serialization/deserialization, though you could use the Barton-Nackman idiom to keep the code readable and still use a simple serialize function
a lightweight communications layer, such as Arachnida (which you could use just for communication, without using its HTTP facilities). Arachnida builds on OpenSSL. If that's too heavy for you too, you really should roll your own.
Ignorance/knowledge of little/big-endianness would pretty much be in the hands of your serialization code.
Good luck

I implemented something like this using the same technique Microsoft used in MFC. Basically you implement methods in each serializable class to serialize and deserialize
whatever you want saved from that class. It's very lightweight and fast.
Instead of the 'archive' class Microsoft used, I used a binary stream class.
My requirements didn't need endian awareness, but it would be
trivial to add if you implement a base class that serializes POD variables.
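A minimal sketch of that pattern; the BinaryStream class and the Packet fields are invented here for illustration, and endian handling is left out, as noted above:

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Invented stand-in for the binary stream class mentioned above.
class BinaryStream {
    std::vector<char> buf_;
    std::size_t pos_ = 0;
public:
    void write(const void* p, std::size_t n) {
        const char* c = static_cast<const char*>(p);
        buf_.insert(buf_.end(), c, c + n);
    }
    void read(void* p, std::size_t n) {
        std::memcpy(p, buf_.data() + pos_, n);
        pos_ += n;
    }
};

// Each serializable class knows how to write and read its own members.
struct Packet {
    std::uint32_t id = 0;
    std::string name;

    void serialize(BinaryStream& s) const {
        s.write(&id, sizeof id);                                  // POD member
        std::uint32_t len = static_cast<std::uint32_t>(name.size());
        s.write(&len, sizeof len);                                // length prefix
        s.write(name.data(), len);                                // raw bytes
    }
    void deserialize(BinaryStream& s) {
        s.read(&id, sizeof id);
        std::uint32_t len = 0;
        s.read(&len, sizeof len);
        name.resize(len);
        if (len) s.read(&name[0], len);
    }
};

Endianness handling would go into BinaryStream (or a POD-serializing base class), leaving the per-packet methods untouched.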

Related

What are the approaches and trade-offs involved in a platform-independent representation?

Before saying anything I have to say that, although I'm an experienced Java programmer, I'm rather new to C/C++ programming.
I have to save a binary file in a format that makes it accessible from different operating systems & platforms. It should be very efficient because I have to deal with a lot of data. What approaches should I investigate for that? What are the main advantages and disadvantages?
Currently I'm thinking about using network byte order (something like htonl, which is available both under Unix and Windows). Is there a better way?
Network order (big-endian) is something of a de facto standard. However, if your program will be used mostly on x86 (which is little-endian), you may want to stick with that for performance reasons (the protocol will still be usable on big-endian machines, but they will bear the performance impact instead).
Besides htonl (which converts 32-bit values), there's also htons (16-bit) and bswap_64 (non-standard, for 64-bit values).
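A minimal sketch of the field-by-field conversion this implies (the headers assume a POSIX system; on Windows the same functions live in winsock2.h):

#include <arpa/inet.h>   // htonl/ntohl/htons/ntohs on POSIX
#include <cstdint>
#include <cstring>

struct Record { std::uint32_t id; std::uint16_t flags; };

// Write each field in network (big-endian) order into a byte buffer.
void encode(const Record& r, unsigned char out[6]) {
    std::uint32_t id    = htonl(r.id);
    std::uint16_t flags = htons(r.flags);
    std::memcpy(out,     &id,    sizeof id);
    std::memcpy(out + 4, &flags, sizeof flags);
}

void decode(const unsigned char in[6], Record& r) {
    std::uint32_t id;
    std::uint16_t flags;
    std::memcpy(&id,    in,     sizeof id);
    std::memcpy(&flags, in + 4, sizeof flags);
    r.id    = ntohl(id);
    r.flags = ntohs(flags);
}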
If you want a binary format, but you'd like to abstract away some of the details to ease serialization and deserialization, consider Protocol Buffers or Thrift. Protocol Buffers are updatable (you can add optional or repeated (0 or more) fields to the schema without breaking existing code); not sure about Thrift.
However, before premature optimization, consider whether parsing is really the bottleneck. If reading every line of the file will require a database query or calculation, you may be able to use a more readable format without any noticeable performance impact.
I think there are a couple of decent choices for this kind of task.
In most cases, my first choice would probably be Sun's (now Oracle's) XDR. It's used in Sun's implementation of RPC, so it's been pretty heavily tested for quite a while. It's defined in RFC 1832, so documentation is widely available. There are also libraries (portable and otherwise) that know how to convert to/from this format. On-wire representation is reasonably compact and conversion fairly efficient.
The big potential problem with XDR is that you do need to know what the data represents to decode it -- i.e., you have to (by some outside means) ensure that the sender and receiver agree on (for example) the definition of the structs they'll send over the wire, before the receiver can (easily) understand what's being sent.
If you need to create a stream that's entirely self-describing, so somebody can figure out what it contains based only on the content of the stream itself, then you might consider ASN.1. It's crufty and nasty in some ways, but it does produce self-describing streams, is publicly documented, and it's used pretty widely (though mostly in rather limited domains). There are a fair number of libraries that implement encoding and decoding. I doubt anybody really likes it much, but if you need what it does, it's probably the first choice, if only because it's already known and somewhat accepted.
My first choice for a situation like this would be ASN.1, since it gives you the flexibility of using whatever programming language you desire on either end, as well as being platform independent. It hides the endianness issues from you so you don't have to worry about them. One end can use Java while the other end uses C or C++ or C#. It also supports multiple encoding rules you can choose from depending on your needs. There is PER (Packed Encoding Rules) if the goal is making the encoding as small as possible, E-XER (Extended XML Encoding Rules) if you prefer to exchange information using XML, or DER (Distinguished Encoding Rules) if your application involves digital signatures or certificates. ASN.1 is widely used in telephony, but also in banking, automobiles, aviation, medical devices, and several other areas. It is mature, proven technology that has stood the test of time and continues to be adopted in new areas where communication between disparate machines and programming languages is needed.
An excellent resource where you can try ASN.1 free is http://asn1-playground.oss.com where you can play with some existing ASN.1 specifications, or try creating your own, and see what the various encoding rules produce.
There are some excellent books available as a free download from http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html where the first one is titled "ASN.1 — Communication Between Heterogeneous Systems".

IDL-like parser that turns a document definition into powerful classes? [closed]

Closed. This question does not meet Stack Overflow guidelines and is not accepting answers. Closed 8 years ago.
I am looking for an IDL-like (or whatever) translator which turns a DOM- or JSON-like document definition into classes which
are accessible from both C++ and Python, within the same application
expose document properties as ints, floats, strings, binary blobs and compounds: arrays and string dicts (both nestable); basically the JSON feature set
allow changes to be tracked to refresh views of an editing UI
provide a change history to enable undo/redo operations
can be serialized to and from JSON (can also be some kind of binary format)
allow keeping large data chunks on disk, with parts only loaded on demand
provide non-blocking thread-safe read/write access to exchange data with realtime threads
allow multiple editors in different processes (or even on different machines) to view and modify the document
The thing that comes closest so far is the Blender 2.5 DNA/RNA system, but it's not available as a separate library, and it is poorly documented.
I'm most of all trying to make sure that such a lib does not exist yet, so I know my time is not wasted when I start to design and write such a thing. It's supposed to provide a great foundation to write editing UI components.
ICE is the closest product I can think of. I don't know if you can do serialization to disk with ICE, but I can't think of a reason why you couldn't. The problem is that it costs $$$. I haven't personally negotiated a license with them, but ICE is the biggest player I know of in this domain.
Then you have Pyro for Python, which is Distributed Objects only.
Distributed Objects in Objective-C (N/A for iPhone/iPad Dev, which sucks IMHO)
There are some C++ distributed objects libraries but they're mostly dead and unusable (CORBA comes to mind).
I can tell you that there would be a lot of demand for this type of technology. I've been delving into some serialization and remote object stuff since off-the-shelf solutions can be very expensive.
As for open-source frameworks to help you develop in-house, I recommend boost::asio's strands for async thread-safe read/write and boost::serialization for serialization. I'm not terribly well-read in JSON tech but this looks like an interesting read.
I wish something freely available already existed for this networking/serialization glue that so many projects could benefit from.
SWIG doesn't meet all your requirements, but it does make interfacing C++ <-> Python a lot easier.

Looking for a disk-based B+ tree implementation in C++ or C [closed]

Closed. This question does not meet Stack Overflow guidelines and is not accepting answers. Closed 8 years ago.
I am looking for a lightweight open source paging B+ tree implementation that uses a disk file for storing the tree.
So far I have found only memory-based implementations, or something that has a dependency on Qt (?!) and does not even compile.
Modern C++ is preferred, but C will do too.
I prefer to avoid a full embeddable DBMS solution, because: 1) for my needs a bare-bones index that can use the simplest possible disk file organization is enough; there is no need for concurrency, atomicity and everything else. 2) I am using this to prototype my own index, and most likely will change some of the algorithms and storage layout. I want to do that with a minimum of effort. It's not going to be production code.
http://people.csail.mit.edu/jaffer/WB.
You can also consider re-using the B-Tree implementations from an open source embeddable database. (BDB, SQLite etc)
My own implementation is at http://www.die-schoens.de/prg. The license is Apache. It's disk-based and maps to shared memory, where it can also do locking (i.e. multi-user); the file format protects against crashes, etc. All of the above can easily be switched off (at compile time or runtime, as you like). So the bare-bones version is almost plain ANSI C, basically caching in your own memory and not locking at all. A test program is included. Currently it only deals with fixed-size fields, but I am working on that...
I second the suggestion for Berkeley DB. I used it before it was bought by Oracle. It is not a full relational database, but just stores key-value pairs. We switched to it after writing our own paging B-tree implementation. It was a good learning experience, but we kept adding features until it was just a (poorly) implemented version of BDB.
If you want to do it yourself, here is an outline of what we did. We used mmap to map pages into memory. The structure of each page was index-based, so with the page start address you could access any element on the page. Then we mapped and unmapped pages as necessary. We were indexing multi-GB text files, back when 1 GB of main memory was considered a lot.
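A rough sketch of the page-mapping part of that approach (POSIX mmap; the page size and function names are invented, and error handling is trimmed):

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

constexpr std::size_t kPageSize = 4096;   // invented page size

// Map one page of the index file into memory; the offset must be page-aligned.
void* map_page(int fd, std::uint64_t page_no) {
    return mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, static_cast<off_t>(page_no * kPageSize));
}

void unmap_page(void* p) { munmap(p, kPageSize); }

Each mapped page is then interpreted as a node whose slots are addressed by index relative to the page start, as described above.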
Faircom's C-Tree Plus has been available commercially for over 20 years (I don't work for them, etc.). FairCom
There is also Berkeley DB, which was bought by Oracle but is still free from their site.
I'm pretty sure it's not the solution you're looking for, but why don't you store the tree in a file yourself? All you need is an approach for serialization and an ifstream/ofstream.
Basically you could serialize it like this: go to the root, write '0' to your file, then a divider like '|', the number of elements in the root, and then all root elements. Repeat with '1' for level 1, and so on. As long as you don't change the level, keep the level index; an empty leaf could look like 2|0.
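For illustration, a rough sketch of the line format just described (level, divider, element count, then the keys; all details invented):

#include <fstream>
#include <vector>

// Write one node's keys as: "<level>|<count>|k1 k2 k3 ..."
void write_node(std::ofstream& out, int level, const std::vector<int>& keys) {
    out << level << '|' << keys.size() << '|';
    for (int k : keys) out << k << ' ';
    out << '\n';
}
// An empty leaf at level 2 then comes out as "2|0|", matching the "2|0" idea above.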
You could look at Berkeley DB; it's supported by Oracle, but it is open source and can be found here.
RogueWave, the software company, has a nice implementation of BTreeOnDisk as part of their Tools++ product. I've been using it since the late '90s. The nice thing about it is that you can have multiple trees in one file. But you do need a commercial license.
In their code they make a reference to a book by a guy called 'Ammeraal' (see http://home.planet.nl/~ammeraal/algds.html , Ammeraal, L. (1996) Algorithms and Data Structures in C++). He seems to have an implementation of a B-tree on disk, and the source code seems to be accessible online. I have never used it, though.
I currently work on projects for which I'd like to distribute the source code, so I need to find an open-source replacement for the Rogue Wave classes. Unfortunately I don't want to rely on GPL-type licenses; otherwise a solution would simply be to use 'libdb' or an equivalent. I need a BSD-type license, and for a long time I couldn't find anything suitable. But I will have a look at some of the links in earlier posts.

boost serialization vs google protocol buffers? [closed]

Closed. This question is opinion-based and is not accepting answers. Closed 5 years ago.
Does anyone with experience with these libraries have any comment on which one they preferred? Were there any performance differences or difficulties in using?
I've been using Boost Serialization for a long time and just dug into protocol buffers, and I think they don't have the exact same purpose. BS (didn't see that coming) saves your C++ objects to a stream, whereas PB is an interchange format that you read to/from.
PB's data model is way simpler: you get all kinds of ints and floats, strings, arrays, basic structure, and that's pretty much it. BS allows you to directly save all of your objects in one step.
That means that with BS you get more data on the wire, but you don't have to rebuild your whole object structure, whereas protocol buffers are more compact but there is more work to be done after reading the archive. As the names say, one is for protocols (language-agnostic, space-efficient data passing), the other is for serialization (no-brainer object saving).
So what is more important to you: speed/space efficiency or clean code?
I've played around a little with both systems, nothing serious, just some simple hackish stuff, but I felt that there's a real difference in how you're supposed to use the libraries.
With boost::serialization, you write your own structs/classes first, and then add the archiving methods, but you're still left with some pretty "slim" classes, that can be used as data members, inherited, whatever.
With protocol buffers, the amount of code generated for even a simple structure is pretty substantial, and the structs and code that are generated are more meant to be operated on; you use protocol buffers' functionality to transport data to and from your own internal structures.
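For reference, a minimal sketch of the boost::serialization "slim class plus archiving method" pattern described above (the class and member names are invented):

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/string.hpp>
#include <fstream>
#include <string>

class Sensor {
    friend class boost::serialization::access;
    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & id & name;          // the same function handles save and load
    }
public:
    int id = 0;
    std::string name;
};

void save(const Sensor& s, const char* path) {
    std::ofstream ofs(path);
    boost::archive::text_oarchive oa(ofs);
    oa << s;                     // loading mirrors this with text_iarchive and >>
}

The class stays a plain C++ object you can inherit from or use as a data member, which is the point being made above.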
There are a couple of additional concerns with boost.serialization that I'll add to the mix. Caveat: I don't have any direct experience with protocol buffers beyond skimming the docs.
Note that while I think boost, and boost.serialization, is great at what it does, I have come to the conclusion that the default archive formats it comes with are not a great choice for a wire format.
It's important to distinguish between versions of your class (as mentioned in other answers, boost.serialization has some support for data versioning) and compatibility between different versions of the serialization library.
Newer versions of boost.serialization may not generate archives that older versions can deserialize. (the reverse is not true: newer versions are always intended to deserialize archives made by older versions). This has led to the following problems for us:
Both our client & server software create serialized objects that the other consumes, so we can only move to a newer boost.serialization if we upgrade both client and server in lockstep. (This is quite a challenge in an environment where you don't have full control of your clients).
Boost comes bundled as one big library with shared parts. Because both the serialization code and other parts of boost (e.g. shared_ptr) may be in use in the same file, I can't upgrade any part of boost, since I can't upgrade boost.serialization. I'm not sure whether it's possible/safe/sane to attempt to link multiple versions of boost into a single executable, or whether we have the budget/energy to refactor the bits that need to remain on an older version of boost into a separate executable (a DLL in our case).
The old version of boost we're stuck on doesn't support the latest version of the compiler we use, so we're stuck on an old version of the compiler too.
Google seems to actually publish the protocol buffers wire format, and Wikipedia describes protocol buffers as forwards- and backwards-compatible (although I think Wikipedia is referring to data versioning rather than protocol buffer library versioning). While neither of these is a guarantee of forwards compatibility, it seems like a stronger indication to me.
In summary, I would prefer a well-known, published wire format like protocol buffers when I don't have the ability to upgrade client & server in lockstep.
Footnote: shameless plug for a related answer by me.
Boost Serialisation
is a library for writing data into a stream.
does not compress data.
does not support data versioning automatically.
supports STL containers.
properties of the data written depend on the streams chosen (e.g. endianness, compression).
Protocol Buffers
generates code from an interface description (supports C++, Python and Java by default; C, C# and others via third parties).
optionally compresses data.
handles data versioning automatically.
handles endian swapping between platforms.
does not support STL containers.
Boost serialisation is a library for converting an object into a serialised stream of data. Protocol Buffers do the same thing, but also do other work for you (like versioning and endian swapping). Boost serialisation is simpler for "small simple tasks". Protocol Buffers are probably better for "larger infrastructure".
EDIT 24-11-10: added "automatically" to the BS versioning point.
I have no experience with boost serialization, but I have used protocol buffers. I like protocol buffers a lot. Keep the following in mind (I say this with no knowledge of boost).
Protocol buffers are very efficient so I don't imagine that being a serious issue vs. boost.
Protocol buffers provide an intermediate representation that works with other languages (Python and Java... and more in the works). If you know you're only using C++, maybe boost is better, but the option to use other languages is nice.
Protocol buffers are more like data containers... there is no object oriented nature, such as inheritance. Think about the structure of what you want to serialize.
Protocol buffers are flexible because you can add "optional" fields. This basically means you can change the structure of protocol buffer without breaking compatibility.
Hope this helps.
boost.serialization just needs the C++ compiler and gives you some syntax sugar like
output_archive << object_to_save;
// ...
input_archive >> object_to_load;
for saving and loading. If C++ is the only language you use, you should give boost.serialization a serious shot.
I took a quick look at Google Protocol Buffers. From what I see, I'd say it's not directly comparable to boost.serialization. You have to add a compiler for the .proto files to your toolchain and maintain the .proto files themselves. The API doesn't integrate into C++ the way boost.serialization does.
boost.serialization does the job its designed for very well: to serialize C++ objects :)
OTOH, a query API like the one Google Protocol Buffers provides gives you more flexibility.
Since I only used boost.serialization so far I cannot comment on performance comparison.
A correction to the above (I guess this is that answer) about Boost Serialization:
It DOES support data versioning.
If you need compression - use a compressed stream.
It can handle endian swapping between platforms, as the encoding can be text, binary or XML.
I never implemented anything using boost's library, but I found Google protobufs to be more thought-out, and the code is much cleaner and easier to read. I would suggest having a look at the various languages you want to use it with, having a read through the code and the documentation, and making up your mind.
The one difficulty I had with protobufs was they named a very commonly used function in their generated code GetMessage(), which of course conflicts with the Win32 GetMessage macro.
I would still highly recommend protobufs. They're very useful.
I know that this is an older question now, but I thought I'd throw my 2 pence in!
With boost you get the opportunity to write some data validation into your classes; this is good because the data definition and the checks for validity are all in one place.
With GPB the best you can do is to put comments in the .proto file and hope against all hope that whoever is using it reads it, pays attention to it, and implements the validity checks themselves.
Needless to say, this is unlikely and unreliable if you're relying on someone else at the other end of a network stream to do this with the same vigour as yourself. Plus, if the constraints on validity change, multiple code changes need to be planned, coordinated and done.
Thus I consider GPB to be inappropriate for developments where there is little opportunity to regularly meet and talk with all team members.
==EDIT==
The kind of thing I mean is this:
message Foo
{
    int32 bearing = 1;
}
Now who's to say what the valid range of bearing is? We can have
message Foo
{
    int32 bearing = 1; // Valid between 0 and 359
}
But that depends on someone else reading this and writing code for it. For example, if you edit it and the constraint becomes:
message Foo
{
    int32 bearing = 1; // Valid between -180 and +180
}
you are completely dependent on everyone who has used this .proto updating their code. That is unreliable and expensive.
At least with Boost serialisation you're distributing a single C++ class, and that can have data validity checks built right into it. If those constraints change, then no one else need do any work other than making sure they're using the same version of the source code as you.
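As a sketch of what "checks built right into it" could look like (the Bearing class is invented, and the serialize member follows the boost::serialization style discussed elsewhere in this thread):

#include <stdexcept>

// The constraint lives next to the data, so every consumer of this class
// gets the same validation, whatever the transport.
class Bearing {
    int degrees_ = 0;
public:
    void set(int d) {
        if (d < 0 || d > 359)
            throw std::out_of_range("bearing must be 0..359");
        degrees_ = d;
    }
    int get() const { return degrees_; }

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & degrees_;
        if (degrees_ < 0 || degrees_ > 359)   // re-check after loading
            throw std::out_of_range("bearing out of range in archive");
    }
};

If the valid range changes, only this class changes; everyone else just recompiles against the new version.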
Alternative
There is an alternative: ASN.1. This is ancient, but it has some really, really handy things:
Foo ::= SEQUENCE
{
    bearing INTEGER (0..359)
}
Note the constraint. So whenever anyone consumes this .asn file and generates code, they end up with code that will automatically check that bearing is somewhere between 0 and 359. If you update the .asn file,
Foo ::= SEQUENCE
{
    bearing INTEGER (-180..180)
}
all they need to do is recompile. No other code changes are required.
You can also do:
bearingMin INTEGER ::= 0
bearingMax INTEGER ::= 360
Foo ::= SEQUENCE
{
    bearing INTEGER (bearingMin..<bearingMax)
}
Note the <. And also in most tools the bearingMin and bearingMax can appear as constants in the generated code. That's extremely useful.
Constraints can be quite elaborate:
Garr ::= INTEGER (0..10 | 25..32)
Look at Chapter 13 in this PDF; it's amazing what you can do;
Arrays can be constrained too:
Bar ::= SEQUENCE (SIZE(1..5)) OF Foo
Sna ::= SEQUENCE (SIZE(5)) OF Foo
Fee ::= SEQUENCE
{
    boo SEQUENCE (SIZE(1..<6)) OF INTEGER (-180<..<180)
}
ASN.1 is old-fashioned, but still actively developed, widely used (your mobile phone uses it a lot), and far more flexible than most other serialisation technologies. About the only deficiency I can see is that there is no decent code generator for Python. If you're using C/C++, C#, Java or Ada, then you are well served by a mixture of free (C/C++, Ada) and commercial (C/C++, C#, Java) tools.
I especially like the wide choice of binary and text-based wire formats. This makes it extremely convenient in some projects. The wire format list currently includes:
BER (binary)
PER (binary, aligned and unaligned; this is ultra bit-efficient. For example, an INTEGER constrained between 0 and 15 will take up only 4 bits on the wire)
OER
DER (another binary)
XML (also XER)
JSON (brand new, tool support is still developing)
plus others.
Note the last two? Yes, you can define data structures in ASN.1, generate code, and emit / consume messages in XML and JSON. Not bad for a technology that started off back in the 1980s.
Versioning is done differently to GPB. You can allow for extensions:
Foo ::= SEQUENCE
{
    bearing INTEGER (-180..180),
    ...
}
This means that at a later date I can add to Foo, and older systems that have this version can still work (but can only access the bearing field).
I rate ASN.1 very highly. It can be a pain to deal with (tools might cost money, the generated code isn't necessarily beautiful, etc.). But the constraints are a truly fantastic feature that has saved me a whole ton of heartache time and time again. It makes developers whinge a lot when the encoders/decoders report that they've generated duff data.
Other links:
Good intro
Open source C/C++ compiler
Open source compiler, does ADA too AFAIK
Commercial, good
Commercial, good
Try it yourself online
Observations
To share data:
Code first approaches (e.g. Boost serialisation) restrict you to the original language (e.g. C++), or force you to do a lot of extra work in another language
Schema first is better, but
A lot of these leave big gaps in the sharing contract (i.e. no constraints). GPB is annoying in this regard, because it is otherwise very good.
Some have constraints (e.g. XSD, JSON), but suffer patchy tool support.
For example, Microsoft's xsd.exe actively ignores constraints in xsd files (MS's excuse is truly feeble). XSD is good (from the constraints point of view), but if you cannot trust the other guy to use a good XSD tool that enforces them, then the worth of XSD is diminished.
JSON validators are OK, but they do nothing to help you form the JSON in the first place, and they aren't called automatically. There's no guarantee that someone sending you a JSON message has run it through a validator. You have to remember to validate it yourself.
ASN.1 tools all seem to implement the constraints checking.
So for me, ASN.1 does it. It's the one that is least likely to result in someone else making a mistake, because it's the one with the right features and where the tools all seemingly endeavour to fully implement those features, and it is language neutral enough for most purposes.
To be honest, if GPB added a constraints mechanism, that'd be the winner. XSD is close, but the tools are almost universally rubbish. If there were decent code generators for other languages, JSON Schema would be pretty good.
If GPB had constraints added (note: this would not change any of the wire formats), that'd be the one I'd recommend to everyone for almost every purpose. Though ASN.1's uPER is very useful for radio links.
As with almost everything in engineering, my answer is... "it depends."
Both are well-tested, vetted technologies. Both will take your data and turn it into something friendly for sending someplace. Both will probably be fast enough, and if you're really counting a byte here or there, you're probably not going to be happy with either (let's face it, the packets both create will be a small fraction of the size of XML or JSON).
For me, it really comes down to workflow and whether or not you need something other than C++ on the other end.
If you want to figure out your message contents first and you're building a system from scratch, use Protocol Buffers. You can think of the message in an abstract way and then auto-generate the code in whatever language you want (third-party plugins are available for just about everything). Also, I find collaboration simplified with Protocol Buffers: I just send over a .proto file and the other team has a clear idea of what data is being transferred. I also don't impose anything on them. If they want to use Java, go ahead!
If I already have built a class in C++ (and this has happened more often than not) and I want to send that data over the wire now, Boost Serialization obviously makes a ton of sense (especially where I already have a Boost dependency somewhere else).
You can use boost serialization in tight conjunction with your "real" domain objects and serialize the complete object hierarchy (inheritance). Protobuf does not support inheritance, so you will have to use aggregation. People argue that Protobuf should be used for DTOs (data transfer objects) and not for core domain objects themselves. I have used both boost::serialization and protobuf. The performance of boost::serialization should be taken into account; cereal might be an alternative.
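A minimal sketch of the inheritance support mentioned above (class names invented):

#include <boost/serialization/base_object.hpp>
#include <boost/serialization/string.hpp>
#include <string>

struct Shape {
    std::string name;
    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & name;
    }
};

struct Circle : Shape {
    double radius = 0.0;
    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        // Serialize the base-class part, then the derived members.
        ar & boost::serialization::base_object<Shape>(*this);
        ar & radius;
    }
};

With protobuf you would instead aggregate a Shape message inside a Circle message, as noted above.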

C++ Serialization Performance

I'm building a distributed C++ application that needs to do lots of serialization and deserialization of simple data structures that's being passed between different processes and computers.
I'm not interested in serializing complex class hierarchies, but more of sending structures with a few simple members such as number, strings and data vectors. The data vectors can often be many megabytes large.
I'm worried that text/XML-based ways of doing it are too slow, and I really don't want to write this myself, since problems like string encoding and number endianness can make it way more complicated than it looks on the surface.
I've been looking a bit at protocol buffers and boost.serialize. According to the documentation, protocol buffers seem to care a lot about performance.
Boost seems somewhat more lightweight in the sense that you don't have an external language for specifying the data format which I find quite convenient for this particular project.
So my question comes down to this: does anyone know if the boost serialization is fast for the typical use case I described above?
Also if there are other libraries that might be right for this, I'd be happy to hear about them.
I would strongly suggest protocol buffers. They're incredibly simple to use, offer great performance, and take care of issues like endianness and backwards compatibility. To make it even more attractive, serialized data is language-independent thanks to numerous language implementations.
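A minimal sketch of the round trip, assuming a hypothetical point.proto and the C++ code protoc generates from it:

// point.proto (hypothetical)
//   message Point { optional sint32 x = 1; optional sint32 y = 2; }
// Compiled with: protoc --cpp_out=. point.proto

#include "point.pb.h"
#include <string>

std::string roundtrip() {
    Point p;
    p.set_x(3);
    p.set_y(4);

    std::string wire;
    p.SerializeToString(&wire);   // compact, endian-neutral byte string

    Point q;
    q.ParseFromString(wire);      // works the same on any platform or language
    return wire;
}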
ACE and ACE TAO come to mind, but you might not like the size and scope of it.
http://www.cs.wustl.edu/~schmidt/ACE.html
Regarding your query about "fast" and boost: that is a subjective term, and without knowing your requirements (throughput, etc.) it is difficult to answer for you. Not that I have any benchmarks for the boost stuff myself...
There are messaging layers you can use, but those are probably slower than boost. I'd say that you identified a good solution in boost, but I've only used ACE and other proprietary communications/messaging products.
My guess is that boost is fast enough. I have used it in previous projects to serialize data to and from disk, and its performance never even came up as an issue.
My answer here talks about serialization in general, which may be helpful to you beyond which serialization library you choose to use.
Having said that, it looks like you know most of the main trouble spots with serialization (endianness, string encoding). You did leave out versioning and forwards/backwards compatibility. If time is not critical, I recommend writing your own serialization code. It is an enlightening experience, and the lessons you learn are invaluable. Though I will warn you, it will tend to make you hate XML-based protocols for their bloatedness. :)
Whichever path you choose good luck with your project.
Also check out ONC RPC (the old Sun RPC).
boost.serialization doesn't care about string encodings or endianness. You'll be similarly well off not using it if that matters to you.
You might want to look into ICE from ZeroC: http://www.zeroc.com/
It works similar to CORBA, except that it's entirely specced and defined by the company. The upside is that the implementations work as intended, since there aren't all that many. The downside is that if you're using a language they don't support, you're out of luck.
If you are only sending well-defined data structures, then perhaps you should be looking at ASN.1 as an encoding methodology?
There's also Thrift, which looks like an alpha project but is used and developed by Facebook, so it has at least a few users.
Or good old DCE, which was the standard MS decided to use for COM. It's now open source, 20 years too late, but better late than never.
Don't pre-emptively optimize. Measure first and optimize second.