Porting data serialization code from C++ linux/mac to C++ windows - c++

I have a software framework compiled and running successfully on both Mac and Linux. I am now trying to port it to Windows (using MinGW). So far, I have the software compiling and running under Windows, but it's inevitably buggy. In particular, reading data that was serialized on macOS (or Linux) into the Windows version of the program fails (segfaults).
The serialization process serializes values of primitive variables (longs, ints, doubles etc.) to disk.
This is the code I am using:
#include <iostream>
#include <fstream>
template <class T>
void serializeVariable(T var, std::ofstream &outFile)
{
outFile.write(reinterpret_cast<char *>(&var), sizeof(var));
}
template <class T>
void readSerializedVariable(T &var, std::ifstream &inFile)
{
inFile.read(reinterpret_cast<char *>(&var), sizeof(var));
}
So to save the state of a bunch of variables, I call serializeVariable for each variable in turn. Then to read the data back in, calls are made to readSerializedVariable in the same order in which they were saved. For example to save:
::serializeVariable<float>(spreadx,outFile);
::serializeVariable<int>(objectDensity,outFile);
::serializeVariable<int>(popSize,outFile);
And to read:
::readSerializedVariable<float>(spreadx,inFile);
::readSerializedVariable<int>(objectDensity,inFile);
::readSerializedVariable<int>(popSize,inFile);
But in Windows, this reading of serialized data is failing. I am guessing that Windows serializes data a little differently. I wonder if there is a way in which I could modify the above code so that data saved on any platform can be read on any other platform...any ideas?
Cheers,
Ben.

Binary serialization like this should work fine across those platforms. You do have to honor endianness, but that is trivial. I don't think these three platforms have any conflicts in this respect.
You really can't use such loosely specified types when you do, though. The sizes of int, float, and size_t can all change across platforms.
For integer types, use the exact-width types found in the cstdint header: uint32_t, int32_t, etc. Windows (older MSVC) doesn't have the header available iirc, but you can use boost/cstdint.hpp instead.
Floating point should work as most compilers follow the same IEEE specs.
C - Serialization of the floating point numbers (floats, doubles)
Binary serialization really needs thorough unit testing. I would strongly recommend investing the time.
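To make that concrete, here is a minimal sketch (the helper names are mine, not from the question) of writing and reading a uint32_t in a fixed little-endian byte order, independent of the host's endianness; signed values can be passed through the unsigned counterpart of the same width:
#include <cstdint>
#include <fstream>

// Sketch: write/read a uint32_t least-significant byte first, so the file
// layout is the same no matter which platform produced it.
void serializeU32(uint32_t var, std::ofstream &outFile)
{
    unsigned char bytes[4] = {
        static_cast<unsigned char>(var),
        static_cast<unsigned char>(var >> 8),
        static_cast<unsigned char>(var >> 16),
        static_cast<unsigned char>(var >> 24)
    };
    outFile.write(reinterpret_cast<char *>(bytes), sizeof(bytes));
}

void readSerializedU32(uint32_t &var, std::ifstream &inFile)
{
    unsigned char bytes[4];
    inFile.read(reinterpret_cast<char *>(bytes), sizeof(bytes));
    var = static_cast<uint32_t>(bytes[0])
        | (static_cast<uint32_t>(bytes[1]) << 8)
        | (static_cast<uint32_t>(bytes[2]) << 16)
        | (static_cast<uint32_t>(bytes[3]) << 24);
}
With the exact-width types spelled out like this, the on-disk size can no longer silently change between platforms.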

This is just a wild guess, sorry I can't help you more. My idea is that the byte order is different: big endian vs little endian. So anything larger than one byte will be messed up when loaded on a machine that has the order reversed.
For example, I found this piece of code on MSDN:
int isLittleEndian() {
    long int testInt = 0x12345678;
    char *pMem;
    pMem = (char *) &testInt;   /* inspect the value's first byte in memory */
    if (pMem[0] == 0x78)
        return(1);
    else
        return(0);
}
I guess you will get different results on Linux vs Windows. The best case would be if there is a flag option for your compiler(s) to use one format or the other; just set it to be the same on all machines.
Hope this helps,
Alex

Just one more wild guess:
you forgot to open the file in binary mode. On Windows, file streams opened in text mode convert the byte sequence 13,10 to 10.
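A minimal sketch of what the open calls should look like (the file name is just a placeholder):
#include <fstream>

// std::ios::binary suppresses the CR/LF translation Windows applies in text mode.
std::ofstream outFile("state.bin", std::ios::binary);
std::ifstream inFile("state.bin", std::ios::binary);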

Did you consider using serialization libraries or formats, like e.g.:
XDR (supported by libc) or ASN1
s11n (a C++ serialization library)
JSON, a very simple textual format with many libraries for it (e.g. JsonCpp, Jansson, Jaula, ...)
YAML, a more powerful textual format, with many libraries
or even XML, which is often used for serialization purposes...
(And for serialization of scalars, the htonl and companion routines should help)
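For example, a minimal sketch (helper names are mine) of pushing a 32-bit scalar through htonl/ntohl so it is always stored in network byte order:
#include <arpa/inet.h>   // htonl/ntohl; use <winsock2.h> on Windows
#include <cstdint>
#include <fstream>

void writeScalar(std::ofstream &out, uint32_t value)
{
    uint32_t wire = htonl(value);   // host order -> network (big-endian) order
    out.write(reinterpret_cast<const char *>(&wire), sizeof(wire));
}

uint32_t readScalar(std::ifstream &in)
{
    uint32_t wire = 0;
    in.read(reinterpret_cast<char *>(&wire), sizeof(wire));
    return ntohl(wire);             // network order -> host order
}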

Related

The Standard Way To Encode/Decode To/From Binary Object In C++

I want to encode/decode some basic types into/from binary.
The test code may look like this.
#include <cstdint>
#include <cstring>

int main()
{
    int iter = 0;
    char* binary = new char[100];
    int32_t version = 1;
    memcpy(binary, &version, sizeof(int32_t));
    iter += sizeof(int32_t);
    const char* value1 = "myvalue";
    memcpy(binary + iter, value1, strlen(value1));
    iter += strlen(value1);
    double value2 = 0.1;
    memcpy(binary + iter, &value2, sizeof(double));
#warning TODO - big/small endian - fixed type length
    delete[] binary;
    return 0;
}
But I still need to solve a lot of problems, such as the endian and fixed type length.
So I want to know if there is a standard way to implement this.
Simultaneously, I don't want to use any third-party implementation, such as Boost and so on. Because I need to keep my code simple and Independent.
If there is a function/class like NSCoding in Objc, it will be best. I wonder if there is same thing in C++ standard library.
No, there are no serialization functions within the standard library. Use a library or implement it by yourself.
Note that raw new and delete are bad practice in C++.
The most standard thing you have in every OS base library is ntohs/ntohl and htons/htonl that you can use to go from 'host' to 'network' byte order that is considered the standard for serializing integers.
The problem is that there is no standard API for 64-bit types, and you have to serialize strings yourself anyway (the most common method is to prepend the string data with an int16/32 containing the string length in bytes).
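A minimal sketch of that length-prefix convention (the helper name is made up), writing into a raw buffer the way the question's code does; only OS-provided routines are used:
#include <arpa/inet.h>   // htonl; <winsock2.h> on Windows
#include <cstdint>
#include <cstring>

// Append a length-prefixed string to a buffer; the length is stored as a
// big-endian uint32_t. Returns the number of bytes written. The caller must
// make sure the buffer is large enough.
std::size_t putString(char *buf, const char *s)
{
    uint32_t len = static_cast<uint32_t>(std::strlen(s));
    uint32_t wireLen = htonl(len);
    std::memcpy(buf, &wireLen, sizeof(wireLen));
    std::memcpy(buf + sizeof(wireLen), s, len);
    return sizeof(wireLen) + len;
}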
Again, C/C++ does not offer a standard way to serialize data to/from a binary buffer, XML, or JSON, but there are plenty of libraries that implement this. One of the most widely used, even if it comes with a lot of dependencies, is:
Boost Serialization
Other libraries widely used but that require a precompilation step are:
Google Protocol Buffers
FlatBuffers

Fixed size data types, C++, windows types

I'm trying to get fixed-size floats and ints across all Windows computers. As in, I have a program, I compile it and distribute the executable. I want the data types to be of constant bit size/ordering across all Windows computers.
My first question is whether the Windows types defined at http://msdn.microsoft.com/en-us/library/aa383751(v=vs.85).aspx have fixed sizes across all Windows computers (let's say running the same OS, Windows 7).
Basically, I'd like to transfer data contained in a struct over a network and I want to avoid having to put it all in a string or encode it into a portable binary form.
Edit: What about floats??
You should use the types from
#include <cstdint>
Like uint64_t, int16_t, int8_t etc.
On the ordering: I'm pretty sure Windows only runs on little-endian hardware. Regardless, if platform portability is a concern, why don't you use a proper serialization library and avoid nasty surprises?
Boost Serialization
Protobuf (Google's data interchange formatting library)
Floating Point
While the C++ standard does not define the sizes or formats of floating point values, Microsoft has specified that it consistently uses the 4-byte and 8-byte IEEE floating point formats for the float and double types respectively.
Integrals
As for integral types, Microsoft has compiler-specific typedefs (e.g. __int32) for fixed-length variables, and the exact-width integral types in the cstdint header are standardized as of C99/C++11.
Serialization
This will be terribly unportable and will most likely turn into a maintenance nightmare as your structs get more complicated. What you are effectively doing is defining an error-prone binary serialization format that must be complied with through convention. This problem has already been solved more effectively.
I would highly recommend using a serialization format like protocol buffers or maybe boost::serialization for communication between machines. If your data is hitting the wire, then the performance of serialization/deserialization is going to be an incredibly small fraction of transmission time.
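For a feel of what that looks like, here is a minimal Boost.Serialization sketch (the struct and field names are made up) using a text archive, which is portable across platforms; link with -lboost_serialization:
#include <cstdint>
#include <fstream>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

struct Packet {
    int32_t id;
    double value;

    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/)
    {
        ar & id & value;   // the same function drives both saving and loading
    }
};

int main()
{
    const Packet out_pkt = {42, 3.14};
    {
        std::ofstream ofs("packet.txt");
        boost::archive::text_oarchive oa(ofs);
        oa << out_pkt;                         // save
    }
    Packet in_pkt = {0, 0.0};
    {
        std::ifstream ifs("packet.txt");
        boost::archive::text_iarchive ia(ifs);
        ia >> in_pkt;                          // load
    }
}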
Alignment
Another serious issue that you'll have is how the struct is packed in memory. Your struct will most likely be laid-out in memory differently in a 32-bit process than it is in a 64-bit process.
In a 32-bit process, your struct members will be aligned on word boundaries, and on doubleword boundaries for 64-bit.
For example, this program outputs 20 on 32-bit and 24 on 64-bit platforms:
#include <iostream>
#include <cstdint>
struct mystruct {
uint32_t y;
double z;
uint8_t c;
float v;
} mystruct_t;
int main() {
std::cout << sizeof(mystruct_t);
}
If your compiler is recent enough to give you the standard <stdint.h> header required by ISO C99 (or <cstdint> in recent C++), you'd better use it (it would then make your code portable, w.r.t. this particular issue, even on non-Windows systems). So use types like int32_t or int64_t etc.
If serialization across several platforms is a concern, consider using portable binary formats like XDR, ASN1, perhaps using the s11n library, or textual formats like JSON or YAML

Windows to iPhone binary files

Is it safe to pass binary files from Windows to iPhone that are written like:
std::ostream stream = // get it somehow
stream.write(&MyHugePODStruct, sizeof(MyHugePODStruct));
and read like:
std::istream stream = // get it somehow
stream.read(&MyHugePODStruct, sizeof(MyHugePODStruct));
While the definition of MyHugePODStruct is the same? If not, is there any way to serialize this safely with either the standard library (C++11 included) or Boost? Is there a cleaner way to do this? It seems like a non-portable piece of code.
No, for many reasons. First off, this won't compile, because you have to pass a char * to read and write. Secondly, this isn't guaranteed to work on even one single platform, because the structure may contain padding (and that may differ among differently compiled versions of the code, even on the same platform). Next, there are 64/32-bit issues to consider which affect many of the primitive types (e.g. long double is padded to 12 bytes on x86, but to 16 bytes on x64). Last but not least there's endianness (though I'm not sure what the iOS endianness is).
So in short, no, don't do that.
You have to serialize each struct member separately, and according to its data type.
You might like to check out Boost.Serialization, though I have no experience with it.
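As a rough illustration of the member-by-member approach mentioned above (the struct and helper names are hypothetical, not from the question):
#include <cstdint>
#include <cstring>
#include <ostream>

// Hypothetical POD; the real MyHugePODStruct would be written out field by
// field in the same way.
struct Sample {
    uint32_t count;
    float ratio;        // assumed to be a 4-byte IEEE-754 float on both ends
};

// Write a 32-bit value as four bytes, most significant byte first.
static void writeU32(std::ostream &out, uint32_t v)
{
    const char bytes[4] = {
        static_cast<char>(v >> 24), static_cast<char>(v >> 16),
        static_cast<char>(v >> 8),  static_cast<char>(v)
    };
    out.write(bytes, sizeof(bytes));
}

void writeSample(std::ostream &out, const Sample &s)
{
    writeU32(out, s.count);

    uint32_t bits;
    std::memcpy(&bits, &s.ratio, sizeof(bits));  // reinterpret the float's bits
    writeU32(out, bits);
}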

Testing C++ code for endian-independence

How can I test or check C++ code for endian-independence? It's already implemented, I would just like to verify that it works on both little- and big-endian platforms.
I could write unit tests and run them on the target platforms, but I don't have the hardware. Perhaps emulators?
Are there compile time checks that can be done?
If you have access to an x86-based Mac then you can take advantage of the fact that Mac OS X has PowerPC emulation built in as well as developer tool support for both x86 (little endian) and PowerPC (big endian). This enables you to compile and run a big and little endian executable on the same platform, e.g.
$ gcc -arch i386 foo.c -o foo_x86 # build little endian x86 executable
$ gcc -arch ppc foo.c -o foo_ppc # build big endian PowerPC executable
Having built both big endian and little endian executables you can then run whatever unit tests you have available on both, which will catch some classes of endianness-related problems, and you can also compare any data generated by the executables (files, network packets, whatever) - this should obviously match.
You can set up an execution environment in the opposite endianness using qemu. For example if you have access to little-endian amd64 or i386 hardware, you can set up qemu to emulate a PowerPC Linux platform, run your code there.
I read a story about using Flint (Flexible Lint) to diagnose this kind of error.
Don't know the specifics anymore, but let me google the story back for you:
http://www.datacenterworks.com/stories/flint.html
An Example: Diagnosing Endianness Errors
On a recent engagement, we were porting code from an old Sequent to a SPARC, and after the specific pointer issues we discussed in the Story of Thud and Blunder, we needed to look for other null pointer issues and also endian-ness errors.
I would suggest adopting a coding technique that avoids the problem altogether.
First, you have to understand in which situation an endianess problem occurs. Then either find an endianess-agnostic way to write this, or isolate the code.
For example, a typical problem where endianess issues can occur is when you use memory accesses or unions to pick out parts of a larger value. Concretely, avoid:
long x;
...
char second_byte = *(((char *)&x) + 1);
Instead, write:
long x;
...
char second_byte = (char)(x >> 8);
Concatenation: this is one of my favorites, as many people tend to think that you can only do this using strange tricks. Don't do this:
union uu
{
long x;
unsigned short s[2];
};
union uu u;
u.s[0] = low;
u.s[1] = high;
long res = u.x;
Instead write:
long res = (((unsigned long)high) << 16) | low;
I could write unit tests and run them on the target platforms, but I don't have the hardware.
You can set up your design so that unit tests are easy to run independently of actually having the hardware. You can do this using dependency injection. I can abstract away things like hardware interfaces by providing a base interface class that the code I'm testing talks to.
class IHw
{
public:
virtual void SendMsg1(const char* msg, size_t size) = 0;
virtual void RcvMsg2(Msg2Callback* callback) = 0;
...
};
Then I can have the concrete implementation that actually talks to hardware:
class CHw : public IHw
{
public:
void SendMsg1(const char* msg, size_t size);
void RcvMsg2(Msg2Callback* callback);
};
And I can make a test stub version:
class CTestHw : public IHw
{
public:
void SendMsg1(const char* msg, size_t);
void RcvMsg2(Msg2Callback* callback);
};
Then my real code can use the concrete CHw, but I can simulate it in test code with CTestHw.
class CSomeClassThatUsesHw
{
public:
void MyCallback(const char* msg, size_t size)
{
// process msg 2
}
void DoSomethingToHw()
{
hw->SendMsg1();
hw->RcvMsg2(&MyCallback);
}
private:
IHw* hw;
};
IMO, the only answer that comes close to being correct is Martin's. There are no endianness concerns to address if you aren't communicating with other applications in binary or reading/writing binary files. What happens in a little endian machine stays in a little endian machine if all of the persistent data are in the form of a stream of characters (e.g. packets are ASCII, input files are ASCII, output files are ASCII).
I'm making this an answer rather than a comment to Martin's answer because I am proposing you consider doing something different from what Martin proposed. Given that the dominant machine architecture is little endian while network order is big endian, many advantages arise if you can avoid byte swapping altogether. The solution is to make your application able to deal with wrong-endian inputs. Make the communications protocol start with some kind of machine identity packet. With this info at hand, your program can know whether it has to byte swap subsequent incoming packets or leave them as-is. The same concept applies if the header of your binary files has some indicator that lets you determine the endianness of those files. With this kind of architecture at hand, your application(s) can write in native format and can know how to deal with inputs that are not in native format.
Well, almost. There are other problems with binary exchange / binary files. One such problem is floating point data. The IEEE floating point standard doesn't say anything about how floating point data are stored. It says nothing regarding byte order, nothing about whether the significand comes before or after the exponent, nothing about the storage bit order of the as-stored exponent and significand. This means you can have two different machines of the same endianness that both follow the IEEE standard and you can still have problems communicating floating point data as binary.
Another problem, not so prevalent today, is that endianness is not binary. There are other options than big and little. Fortunately, the days of computers that stored things in 2143 order (as opposed to 1234 or 4321 order) are pretty much behind us, unless you deal with embedded systems.
Bottom line:
If you are dealing with a near-homogenous set of computers, with only one or two oddballs (but not too odd), you might want to think of avoiding network order. If the domain has machines of multiple architectures, some of them very odd, you might have to resort to the lingua franca of network order. (But do beware that this lingua franca does not completely resolve the floating point problem.)
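As a rough sketch of the machine-identity idea (the magic constant and function names are made up for illustration):
#include <cstdint>

// The file or stream starts with a known magic value written in the
// producer's native byte order; the reader compares it against both forms.
const uint32_t kMagic = 0x01020304u;

inline uint32_t byteSwap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Returns 0 if the producer had our endianness, 1 if every multi-byte field
// must be swapped, -1 if the stream is not one of ours at all.
int checkMagic(uint32_t magicFromHeader)
{
    if (magicFromHeader == kMagic)
        return 0;
    if (magicFromHeader == byteSwap32(kMagic))
        return 1;
    return -1;
}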
I personally use Travis to test my software hosted on GitHub, and it supports running on multiple architectures [1], including s390x, which is big endian.
I just had to add this to my .travis.yml:
arch:
- amd64
- s390x # Big endian arch
It's probably not the only CI offering this, but that's the one I was already using. I run both unit tests and integration tests on both systems, which gives me some reasonable confidence that it works fine regardless of endianness.
It's no silver bullet though; I'd like to have an easy way to test it manually too, just to ensure there's no hidden error (e.g. I'm using SDL, so colors could be wrong. I use screenshots to validate the output, but the screenshot code could have errors that compensate for the display problem, so the tests could pass while the display is wrong).
[1] https://blog.travis-ci.com/2019-11-12-multi-cpu-architecture-ibm-power-ibm-z

Is there a way to enforce specific endianness for a C or C++ struct?

I've seen a few questions and answers regarding to the endianness of structs, but they were about detecting the endianness of a system, or converting data between the two different endianness.
What I would like to know, however, is whether there is a way to enforce specific endianness for a given struct. Are there some good compiler directives or other simple solutions, besides rewriting the whole thing as a lot of macros manipulating bitfields?
A general solution would be nice, but I would be happy with a specific gcc solution as well.
Edit:
Thank you for all the comments pointing out why it's not a good idea to enforce endianness, but in my case that's exactly what I need.
A large amount of data is generated by a specific processor (which will never ever change; it's an embedded system with custom hardware), and it has to be read by a program (which I am working on) running on an unknown processor. Byte-wise evaluation of the data would be horribly troublesome because it consists of hundreds of different types of structs, which are huge and deep: most of them have many layers of other huge structs inside.
Changing the software for the embedded processor is out of the question. The source is available, this is why I intend to use the structs from that system instead of starting from scratch and evaluating all the data byte-wise.
This is why I need to tell the compiler which endianness it should use; it doesn't matter how efficient or inefficient it will be.
It does not have to be a real change in endianness. Even if it's just an interface, and physically everything is handled in the processors own endianness, it's perfectly acceptable to me.
The way I usually handle this is like so:
#include <arpa/inet.h> // for ntohs() etc.
#include <stdint.h>
class be_uint16_t {
public:
be_uint16_t() : be_val_(0) {
}
// Transparently cast from uint16_t
be_uint16_t(const uint16_t &val) : be_val_(htons(val)) {
}
// Transparently cast to uint16_t
operator uint16_t() const {
return ntohs(be_val_);
}
private:
uint16_t be_val_;
} __attribute__((packed));
Similarly for be_uint32_t.
Then you can define your struct like this:
struct be_fixed64_t {
be_uint32_t int_part;
be_uint32_t frac_part;
} __attribute__((packed));
The point is that the compiler will almost certainly lay out the fields in the order you write them, so all you are really worried about is big-endian integers. The be_uint16_t object is a class that knows how to convert itself transparently between big-endian and machine-endian as required. Like this:
be_uint16_t x = 12;
x = x + 1; // Yes, this actually works
write(fd, &x, sizeof(x)); // writes 13 to file in big-endian form
In fact, if you compile that snippet with any reasonably good C++ compiler, you should find it emits a big-endian "13" as a constant.
With these objects, the in-memory representation is big-endian. So you can create arrays of them, put them in structures, etc. But when you go to operate on them, they magically cast to machine-endian. This is typically a single instruction on x86, so it is very efficient. There are a few contexts where you have to cast by hand:
be_uint16_t x = 37;
printf("x == %u\n", (unsigned)x); // Fails to compile without the cast
...but for most code, you can just use them as if they were built-in types.
A bit late to the party but with current GCC (tested on 6.2.1 where it works and 4.9.2 where it's not implemented) there is finally a way to declare that a struct should be kept in X-endian byte order.
The following test program:
#include <stdio.h>
#include <stdint.h>
struct __attribute__((packed, scalar_storage_order("big-endian"))) mystruct {
uint16_t a;
uint32_t b;
uint64_t c;
};
int main(int argc, char** argv) {
struct mystruct bar = {.a = 0xaabb, .b = 0xff0000aa, .c = 0xabcdefaabbccddee};
FILE *f = fopen("out.bin", "wb");
size_t written = fwrite(&bar, sizeof(struct mystruct), 1, f);
fclose(f);
}
creates a file "out.bin" which you can inspect with a hex editor (e.g. hexdump -C out.bin). If the scalar_storage_order attribute is suppported it will contain the expected 0xaabbff0000aaabcdefaabbccddee in this order and without holes. Sadly this is of course very compiler specific.
No, I don't think so.
Endianness is an attribute of the processor that indicates whether integers are represented from left to right or right to left; it is not an attribute of the compiler.
The best you can do is write code which is independent of any byte order.
Try using
#pragma scalar_storage_order big-endian to store in big-endian format
#pragma scalar_storage_order little-endian to store in little-endian format
#pragma scalar_storage_order default to store in your machine's default endianness
Read more here
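For illustration, a minimal sketch of the pragma form (the struct and field names are made up); like the attribute, it is a GCC extension (GCC 6 and later) and, as far as I know, is only accepted when compiling C, not C++:
#include <stdint.h>

#pragma scalar_storage_order big-endian
struct wire_record {                    /* scalar members stored big-endian in memory */
    uint16_t type;
    uint32_t length;
};
#pragma scalar_storage_order default    /* back to the machine's native order */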
No, there's no such capability. If it existed that could cause compilers to have to generate excessive/inefficient code so C++ just doesn't support it.
The usual C++ way to deal with serialization (which I assume is what you're trying to solve) is to let the struct remain in memory in the exact layout desired and do the serialization in such a way that endianness is preserved upon deserialization.
I am not sure if the following can be modified to suit your purposes, but where I work, we have found the following to be quite useful in many cases.
When endianness is important, we use two different data structures. One represents how the data is expected to arrive. The other is how we want it represented in memory. Conversion routines are then developed to switch between the two.
The workflow operates thusly ...
Read the data into the raw structure.
Convert to the "raw structure" to the "in memory version"
Operate only on the "in memory version"
When done operating on it, convert the "in memory version" back to the "raw structure" and write it out.
We find this decoupling useful for reasons including (but not limited to) ...
All conversions are located in one place only.
Fewer headaches about memory alignment issues when working with the "in memory version".
It makes porting from one arch to another much easier (fewer endian issues).
Hopefully this decoupling can be useful to your application too.
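A minimal sketch of that decoupling, with made-up struct names, assuming the raw data arrives big-endian and using the OS-provided ntohl/htonl routines:
#include <cstdint>
#include <arpa/inet.h>   // ntohl/htonl; <winsock2.h> on Windows

struct RawRecord {       // exactly as it arrives on the wire (big-endian fields)
    uint32_t id_be;
    uint32_t length_be;
};

struct Record {          // what the rest of the program works with
    uint32_t id;
    uint32_t length;
};

Record fromRaw(const RawRecord &raw)
{
    Record r;
    r.id     = ntohl(raw.id_be);
    r.length = ntohl(raw.length_be);
    return r;
}

RawRecord toRaw(const Record &r)
{
    RawRecord raw;
    raw.id_be     = htonl(r.id);
    raw.length_be = htonl(r.length);
    return raw;
}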
A possible innovative solution would be to use a C interpreter like Ch and force the endian coding to big.
Boost provides endian buffers for this.
For example:
#include <boost/endian/buffers.hpp>
#include <boost/static_assert.hpp>
using namespace boost::endian;
struct header {
big_int32_buf_t file_code;
big_int32_buf_t file_length;
little_int32_buf_t version;
little_int32_buf_t shape_type;
};
BOOST_STATIC_ASSERT(sizeof(header) == 16U);
Maybe not a direct answer, but having a read through this question can hopefully answer some of your concerns.
You could make the structure a class with getters and setters for the data members. The getters and setters are implemented with something like:
int getSomeValue( void ) const {
#if defined( BIG_ENDIAN )
    return _value;                              // stored (big-endian) value matches host order
#else
    return convert_to_little_endian( _value );  // swap for little-endian hosts
#endif
}

void setSomeValue( int newValue ) {
#if defined( BIG_ENDIAN )
    _value = newValue;
#else
    _value = convert_to_big_endian( newValue ); // keep the stored form big-endian
#endif
}
We do this sometimes when we read a structure in from a file - we read it into a struct and use this on both big-endian and little-endian machines to access the data properly.
There is a data representation for this called XDR. Have a look at it.
http://en.wikipedia.org/wiki/External_Data_Representation
Though it might be a little too much for your Embedded System. Try searching for an already implemented library that you can use (check license restrictions!).
XDR is generally used in network systems, since they need a way to move data in an endianness-independent way, though nothing says it cannot be used outside of networking.
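For a feel of the API, a minimal sketch using the classic Sun RPC XDR routines from <rpc/xdr.h> (historically part of libc; on modern glibc-based Linux they come from libtirpc):
#include <rpc/xdr.h>   /* Sun RPC XDR routines */

/* Encode an int and a double into a caller-supplied buffer in XDR
   (big-endian) form; returns the number of encoded bytes. */
unsigned int encode_example(char *buffer, unsigned int size)
{
    XDR xdrs;
    xdrmem_create(&xdrs, buffer, size, XDR_ENCODE);

    int answer = 42;
    double ratio = 0.5;
    xdr_int(&xdrs, &answer);
    xdr_double(&xdrs, &ratio);

    unsigned int used = xdr_getpos(&xdrs);
    xdr_destroy(&xdrs);
    return used;
}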