Binary files and cross platform compatibility

Binary files and cross platform compatibility - c++

I have written a C++ library that saves my data (a collection of custom structs etc) into a binary file. I currently use (i.e. create and consume) the files locally, on my Windows (XP) machine. For simplicity, lets think of the library in two parts: a writer (Creates the files) and a reader or consumer (simply reads data from the files).
Recently though, I would like to also consume (i.e. read) the data files I have created on my XP machine, on my Linux machine. I must point out at this stage that both machines are PCs (so have the same endianess etc).
I can build a reader (and compile for Linux [Ubuntu 9.10 to be precise]), since I am the library creator. My question, before I embark down this road (of building the reader etc) is:
Assuming I have succesfully built the reader for Linux,
Can I simply copy accross, files that were created on the windows (XP) machine to the Linux (Ubuntu 9.10) machine and use the Linux reader to successfully read the copied over file?

For the files to be binary compatible:
endianness must match (as it does for you)
bitfield packing order must be the same
sizes and signedness of types must be the same
the compiler must make the same decisions about padding and alignment
It's certainly possible for all of these conditions to be fulfilled, or for you to not happen to be hitting any cases for which they are not. At the very least, though, I'd add some sanity checks and/or sentinel members to detect problems.

Binary files should be compatible across machines with the same endianess.
The issue you may have in your code is the size of ints, you can't necessarily assume that the compiler on different OS's has the same size int. So either copy blocks of bytes and cast them, or use int16, int32 etc.

Structs are not a file format, and you shouldn't try to use them as such.
When attempting to make structs work with fread and fwrite, there's a huge number of hacks to make it work. You byte-swap integers so that you can share files between little-endian and big-endian machines. You change your structs to use fixed-width integer types, so you can share between machines with different word sizes (such as between x86 and x64 machines). You add compiler-specific pragmas to control the padding of structs to share between compiler versions.
It works, but it's ugly. Not to mention, easy to get wrong.
Much like the recommendation in The byte order fallacy, a much better idea is to write code to read/write the fields individually. By writing your own code, you can ensure there's no padding, and you can choose integer sizes independently of the local size of integers, and you can support both endiannesses without byte-swapping (by reading/writing the bytes of an integer separately).
Unlike the hacky approach, this is hard to get wrong. Further, because you don't rely on any compiler or architecture specific behaviors, either your code will work on all compilers and architectures, or none. If you do it right, you shouldn't have any platform-specific bugs.
There is one downside; individually reading/writing the fields will be slower than just using fread/fwrite directly. You can set up a buffer (uint8_t buffer[]) and write the entirety of the data into it, and then write everything out at once, which might help, but it'll still be slower (because you'd still have to move the fields into the buffer one at a time), but for most purposes it'll still be fast enough (exceptions being embedded / real-time systems or extremely high performance computing).

If:
the machines have the same endianess (as you stated they have) and
you do open the streams in binary mode, as text mode might do funny things e.g. with line-ends and
you have programmed cleanly so you don't stumble over implementation-defined stuff like alignments, data type sizes, and struct packing,
then yes, your files should be portable.
The third bullet point is what makes a file format a "portable" one. Depending on what kind of data you have in your structs, it can be very easy or a bit tricky. Bitfields, or data being reinterpreted from a different type are especially tricky.

You might consider taking a look at the Boost Serialization Library.
A lot of thought has been put into it, and it will handle many of the potential cross-platform incompatibilities for you.
Of course, it's possible that it's overkill for your particular use case, especially if you've already got your writers & readers implemented.

Related

Are the C++ Standard Library File stream operations crippled in Microsoft?

I'm asking this question because I have been working on a project that requires collecting a lot of data REALLY fast, depending on the scenario. 5.7GBytes with a capital BYTE per second or 11.4GBytes per second.
We are working with a small striped raid array using 3 Samsung Pro NVME (for 11.4GB/s we have a larger array).
Currently, the project has been developed on Windows, I wanted to make things as portable as possible so I focused on using C++ Standard Library; however, no matter what I did I could not crack transferring files faster than 1.5GB/s
The strategy was simple to create a couple of huge swap buffers, and write them directly to disk as a huge unformatted binary file.
Using std::ofstream
and benchmarking manually setting varied buffer sizes through:
rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
open(Filename, std::ios::binary|std::ios::trunc);
followed by my managed write loop, I was able to find a sweet spot, but never able to crack 1.5GB/s
I then found the Windows SDK and its CreateFile function
In particular, the create file function using the FILE_FLAG_NO_BUFFERING flag.
This was a game-changer, as long as I made sure I fed it sector-aligned data (in my case everything needed to be some multiple of 512Bytes) I was suddenly able to take full advantage of the raid array throughput.
I revisited the std::ofstream function in an attempt to work with more OS-agnostic functions; however, even though one can specify zero buffer for std::ofstream, there doesn't appear to be any documentation with regards to any caveats to using that function with no buffer.
std::ofstream allows 64bit values for its write size, unlike Windows SDK WriteFile which only accepts DWORD's setting the maximum write size is the largest multiple of 512 one can squeeze into a uint32_t and you must manage your write in a loop if your file exceeds 4GB (mine do).
This just raises the question, is Microsoft simply not giving the C++ Standard Library Devs access to the necessary OS-level system calls to take advantage of Ultra-high-speed drive arrays? Or am I missing something in how to use the C++ Standard Library to its full potential?

"is Microsoft simply not giving the C++ Standard Library Devs..."
You might notice that the product you're using is called Microsoft Visual Studio. The Standard Library developers for Visual Studio work at Microsoft, although in a different team as the Windows developers.
The reason is a bit more simple: the Visual C++ devs can't possibly know and optimize for all possible use scenario's. It's a bit unusual to do text formatting at such high speeds. Remember, the point of ostream is to provide operator<<. ofstream is for formatted output to files. But for high-speed I/O you want binary output anyway.

To put it bluntly, the bandwidth you're aiming for are within the ballpark of the physical limits of current commodity hardware (~24GByte/s for 16×PCIe.4), and in my own work I found it very challenging to reach single-core memory transfer rates above 8GByte/s without the use of "dark magic" (aka hand crafted assembly and optimized system call code), and it involved carefully aligning the memory accesses and making use of vector extensions. But most importantly, to reach these levels of optimization requires to be aware of the kind of data that is being processed and what kind of access patters to expect and/or build caching intermediaries to accomodate for the underlying hardware.
Such optimizations are plain and simply outside of the scope of general purpose standard libraries. Standard libraries in their implementation must adhere to the behaviours written down in the specification, and some of these requirements tend to collide with what has to be done to make the most of the underlying hardware.
So I'm sorry to tell you, but you'll probably have to bite the bullet and use the low level system APIs directly, bypassing the standard library.

Converting endianness of struct-Data

What i have is:
a hex file with the bytes of a c-struct in it, orderd in big-endian
the struct definition as *.h file
the struct information as dwarf2 debug info
My application has to be written in C / C++. Intermediate scripts using for example python would be ok.
What i have to do is read the bytes of the hex-file and cast it into the struct type on a system that is little-endian.
And during this process, i will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function, that does byteswapping for each struct-member, but since the struct has multiple layers and ~1200 members that are changing faster than i can update my conversion function, writing that by hand is no solution.
So i could generate the conversion function automatically by:
Finding and parsing the types inside multiple *.h files
Iterating members of all struct-types and generate swaps for them -> without some sort of reflection api not that easy)
loading the struct via the conversion function.
Since this solution seems like quite a bit of work, i was wandering if there is easier way like telling the compiler to swap it or use debug-info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this / changing the input-conditions or delegating responsibilities to other developers involved is not pssible.
Changing something about the hex-file as an input is not possible. This file comes out of some other system that will not change to fix this problem here.
Padding, Datatype-sizes etc. are identical. This is ensured by other measures, too. So endianess is defenetly the only problem. This is also why i see no reason against using dwarf2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad. But It has some reasons why it is that way and to be short, i can/am not allowed not change that anyway because of process-reasons and backwards compatibility.
To give some more scope:
The Software that all of this is used in is deployed to multiple different embedded devices (multiple types). The hex-file containes the calibration information of the software and is thus stored in a specific system that can only output this hex-file.
I am now porting the software to a little-endian device and i have to use the hex-file given from the "main" branch of software, which is big-endian, as an input.

There is no way to tell C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic conversion code generation.

The problem, as far as I understand it, is tricky but tractable. As far as I understand, data extraction won't be running on an embedded device, so it won't be resource constrained. I say - embrace the runtime inefficiency that desktop hardware allows, and go for easy to debug instead.
Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "generic binary file with an open ended, evolving schema". The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Scroll for the compile units (CUs). In each CU, scroll through the top level entries (DIEs). Look for a DW_TAG_structure_type DIE with a specific value of DW_AT_name (I hope the struct name is known in advance). Then go through the DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you the offset, letting you work around the padding. Look at DW_AT_type to detect the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-type members as necessary.
From that, generate a format string for the struct.unpack method - it can read big-endian ints seamlessly. Then use struct.pack to format it into whatever format the C++ consumer expects.
This depends on you being able to track the data file to the DWARF info of the generating executable, exactly the same build. I hope the processes of the organization allow for that.

Recent versions of GCC allow the declaration of the desired endianness irrespective of the target platform for a source code section using the pragma scalar_storage_order or a specific type using an attribute with the same identifier. The main catch: g++ does not support this. Also, this won't work in all cases. For example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.
The persistence layout is based on the original struct layout - so be it. However, a more explicit approach of serializing the structs should be preferred for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be explicitly specified. For persistence, a packing of 1 would be optimal. For in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux/Windows - LP64 vs. LLP64). So, keeping the persistence layout separate from in-memory data structures tends to have a long list of advantages and therefore usually outweights the disadvantage of having to maintain the serialization code separately. Particularly, if portability is a major concern.
You could take advantage of C/C++-based reflection libraries or implement one yourself. In case of C, this will definitely require macros (e.g. Metaresc). In case of C++, you might actually get away your original struct definitions (e.g. Boost.Precise and Flat Reflection).
If reflection is not an option, you could generate the serialization code either by parsing the headers or debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier, you could simplify parsing by processing the gdb output of ptype based on debug symbols. Or, you could parse debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than sticking to generating the serialization code as part of the build process, you could generate that code once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing that would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input, it would only have to be good enough for one-time generation.

Creating and using a cross platform struct in C++

I am writing a cross platform game with networking capabilities (using SFML and RakNet) and I have come to the point where I have compiled the server on my Ubuntu server and got a client going on my Mac. All the development is done on my Mac so I have initially been testing the server on that, and it has worked fine.
I am sending structs over the network and then simply casting them back from char * to (for example) inet::PlayerAdded. Now this has been working fine (for the most part), but my question is: Will this always work? It seems like a very fragile approach. Will the struct's always be laid out the same even on other platforms, Windows, for example? What would you recommend?
#pragma pack(push, 1)
struct Player
{
int dir[2];
int left;
float depth;
float elevation;
float velocity[2];
char character[50];
char username[50];
};
// I have been added to the game and my ID is back
struct PlayerAdded: Packet
{
id_type id;
Player player;
};
#pragma pack(pop)

This won't work if (among other things) you attempt to do it from little-endian machine to big-endian machine as the correct int representation will be reversed between the two.
This could also fail if the alignment or packing of your structure changes from machine to machine. What if you have some 64-bit machines and some that are 32-bit?
You need to use a proper portable serialization library like Boost.Serialization or Google Protocol Buffers to ensure you have a wire protocol (aka transmissible data format) that can be decoded successfully independent of the hardware.
Once nice thing about Protocol Buffers is that you can compress the data transparently using a ZLIB-compatible stream that is also compatible with the protobuf streams. I have actually done this, it works well. I imagine other decorator streams can be used in an analogous way to enhance or optimize your basic wire protocol as needed.

Like many of the other answers I'd advise against sending raw binary data if it can be avoided. Something like Boost serial or Google Protobuf will do a fine job without too much overhead.
But you certainly can create cross-platform binary structures, it is done all the time and is a very valid way of exchanging data. Layering a "struct" over that data just makes sense. You do however have to be very careful of layout, fortunately most compilers give you many options to do that. "pack" is one such such option and takes care of a lot.
You also need to take care about data sizes. Simple include stdint.h and use the fixed size types like uint32_t. Be careful of floating point values as not all architectures will share the same value, for 32-bit float they likely do. Also for endianess most architectures will use the same, and if they don't you can simply flip it on the client which is different.

The answer to "... laid out the same even on other platforms ..." is generally no. This is so even if such issues as different CPUs and/or different endianness are addressed.
Different operating systems (even on the same hardware platform) might use different data representations; this is normally called the "platform ABI" and it's different between e.g. 32bit/64bit Windows, 32bit/64bit Linux, MacOSX.
'#pragma pack' is only half the way, because beyond alignment restrictions there can be data type size differences. For example, "long" on 64bit Windows is 32bit while it's 64bit on Linux and MacOSX.
That said, the problem obviously isn't new and has been addressed in the past already - the remote procedure call standard (RPC) contains mechanisms for how to define data structures in a platform-independent way and how to encode/decode "buffers" representing these structs. It's called "XDR" (eXternal Data Representation). See RFC1832.
As programming goes, this wheel has been reinvented several times; whether you convert to XML, do the low level work yourself with XDR, use google::protobuf, boost or Qt::Variant, there's lot to choose from.
On a purely implementation side: For simpliclty, just assume that "unsigned int" everywhere is 32bit aligned at a 32bit boundary; if you can encode all your data as array of 32bit values then the only externalization issue you have to deal with is endianness.

Potential problems porting to different architectures

I'm writing a Linux program that currently compiles and works fine on x86 and x86_64, and now I'm wondering if there's anything special I'll need to do to make it work on other architectures.
What I've heard is that for cross platform code I should:
Don't assume anything about the size of a pointer, int or size_t
Don't make assumptions about byte order (I don't do any bit shifting -- I assume gcc will optimize my power of two multiplication/division for me)
Don't use assembly blocks (obvious)
Make sure your libraries work (I'm using SQLite, libcurl and Boost, which all seem pretty cross-platform)
Is there anything else I need to worry about? I'm not currently targeting any other architectures, but I expect to support ARM at some point, and I figure I might as well make it work on any architecture if I can.
Also, regarding my second point about byte order, do I need to do anything special with text input? I read files with getline(), so it seems like that should be done automatically as well.

In my experience, once code works well on a couple architectures, it will port more easily to a third one. Input shouldn't be an issue. Structure alignment may be an issue if you do anything where alignment is an issue.
Pay attention to anything that might be platform-dependent: relying on bitfields being aligned the same way, assuming variables are a particular size, etc. If your code is relatively abstract from the hardware, you will likely encounter few problems. If you are doing something with something like networking code, you will have to make sure you align with network byte order properly.
I have ported device drivers from PPC to x86, and then to x86_64; in a few thousand lines, there were maybe a couple changes, primarily related to structure and integer ordering.
The only way to know for sure is to try it, of course.

Cross platform programming question (file I/O)

I have a C++ class that looks a bit like this:
class BinaryStream : private std::iostream
{
public:
explicit BinaryStream(const std::string& file_name);
bool read();
bool write();
private:
Header m_hdr;
std::vector<Row> m_rows;
}
This class reads and writes data in a binary format, to disk. I am not using any platform specific coding - relying instead on the STL. I have succesfully compiled on XP. I am wondering if I can FTP the files written on the XP platform and read them on my Linux machine (once I recompile the binary stream library on Linux).
Summary:
Files created on Xp machine using a cross platform library coompiled for XP.
Compile the same library (used in 1 above) on a Linux machine
Question: Can files created in 1 above, be read on a Linux machine (2) ?
If no, please explain why not, and how I may get around this issue.

Derive from std::basic_streambuf. That's what they are there for. Note, most STL classes are not designed to be derived from. The one I mention is an exception.

This depends entirely on the specifics of the binary encoding. One thing that's different about Linux vs. XP is that you're much more likely to find yourself on a big-endian platform, and if your binary encoding is endian specific you'll end up with issues.
You may also end up with issues relating to the end-of-line character. There isn't enough information here about how you're using ::std::iostream to give you a good answer to this question.
I would strongly suggest looking at the protobuf library. It is an excellent library for creating fast cross-platform binary encodings.

If you want that your code is portable across machines with different endianess, you need to stick to using one endianess in your files. Whenever you read or write files, you do conversions between the host byte order, and the file byte order. It's common to use what you call network byte order when you want to write files that are portable across all machines. Network byte order is defined to be big endian, and there are pre-made functions made to deal with those conversions (although they are very easy to write yourself).
For example, before writing a long to a file, you should convert it to network byte order using htonl(), and when reading from a file you should convert it back to host byte order with ntohl(). On big-endian system htonl() and ntohl() simply return the same number as passed to the function, but on little-endian system it swaps each byte in the variable.
If you don't care about supporting big-endian systems, none of this is an issue though, although it's still good practice.
Another important thing to pay attention to is padding of your structs/classes that you write, if you write them directly to the file (eg. Header and Row). Different compilers on different platforms can use different padding, which means that variables are aligned differently in the memory. This can break things big-time, if the compilers you use on different platform use different padding. So for structs that you intend to write directly to files/other streams, you should always specify padding. You should tell the compiler to pack your structs like this:
#pragma pack(push, 1)
struct Header {
// This struct uses 1-byte padding
...
};
#pragma pack(pop)
Remember that doing this will make using the struct more inefficient when you use it in your application, because access to unaligned memory addresses means more work for the system. This is why it's generally a good idea to have separate types for the packed structs that you write to streams, and a type that you actually use in the application (you just copy the members from one to other).
EDIT. Another way to deal with the issue, of course, is to serialize those structs yourself, which won't require using #pragma (pragmas are compiler-dependent feature, although all major compilers to my knowledge supports the pragma pack).

Here is an article Endianness that is related to your question. Look for "Endianness in files and byte swap". Briefly if If your Linux machine has the same endianes than it's OK, if not - there migth be problems.
For example when integer 1 is written in file on XP it looks like this: 10 00
But when integer 1 is written in file on machine with the other endianess it will look like this: 00 01
But if you use only one byte characters there must be no problem.

As long as it's plain binary files it should work

Because you're using the STL for everything, there's no reason your program shouldn't be able to read the files on a different platform.

If you are writing a struct / class directly out to the disc, then don't.
This might not be compatible between different builds on the same compiler, and almost certainly will break when you move to a different platform or compiler. It will definitely break if you change to a different architecture.
It isn't clear from the above code what you're actually writing to the file.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js