Cross platform programming question (file I/O)


I have a C++ class that looks a bit like this:
class BinaryStream : private std::iostream
{
public:
explicit BinaryStream(const std::string& file_name);
bool read();
bool write();
private:
Header m_hdr;
std::vector<Row> m_rows;
};
This class reads and writes data in a binary format, to disk. I am not using any platform specific coding - relying instead on the STL. I have successfully compiled on XP. I am wondering if I can FTP the files written on the XP platform and read them on my Linux machine (once I recompile the binary stream library on Linux).
Summary:
1. Files are created on an XP machine using a cross-platform library compiled for XP.
2. The same library (used in 1 above) is compiled on a Linux machine.
Question: Can files created in 1 above be read on a Linux machine (2)?
If not, please explain why not, and how I may get around this issue.

Derive from std::basic_streambuf. That's what they are there for. Note, most STL classes are not designed to be derived from. The one I mention is an exception.

This depends entirely on the specifics of the binary encoding. One thing that's different about Linux vs. XP is that you're much more likely to find yourself on a big-endian platform, and if your binary encoding is endian specific you'll end up with issues.
You may also end up with issues relating to the end-of-line character. There isn't enough information here about how you're using ::std::iostream to give you a good answer to this question.
I would strongly suggest looking at the protobuf library. It is an excellent library for creating fast cross-platform binary encodings.

If you want your code to be portable across machines with different endianness, you need to stick to one endianness in your files. Whenever you read or write files, you convert between the host byte order and the file byte order. It's common to use network byte order when you want to write files that are portable across all machines. Network byte order is defined to be big-endian, and there are ready-made functions for those conversions (although they are very easy to write yourself).
For example, before writing a 32-bit integer to a file, you should convert it to network byte order using htonl(), and when reading from a file you should convert it back to host byte order with ntohl(). (Despite the "l" in the names, these functions operate on 32-bit values, whatever the size of long on your platform.) On a big-endian system htonl() and ntohl() simply return the number passed in, but on a little-endian system they reverse the order of its bytes.
If you don't care about supporting big-endian systems, none of this is an issue, although it's still good practice.
Another important thing to pay attention to is the padding of the structs/classes that you write, if you write them directly to the file (e.g. Header and Row). Different compilers on different platforms can use different padding, which means that members are aligned differently in memory. This can break things big-time. So for structs that you intend to write directly to files or other streams, you should always specify the padding. You can tell the compiler to pack your structs like this:
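As a rough illustration of the htonl()/ntohl() round trip described above, here is a minimal sketch. The helper names write_u32, read_u32 and encode_u32 are made up for this example, and the header selection assumes either a POSIX system or Windows with Winsock available.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

#ifdef _WIN32
#include <winsock2.h>   // htonl / ntohl on Windows (link with Ws2_32)
#else
#include <arpa/inet.h>  // htonl / ntohl on POSIX systems
#endif

// Write a 32-bit value in network (big-endian) byte order.
inline void write_u32(std::ostream& out, std::uint32_t host_value) {
    const std::uint32_t net_value = htonl(host_value);
    out.write(reinterpret_cast<const char*>(&net_value), sizeof net_value);
}

// Read a 32-bit value stored in network byte order; returns false on a short read.
inline bool read_u32(std::istream& in, std::uint32_t& host_value) {
    std::uint32_t net_value = 0;
    if (!in.read(reinterpret_cast<char*>(&net_value), sizeof net_value)) {
        return false;
    }
    host_value = ntohl(net_value);
    return true;
}

// Convenience wrapper for the example: the four bytes as they would appear in a file.
inline std::string encode_u32(std::uint32_t host_value) {
    std::ostringstream out(std::ios::binary);
    write_u32(out, host_value);
    return out.str();
}
```

Whatever the host byte order, encode_u32(0x01020304) should yield the bytes 01 02 03 04, which is what makes the file portable.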
#pragma pack(push, 1)
struct Header {
// This struct uses 1-byte padding
...
};
#pragma pack(pop)
Remember that doing this makes the struct less efficient to use in your application, because access to unaligned memory addresses means more work for the system. This is why it's generally a good idea to have separate types: a packed struct that you write to streams, and a type that you actually use in the application (you just copy the members from one to the other).
EDIT. Another way to deal with the issue, of course, is to serialize those structs yourself, which doesn't require #pragma at all (pragmas are a compiler-dependent feature, although to my knowledge all major compilers support #pragma pack).
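To make the "separate types" idea concrete, here is a small sketch. The question does not show what Header contains, so the fields below (version, row_count, flags) and the type names HeaderOnDisk and HeaderInMemory are purely illustrative.

```cpp
#include <cstdint>

// On-disk layout: packed to 1 byte, fixed-width members only (hypothetical fields).
#pragma pack(push, 1)
struct HeaderOnDisk {
    std::uint8_t  version;
    std::uint32_t row_count;
    std::uint16_t flags;
};
#pragma pack(pop)

// In-memory type with natural alignment, used by the rest of the application.
struct HeaderInMemory {
    std::uint8_t  version;
    std::uint32_t row_count;
    std::uint16_t flags;
};

// Fails to compile if the compiler inserted padding into the packed struct.
static_assert(sizeof(HeaderOnDisk) == 7, "packed layout must have no padding");

// Copy member by member; values are copied, so no reference to an unaligned member is formed.
inline HeaderOnDisk to_disk(const HeaderInMemory& h) {
    HeaderOnDisk d;
    d.version = h.version;
    d.row_count = h.row_count;
    d.flags = h.flags;
    return d;
}

inline HeaderInMemory from_disk(const HeaderOnDisk& d) {
    HeaderInMemory h;
    h.version = d.version;
    h.row_count = d.row_count;
    h.flags = d.flags;
    return h;
}
```

Note that this only pins down padding and sizes. Byte order still has to be handled separately.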

Here is an article, Endianness, that is related to your question. Look for "Endianness in files and byte swap". Briefly: if your Linux machine has the same endianness then it's OK; if not, there might be problems.
For example, when the 16-bit integer 1 is written to a file on XP (little-endian) it looks like this: 01 00
But when the integer 1 is written to a file on a machine with the other endianness it will look like this: 00 01
But if you use only one-byte characters there should be no problem.
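If you want to see this on your own machines, a small sketch like the following prints the in-memory bytes of a 16-bit value. The function names are made up for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Return the in-memory bytes of a 16-bit value as hex text, e.g. "01 00".
inline std::string bytes_of_u16(std::uint16_t value) {
    unsigned char raw[sizeof value];
    std::memcpy(raw, &value, sizeof value);
    char text[6];  // two hex pairs, one space, terminating NUL
    std::snprintf(text, sizeof text, "%02x %02x",
                  static_cast<unsigned>(raw[0]), static_cast<unsigned>(raw[1]));
    return text;
}

// True when the least significant byte is stored first.
inline bool host_is_little_endian() {
    return bytes_of_u16(1) == "01 00";
}
```

On an x86 PC bytes_of_u16(1) should give "01 00"; on a big-endian machine it should give "00 01".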

As long as it's plain binary files it should work

Because you're using the STL for everything, there's no reason your program shouldn't be able to read the files on a different platform.

If you are writing a struct / class directly out to the disc, then don't.
This might not be compatible between different builds on the same compiler, and almost certainly will break when you move to a different platform or compiler. It will definitely break if you change to a different architecture.
It isn't clear from the above code what you're actually writing to the file.

Related

Converting endianness of struct-Data

What I have is:
a hex file with the bytes of a C struct in it, ordered big-endian
the struct definition as *.h file
the struct information as dwarf2 debug info
My application has to be written in C / C++. Intermediate scripts using for example python would be ok.
What I have to do is read the bytes of the hex file and cast them into the struct type on a system that is little-endian.
During this process, I will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function that does byte swapping for each struct member, but since the struct has multiple layers and ~1200 members that change faster than I can update my conversion function, writing that by hand is no solution.
So i could generate the conversion function automatically by:
Finding and parsing the types inside multiple *.h files
Iterating over the members of all struct types and generating swaps for them (not that easy without some sort of reflection API)
Loading the struct via the conversion function.
Since this solution seems like quite a bit of work, I was wondering if there is an easier way, like telling the compiler to swap it or using the debug info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this, changing the input conditions, or delegating responsibilities to other developers involved is not possible.
Changing something about the hex file as an input is not possible. This file comes out of some other system that will not change to fix this problem here.
Padding, datatype sizes etc. are identical. This is ensured by other measures, too. So endianness is definitely the only problem. This is also why I see no reason against using DWARF2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad. But there are reasons why it is that way and, in short, I cannot and am not allowed to change it anyway, for process reasons and backwards compatibility.
To give some more scope:
The software that all of this is used in is deployed to multiple different embedded devices (multiple types). The hex file contains the calibration information of the software and is thus stored in a specific system that can only output this hex file.
I am now porting the software to a little-endian device and I have to use the hex file produced by the "main" branch of the software, which is big-endian, as an input.

There is no way to tell C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic conversion code generation.

The problem, as far as I understand it, is tricky but tractable. As far as I understand, data extraction won't be running on an embedded device, so it won't be resource constrained. I say - embrace the runtime inefficiency that desktop hardware allows, and go for easy to debug instead.
Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "generic binary file with an open ended, evolving schema". The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Scroll for the compile units (CUs). In each CU, scroll through the top level entries (DIEs). Look for a DW_TAG_structure_type DIE with a specific value of DW_AT_name (I hope the struct name is known in advance). Then go through the DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you the offset, letting you work around the padding. Look at DW_AT_type to detect the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-type members as necessary.
From that, generate a format string for the struct.unpack method - it can read big-endian ints seamlessly. Then use struct.pack to format it into whatever format the C++ consumer expects.
This depends on you being able to track the data file to the DWARF info of the generating executable, exactly the same build. I hope the processes of the organization allow for that.
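If the consumer side has to stay in C++, the same offset information can drive a table-based swap instead of a hand-written function per struct. The sketch below assumes the (offset, size) rows have already been extracted from the DWARF data by some external tool; the FieldInfo type and swap_fields function are hypothetical names, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One scalar member of the struct: where it starts and how many bytes it occupies.
// In a real tool these rows would be generated from the debug info.
struct FieldInfo {
    std::size_t offset;
    std::size_t size;
};

// Reverse the bytes of every listed scalar in place, converting the raw
// image between big-endian and little-endian member by member.
inline void swap_fields(std::vector<std::uint8_t>& image,
                        const std::vector<FieldInfo>& fields) {
    for (const FieldInfo& f : fields) {
        assert(f.offset + f.size <= image.size());
        std::uint8_t* first = image.data() + f.offset;
        std::reverse(first, first + f.size);
    }
}
```

Padding bytes and char arrays are simply left out of the table, so they stay untouched. After the swap, the image can be copied into the struct with memcpy, given that the layouts are otherwise identical as stated in the question.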

Recent versions of GCC allow the declaration of the desired endianness irrespective of the target platform for a source code section using the pragma scalar_storage_order or a specific type using an attribute with the same identifier. The main catch: g++ does not support this. Also, this won't work in all cases. For example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.
The persistence layout is based on the original struct layout - so be it. However, a more explicit approach of serializing the structs should be preferred for exactly the reason you bring this up. Besides the endianness issue, struct packing also affects compatibility and should be explicitly specified. For persistence, a packing of 1 would be optimal. For in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux/Windows - LP64 vs. LLP64). So, keeping the persistence layout separate from in-memory data structures tends to have a long list of advantages and therefore usually outweighs the disadvantage of having to maintain the serialization code separately, particularly if portability is a major concern.
You could take advantage of C/C++-based reflection libraries or implement one yourself. In case of C, this will definitely require macros (e.g. Metaresc). In case of C++, you might actually get away with your original struct definitions (e.g. Boost.Precise and Flat Reflection).
If reflection is not an option, you could generate the serialization code either by parsing the headers or debug symbols. Generally, parsing C/C++ is more complex. By moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. To make things easier, you could simplify parsing by processing the gdb output of ptype based on debug symbols. Or, you could parse debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than sticking to generating the serialization code as part of the build process, you could generate that code once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing that would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input, it would only have to be good enough for one-time generation.

Big Endian and Little Endian support for byte ordering

We need to support 3 hardware platforms - Windows (little Endian) and Linux Embedded (big and little Endian). Our data stream is dependent on the machine it uses and the data needs to be broken into bit fields.
I would like to write a single macro (if possible) to abstract away the detail. On Linux I can use bswap_16/bswap_32/bswap_64 for Little Endian conversions.
However, I can't find this in my Visual C++ includes.
Is there a generic built-in for both platforms (Windows and Linux)?
If not, then what can I use in Visual C++ to do byte swapping (other than writing it myself - hoping some machine optimized built-in)?
Thanks.

On both platforms you have
for short (16bit): htons() and ntohs()
for long (32bit): htonl() and ntohl()
The missing htonll() and ntohll() for long long (64bit) could easily be built from those two. See this implementation for example.
Update-0:
For the example linked above, Simon Richter mentions in a comment that it does not necessarily have to work. The reason: the compiler might introduce extra bytes somewhere in the unions used. To work around this, the unions need to be packed, which might lead to a performance loss.
So here's another fail-safe approach to build the *ll functions: https://stackoverflow.com/a/955980/694576
Update-0.1:
From bames53's comment I tend to conclude that the first example linked above should not be used with C++, only with C.
Update-1:
To achieve the functionality of the *ll functions on Linux this approach might be the 'best'.
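For reference, one way to build the 64-bit variants from htonl()/ntohl() without unions is sketched below. This is not the code from either linked answer; the names host_to_net64 and net_to_host64 are made up to avoid clashing with platforms that already define htonll/ntohll.

```cpp
#include <cstdint>
#include <cstring>

#ifdef _WIN32
#include <winsock2.h>   // htonl / ntohl on Windows (link with Ws2_32)
#else
#include <arpa/inet.h>  // htonl / ntohl on POSIX systems
#endif

// Convert a 64-bit value to network byte order by converting each 32-bit half
// and placing the high half first in memory.
inline std::uint64_t host_to_net64(std::uint64_t value) {
    const std::uint32_t high = static_cast<std::uint32_t>(value >> 32);
    const std::uint32_t low  = static_cast<std::uint32_t>(value & 0xFFFFFFFFu);
    const std::uint32_t parts[2] = { htonl(high), htonl(low) };
    std::uint64_t result;
    std::memcpy(&result, parts, sizeof result);  // memcpy instead of a union
    return result;
}

// Inverse of host_to_net64.
inline std::uint64_t net_to_host64(std::uint64_t value) {
    std::uint32_t parts[2];
    std::memcpy(parts, &value, sizeof value);
    const std::uint64_t high = ntohl(parts[0]);
    const std::uint64_t low  = ntohl(parts[1]);
    return (high << 32) | low;
}
```

Because the halves are moved with memcpy rather than through a union, the padding and C-versus-C++ concerns raised in the comments should not apply to this variant.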

htons and htonl (and similar macros) are good if you insist on dealing with byte sex.
However, it's much better to sidestep the issue by outputting your data in ASCII or similar. It takes a little more room, and it transmits over the net a little more slowly, but the simplicity and futureproofing is worth it.
Another option is to numerically take apart your int's and short's. So you & 0xff and divide by 256 repeatedly. This gives a single format on all architectures. But ASCII's still got the edge because it's easier to debug with.
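The "take apart numerically" option could look roughly like this. The function names are invented for the example, and the choice of least significant byte first is arbitrary; what matters is that writer and reader agree on it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Split a 32-bit value into four bytes, least significant byte first.
// Only arithmetic is used, so the result is the same on every architecture.
inline std::array<std::uint8_t, 4> take_apart(std::uint32_t value) {
    std::array<std::uint8_t, 4> bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        bytes[i] = static_cast<std::uint8_t>(value & 0xff);
        value /= 256;
    }
    return bytes;
}

// Rebuild the value from the four bytes produced by take_apart.
inline std::uint32_t put_together(const std::array<std::uint8_t, 4>& bytes) {
    std::uint32_t value = 0;
    for (std::size_t i = bytes.size(); i-- > 0;) {
        value = value * 256 + bytes[i];
    }
    return value;
}
```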

Not the same names, but the same functionality does exist.
EDIT: Archived Link -> https://web.archive.org/web/20151207075029/http://msdn.microsoft.com/en-us/library/a3140177(v=vs.80).aspx
_byteswap_uint64, _byteswap_ulong, _byteswap_ushort
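A thin wrapper along these lines gives one set of names on both toolchains. The wrapper names swap16/swap32/swap64 are made up for this sketch; the GCC/Clang branch assumes the __builtin_bswap* builtins are available (they are in reasonably recent versions), and the last branch is a plain fallback for other compilers.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ushort, _byteswap_ulong, _byteswap_uint64
#endif

inline std::uint16_t swap16(std::uint16_t v) {
#if defined(_MSC_VER)
    return _byteswap_ushort(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap16(v);
#else
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
#endif
}

inline std::uint32_t swap32(std::uint32_t v) {
#if defined(_MSC_VER)
    return _byteswap_ulong(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#else
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
#endif
}

inline std::uint64_t swap64(std::uint64_t v) {
#if defined(_MSC_VER)
    return _byteswap_uint64(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);
#else
    // Swap each half, then exchange the halves.
    const std::uint64_t low  = swap32(static_cast<std::uint32_t>(v & 0xFFFFFFFFu));
    const std::uint64_t high = swap32(static_cast<std::uint32_t>(v >> 32));
    return (low << 32) | high;
#endif
}
```

Inline functions are used here rather than a macro so the argument types are checked, but the same #if structure would work inside a macro definition.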

Creating and using a cross platform struct in C++

I am writing a cross platform game with networking capabilities (using SFML and RakNet) and I have come to the point where I have compiled the server on my Ubuntu server and got a client going on my Mac. All the development is done on my Mac so I have initially been testing the server on that, and it has worked fine.
I am sending structs over the network and then simply casting them back from char * to (for example) inet::PlayerAdded. Now this has been working fine (for the most part), but my question is: Will this always work? It seems like a very fragile approach. Will the struct's always be laid out the same even on other platforms, Windows, for example? What would you recommend?
#pragma pack(push, 1)
struct Player
{
int dir[2];
int left;
float depth;
float elevation;
float velocity[2];
char character[50];
char username[50];
};
// I have been added to the game and my ID is back
struct PlayerAdded: Packet
{
id_type id;
Player player;
};
#pragma pack(pop)

This won't work if (among other things) you attempt to do it from little-endian machine to big-endian machine as the correct int representation will be reversed between the two.
This could also fail if the alignment or packing of your structure changes from machine to machine. What if you have some 64-bit machines and some that are 32-bit?
You need to use a proper portable serialization library like Boost.Serialization or Google Protocol Buffers to ensure you have a wire protocol (aka transmissible data format) that can be decoded successfully independent of the hardware.
One nice thing about Protocol Buffers is that you can compress the data transparently using a ZLIB-compatible stream that is also compatible with the protobuf streams. I have actually done this, it works well. I imagine other decorator streams can be used in an analogous way to enhance or optimize your basic wire protocol as needed.

Like many of the other answers I'd advise against sending raw binary data if it can be avoided. Something like Boost.Serialization or Google Protobuf will do a fine job without too much overhead.
But you certainly can create cross-platform binary structures; it is done all the time and is a very valid way of exchanging data. Layering a "struct" over that data just makes sense. You do however have to be very careful about layout; fortunately most compilers give you many options to control it. "pack" is one such option and takes care of a lot.
You also need to take care with data sizes. Simply include stdint.h and use the fixed-size types like uint32_t. Be careful with floating-point values, as not all architectures share the same representation; for 32-bit float they likely do. As for endianness, most architectures will use the same, and if they don't you can simply flip it on the client that differs.
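As a sketch of the floating-point caveat: you can have the compiler confirm that float is a 32-bit IEEE 754 type and then ship its bit pattern as a uint32_t, which can be byte-swapped like any other integer. The function names below are invented for the example.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Refuse to compile on a platform where the assumptions do not hold.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floating point");

// Copy the bit pattern of a float into a fixed-width integer.
inline std::uint32_t float_to_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Inverse of float_to_bits.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

In the Player struct from the question, the int members would similarly become int32_t so that their width does not depend on the platform.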

The answer to "... laid out the same even on other platforms ..." is generally no. This is so even if such issues as different CPUs and/or different endianness are addressed.
Different operating systems (even on the same hardware platform) might use different data representations; this is normally called the "platform ABI" and it's different between e.g. 32bit/64bit Windows, 32bit/64bit Linux, MacOSX.
'#pragma pack' is only half the way, because beyond alignment restrictions there can be data type size differences. For example, "long" on 64bit Windows is 32bit while it's 64bit on Linux and MacOSX.
That said, the problem obviously isn't new and has been addressed in the past already - the remote procedure call standard (RPC) contains mechanisms for how to define data structures in a platform-independent way and how to encode/decode "buffers" representing these structs. It's called "XDR" (eXternal Data Representation). See RFC1832.
As programming goes, this wheel has been reinvented several times; whether you convert to XML, do the low level work yourself with XDR, use google::protobuf, boost or Qt::Variant, there's lot to choose from.
On the purely implementation side: for simplicity, just assume that "unsigned int" everywhere is 32bit, aligned at a 32bit boundary; if you can encode all your data as an array of 32bit values then the only externalization issue you have to deal with is endianness.

C/C++ reliability of memcpy sizeof(typname) cross application over TCP

I have been doing some file IO in a project I am currently working on, and so far I have been reading in a whole block of data using the following fast and convenient method:
struct Header { ... };
class Data { ... };
// note that I have not used compiler directives to pack/align/order bytes
// partly because I don't know how to.
Header _header;
Data _data;
std::ifstream fin(filename);
fin.read((char*)&_header, sizeof(Header));
fin.read((char*)&_data, sizeof(Data));
fin.close();
My question is whether it is OK to assume the bytes are aligned and ordered in the same way for every compiler and every different computer?
For example, if I take the Header struct and compile a client program, on linux, and a server program on windows. Are the bytes in the same order such that there will be no issues receiving and sending both ways?

No, that's not guaranteed at all. There is a specific network byte order and as far as I know, both WinAPI and POSIX provide local to network translation functions. In addition, the alignment you can control with compiler directives. But you have to explicitly take care of both of these things.

This is a solved problem, avoid re-inventing that wheel. XML is the lingua franca of machines talking to each other over the internet. It's chatty, if bandwidth is a concern then look for Google's protobuf. It has many language bindings, C++ is well supported.

Well, in the general case of course it isn't. Some platforms have different byte orderings (endianness) than others.
Also, on 64-bit platforms some common integer types (like size_t) are liable to be different sizes than you expect. All C guarantees is that sizeof(long)>=sizeof(short)>=sizeof(char), so there could even be a perverse platform out there where long, short, and char are all the same size.
Realistically, if both your platforms are Intel and both your OS's are 32-bit (or both are 64-bit) you are probably OK. Still it would be better to look into your compiler's directives for nailing alignment and ordering down a bit better. Sadly, C and C++ are not (yet) nearly as good at doing this as Ada. You could use a standard encoding like ASN.1 to fix this problem, but ASN.1 is a cure that is almost always worse than the disease.

You are using sizeof operator. So it should be safe, at least source level compatible.

Binary files and cross platform compatibility

I have written a C++ library that saves my data (a collection of custom structs etc) into a binary file. I currently use (i.e. create and consume) the files locally, on my Windows (XP) machine. For simplicity, lets think of the library in two parts: a writer (Creates the files) and a reader or consumer (simply reads data from the files).
Recently though, I would like to also consume (i.e. read) the data files I have created on my XP machine, on my Linux machine. I must point out at this stage that both machines are PCs (so have the same endianness etc).
I can build a reader (and compile for Linux [Ubuntu 9.10 to be precise]), since I am the library creator. My question, before I embark down this road (of building the reader etc) is:
Assuming I have successfully built the reader for Linux,
can I simply copy across files that were created on the Windows (XP) machine to the Linux (Ubuntu 9.10) machine and use the Linux reader to successfully read the copied-over files?

For the files to be binary compatible:
endianness must match (as it does for you)
bitfield packing order must be the same
sizes and signedness of types must be the same
the compiler must make the same decisions about padding and alignment
It's certainly possible for all of these conditions to be fulfilled, or for you to not happen to be hitting any cases for which they are not. At the very least, though, I'd add some sanity checks and/or sentinel members to detect problems.
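The sanity checks could be as simple as a small header with a few sentinel values. The sketch below is only one possible shape: the FileHeader fields, the constants and the check_header function are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t kMagic = 0x42535452u;          // arbitrary file-type tag
constexpr std::uint32_t kByteOrderMark = 0x01020304u;  // reads as 0x04030201 if byte order differs

struct FileHeader {
    std::uint32_t magic;        // identifies the file type
    std::uint32_t byte_order;   // sentinel for endianness mismatches
    std::uint32_t header_size;  // sizeof(FileHeader) as seen by the writer
};

enum class HeaderCheck { Ok, TooShort, WrongEndianness, BadMagic, LayoutMismatch };

// What the writer would put at the start of the file.
inline FileHeader make_header() {
    return FileHeader{kMagic, kByteOrderMark,
                      static_cast<std::uint32_t>(sizeof(FileHeader))};
}

// What the reader would run on the first bytes of the file before trusting the rest.
inline HeaderCheck check_header(const unsigned char* data, std::size_t length) {
    if (length < sizeof(FileHeader)) return HeaderCheck::TooShort;
    FileHeader h;
    std::memcpy(&h, data, sizeof h);
    if (h.byte_order == 0x04030201u) return HeaderCheck::WrongEndianness;
    if (h.magic != kMagic || h.byte_order != kByteOrderMark) return HeaderCheck::BadMagic;
    if (h.header_size != sizeof(FileHeader)) return HeaderCheck::LayoutMismatch;
    return HeaderCheck::Ok;
}
```

The same idea extends to storing sizeof of each record type, so that a padding or type-size difference between the two builds is reported instead of silently producing garbage.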

Binary files should be compatible across machines with the same endianness.
The issue you may have in your code is the size of ints, you can't necessarily assume that the compiler on different OS's has the same size int. So either copy blocks of bytes and cast them, or use int16, int32 etc.

Structs are not a file format, and you shouldn't try to use them as such.
When attempting to make structs work with fread and fwrite, there's a huge number of hacks to make it work. You byte-swap integers so that you can share files between little-endian and big-endian machines. You change your structs to use fixed-width integer types, so you can share between machines with different word sizes (such as between x86 and x64 machines). You add compiler-specific pragmas to control the padding of structs to share between compiler versions.
It works, but it's ugly. Not to mention, easy to get wrong.
Much like the recommendation in The byte order fallacy, a much better idea is to write code to read/write the fields individually. By writing your own code, you can ensure there's no padding, and you can choose integer sizes independently of the local size of integers, and you can support both endiannesses without byte-swapping (by reading/writing the bytes of an integer separately).
Unlike the hacky approach, this is hard to get wrong. Further, because you don't rely on any compiler or architecture specific behaviors, either your code will work on all compilers and architectures, or none. If you do it right, you shouldn't have any platform-specific bugs.
There is one downside; individually reading/writing the fields will be slower than just using fread/fwrite directly. You can set up a buffer (uint8_t buffer[]) and write the entirety of the data into it, and then write everything out at once, which might help, but it'll still be slower (because you'd still have to move the fields into the buffer one at a time), but for most purposes it'll still be fast enough (exceptions being embedded / real-time systems or extremely high performance computing).
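Here is a minimal sketch of the field-by-field approach with a byte buffer. The Record type and its fields are invented for illustration, big-endian is an arbitrary choice for the file format, and the label is assumed to be shorter than 65536 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record; the in-memory layout is irrelevant to the file format.
struct Record {
    std::uint32_t id;
    std::int16_t  delta;
    std::string   label;
};

inline void put_u16(std::vector<std::uint8_t>& out, std::uint16_t v) {
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v & 0xff));
}

inline void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    put_u16(out, static_cast<std::uint16_t>(v >> 16));
    put_u16(out, static_cast<std::uint16_t>(v & 0xffff));
}

inline std::uint16_t get_u16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

inline std::uint32_t get_u32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(get_u16(p)) << 16) | get_u16(p + 2);
}

// File layout: id (4 bytes), delta (2), label length (2), label bytes.
inline std::vector<std::uint8_t> serialize(const Record& r) {
    std::vector<std::uint8_t> out;
    put_u32(out, r.id);
    put_u16(out, static_cast<std::uint16_t>(r.delta));
    put_u16(out, static_cast<std::uint16_t>(r.label.size()));
    out.insert(out.end(), r.label.begin(), r.label.end());
    return out;  // write the whole buffer to the file in one call
}

inline bool deserialize(const std::vector<std::uint8_t>& in, Record& r) {
    const std::size_t fixed = 8;
    if (in.size() < fixed) return false;
    const std::size_t length = get_u16(&in[6]);
    if (in.size() != fixed + length) return false;
    r.id = get_u32(&in[0]);
    // Assumes two's complement when converting back to a signed value.
    r.delta = static_cast<std::int16_t>(get_u16(&in[4]));
    r.label.assign(in.begin() + fixed, in.end());
    return true;
}
```

Since only shifts and masks are used, no byte-swapping or endianness detection appears anywhere, and no padding can sneak into the output.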

If:
the machines have the same endianness (as you stated they have) and
you do open the streams in binary mode, as text mode might do funny things e.g. with line-ends and
you have programmed cleanly so you don't stumble over implementation-defined stuff like alignments, data type sizes, and struct packing,
then yes, your files should be portable.
The third bullet point is what makes a file format a "portable" one. Depending on what kind of data you have in your structs, it can be very easy or a bit tricky. Bitfields, or data being reinterpreted from a different type are especially tricky.

You might consider taking a look at the Boost Serialization Library.
A lot of thought has been put into it, and it will handle many of the potential cross-platform incompatibilities for you.
Of course, it's possible that it's overkill for your particular use case, especially if you've already got your writers & readers implemented.
