I'm working on a C++ program that, for good reason(1), requires a binary data format stored on disk. Composing that data are arbitrary struct entries.
My program has both 32-bit and 64-bit versions and it's possible that the binary data file could be written by one and read by another. This means that the fields of the stored structures must be of types with predictable sizes and alignments so that the resulting layout is identical for both natural word sizes.
I'm concerned that a future maintainer might accidentally violate this by adding an int without really thinking or having something like a single uint32_t followed immediately by a uint64_t.
Is there any way to do a compile-time check (i.e. static_assert) that a structure will be laid out identically on both 32-bit and 64-bit systems? What about a run-time check if the former isn't possible?
Conceptually, I think it would be something like this:
for (every field):
static_assert: sizeof_32(field) == sizeof_64(field)
static_assert: offset_of(next_field) == offset_of(field) + sizeof(field)
Or more simply:
static_assert: sizeof_32(struct) == sizeof_64(struct)
Given that the program is being compiled for both bit sizes, it would technically be okay to assert on only one architecture since that would still expose the problem.
It's also okay if the structures being checked are somewhat restricted (such as requiring explicit padding fields) so long as it can be guaranteed correct.
The file is memory-mapped and all reads/writes are random-access
through pointers. Serialization is not an option.
This is the closest thing to "automatic" that I could come up with:
For all structures that are going to be used within this persistent binary data, add an attribute with the expected instance size.
struct MyPersistentBinaryStructure {
// Expected size for 32/64-bit check.
static constexpr size_t kExpectedInstanceSize = 80;
... 80 bytes of fixed size fields and appropriate padding ...
};
Then, in the code that looks up the address of structures within that binary data, check that value:
template <typename T>
T* GetAsObject(Reference ref) {
static_assert(std::is_pod<T>::value, "only simple objects");
static_assert(T::kExpectedInstanceSize == sizeof(T), "inconsistent size");
return reinterpret_cast<T*>(GetPointerFromRef(ref));
}
Any build that compiles the structure to a different size will give a compile-time error. This doesn't future-proof the build because a definition that would be different for width X won't get caught until it is actually built on an architecture of width X, but at least you'll know and maybe be able to adapt the structure without breaking the format (e.g. 32-bit int -> int32_t).
Doing this turned out to be worth the effort as it immediately found three 32/64 incompatibilities within code that I'd manually checked with significant care. Two of those errors would have caused data corruption; the other was just some extra tail padding.
This is probably more of a hack than an answer, but I do believe you could use something like this (which will make things clearer for any future maintainer as well):
#include <limits.h>
#if ULONG_MAX == (0xffffffffffffffffUL) // 64 bit code here
// ...
#elif ULONG_MAX == (0xffffffffUL) // 32 bit code here
// ...
#else
#error unsupported
#endif
P.S.
Having said that... I would avoid directly using structs when writing to files.
There's too much that can go wrong, and that's in addition to file bloating (the structs are padded, meaning you'll get a lot of arbitrary junk data in the files).
Better to use a serialization function that stores and loads each field separately and does so byte by byte (or bit by bit), avoiding such issues as 32/64 bits and endianess.
EDIT:
I saw the comments about using a mapped file for IO... kinda reminiscent of a database implementation.
In this case, I would probably settle for an explicit comment in the code for the struct and have all fields (where possible) be either bit-size explicit or unions. i.e.:
// this type is defined to make sure pointer sizes are the same for
// 64bit and 32bit systems.
typedef union {
void *ptr;
char _space[8];
} pntr_field;
struct {
size_t length : 32; // explicit bit count for 64bit and 32bit compatibility
size_t number : 32;
pntr_field my_ptr; // side-note: I would avoid pointers,
size_t offset : 32; // side-note: offsets are better for persistence.
} my_struct;
However... even in this situation, assuming the file is expected to be transferrable across systems, I would probably use getter / setter functions with "root" offset style pointers.
This would allow both for the data to be compressed (avoid struct padding concerns) and allow for the ever-changing memory address of the mapped file (since every program restart, all the pointers will become invalid and what we really care about will be the offset of the data relative to the root of the file or the object)...
Good Luck!
Related
I have read about problems with loading structs from file. There are problems with endianness and different variable sizes. But let us say that there is a structure like this one:
struct Structure
{
uint8_t value1;
uint16_t value2;
uint32_t value3;
uint64_t value;4
};
Let us say that the file is always written in little-endian format, so application reads it in strict way. In such case endianness should not cause any problems. (Let us assume that there is some kind of convertEndinness() function which is clever enough to omit byte order issue). The second thing which I know is neccesery to consider is variable size variety. There is my question. Do fixed size types manage to handle this problem and what else should I consider in order to create multiplatform binary file?
Do fixed size types manage to handle this problem
Not quite.
The fixed-size types have fixed sizes, but their alignment requirements (and therefore padding) may vary between platforms and/or ABI flavours. So, your struct could still have different layout on different platforms even with the same endianness.
You can insist that there should be no padding, and use some compiler-specific and non-standard way to specify this (like #pragma pack or __attribute__((packed))). This can produce worse code for accessing misaligned members directly, though.
what else should I consider in order to create multiplatform binary file?
If you choose an endianness, use fixed-size types and specify the alignment correctly, you're probably fine.
I'd strongly suggested adding a header and/or some framing information, with a version and possibly some metadata about the sizes and alignments you chose. Otherwise you can never change this file format in the future without things breaking in unpleasant ways.
If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.
If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.
I have been having alot of trouble with this stupid struct. I don't see why it is doing this, and I am really not sure how to fix it. The only way I know how to fix it, is by removing the struct and doing it some other way(which I don't want to do).
So I am reading data from a file, and I am reading it in to a struct pointer all at once. It seems like the offset/pointer of my 'long long' gets messed up everytime. View in details below.
So here is my struct:
struct Entry
{
unsigned short type;
unsigned long long identifier;
unsigned int offset_specifier, length;
};
And here is my code for reading all the crap into the struct pointer/array:
Entry *entries = new Entry[SOME_DYNAMIC_AMOUNT];
fread(entries, sizeof(Entry), SOME_DYNAMIC_AMOUNT, openedFile);
As you can see, I write all that into my struct array. Now, I will show you the data I am reading(for the first struct in this example).
So this is the data that is going into the first element in 'entries'. The first item(the short, 'type'), seems to be read fine. After that, when the 'identifier' is read, it seems like the whole struct is shifted X amount of bytes. Here is a picture of the first element(after reversing the endian):
And here is the data in memory(the red square is where it begins):
I know that was a bit confusing, but I tried to explain it as well as possible. Thanks for any help, Hetelek. :)
Structures are padded with extra bytes so that the fields are faster to access. You can prevent this with #pragma pack:
#pragma pack(push, 1)
struct Entry
{
/* ... */
};
#pragma pack(pop)
Note that this might not be 100% portable (I know that at least GCC and MSVC support it for x86).
Reading and writing structs to a file in binary is perilous.
The problem you're running into here is that the compiler inserts padding (needed for alignment) between the type and identifier members of your structure. Apparently whatever program wrote the data (which you haven't told us about) used a different layout that the program that's trying to read the data.
This could happen if the two systems (the one writing the data and the one reading it) have different alignment requirements, and therefore different layouts for the Entry type.
Alignment is not the only potential problem, though; differences in endianness can also be a serious problem. Different systems might have differing sizes for the predefined integer types. You can't assume that struct Entry will have a consistent layout unless all the code that deals with it runs on a single system -- and ideally with the same version of the same compiler.
You might be able to use #pragma pack to work around this, but I don't recommend it. It's not portable, and it can be unsafe. At best, it will work around the problem of padding between members; there are still plenty of ways the layout can vary from one system to another.
It's impossible to give you a definitive solution without knowing where and how the data layout of the file you're reading is defined.
If we assume that the file layout for each record is, for example:
A 2-byte unsigned integer in network byte order (type)
An 8-byte integer in network byte order (identifier)
A 4-byte integer in network byte order (offset_specifier, length)
with no padding between them
then you should either read the data into an unsigned char[] buffer, or into objects of type uint16_t, uint32_t, and uint64_t (defined in <cstdint> or <stdint.h>), and then translate it from network byte order to local byte order.
You can wrap this conversion in a function that reads from the file and converts the data, storing it in an Entry struct.
If you're able to assume that the program will only run on a restricted set of systems, then you can bypass some of this. For example, you might be able to tweak the declaration of struct Entry so it matches the file format, and read and write it directly. Doing so will mean your code isn't portable to some systems. You'll have to decide which price you're willing to pay.
I've seen a few questions and answers regarding to the endianness of structs, but they were about detecting the endianness of a system, or converting data between the two different endianness.
What I would like to now, however, if there is a way to enforce specific endianness of a given struct. Are there some good compiler directives or other simple solutions besides rewriting the whole thing out of a lot of macros manipulating on bitfields?
A general solution would be nice, but I would be happy with a specific gcc solution as well.
Edit:
Thank you for all the comments pointing out why it's not a good idea to enforce endianness, but in my case that's exactly what I need.
A large amount of data is generated by a specific processor (which will never ever change, it's an embedded system with a custom hardware), and it has to be read by a program (which I am working on) running on an unknown processor. Byte-wise evaluation of the data would be horribly troublesome because it consists of hundreds of different types of structs, which are huge, and deep: most of them have many layers of other huge structs inside.
Changing the software for the embedded processor is out of the question. The source is available, this is why I intend to use the structs from that system instead of starting from scratch and evaluating all the data byte-wise.
This is why I need to tell the compiler which endianness it should use, it doesn't matter how efficient or not will it be.
It does not have to be a real change in endianness. Even if it's just an interface, and physically everything is handled in the processors own endianness, it's perfectly acceptable to me.
The way I usually handle this is like so:
#include <arpa/inet.h> // for ntohs() etc.
#include <stdint.h>
class be_uint16_t {
public:
be_uint16_t() : be_val_(0) {
}
// Transparently cast from uint16_t
be_uint16_t(const uint16_t &val) : be_val_(htons(val)) {
}
// Transparently cast to uint16_t
operator uint16_t() const {
return ntohs(be_val_);
}
private:
uint16_t be_val_;
} __attribute__((packed));
Similarly for be_uint32_t.
Then you can define your struct like this:
struct be_fixed64_t {
be_uint32_t int_part;
be_uint32_t frac_part;
} __attribute__((packed));
The point is that the compiler will almost certainly lay out the fields in the order you write them, so all you are really worried about is big-endian integers. The be_uint16_t object is a class that knows how to convert itself transparently between big-endian and machine-endian as required. Like this:
be_uint16_t x = 12;
x = x + 1; // Yes, this actually works
write(fd, &x, sizeof(x)); // writes 13 to file in big-endian form
In fact, if you compile that snippet with any reasonably good C++ compiler, you should find it emits a big-endian "13" as a constant.
With these objects, the in-memory representation is big-endian. So you can create arrays of them, put them in structures, etc. But when you go to operate on them, they magically cast to machine-endian. This is typically a single instruction on x86, so it is very efficient. There are a few contexts where you have to cast by hand:
be_uint16_t x = 37;
printf("x == %u\n", (unsigned)x); // Fails to compile without the cast
...but for most code, you can just use them as if they were built-in types.
A bit late to the party but with current GCC (tested on 6.2.1 where it works and 4.9.2 where it's not implemented) there is finally a way to declare that a struct should be kept in X-endian byte order.
The following test program:
#include <stdio.h>
#include <stdint.h>
struct __attribute__((packed, scalar_storage_order("big-endian"))) mystruct {
uint16_t a;
uint32_t b;
uint64_t c;
};
int main(int argc, char** argv) {
struct mystruct bar = {.a = 0xaabb, .b = 0xff0000aa, .c = 0xabcdefaabbccddee};
FILE *f = fopen("out.bin", "wb");
size_t written = fwrite(&bar, sizeof(struct mystruct), 1, f);
fclose(f);
}
creates a file "out.bin" which you can inspect with a hex editor (e.g. hexdump -C out.bin). If the scalar_storage_order attribute is suppported it will contain the expected 0xaabbff0000aaabcdefaabbccddee in this order and without holes. Sadly this is of course very compiler specific.
No, I dont think so.
Endianness is the attribute of processor that indicates whether integers are represented from left to right or right to left it is not an attribute of the compiler.
The best you can do is write code which is independent of any byte order.
Try using
#pragma scalar_storage_order big-endian to store in big-endian-format
#pragma scalar_storage_order little-endian to store in little-endian
#pragma scalar_storage_order default to store it in your machines default endianness
Read more here
No, there's no such capability. If it existed that could cause compilers to have to generate excessive/inefficient code so C++ just doesn't support it.
The usual C++ way to deal with serialization (which I assume is what you're trying to solve) this is to let the struct remain in memory in the exact layout desired and do the serialization in such a way that endianness is preserved upon deserialization.
I am not sure if the following can be modified to suit your purposes, but where I work, we have found the following to be quite useful in many cases.
When endianness is important, we use two different data structures. One is done to represent how it expected to arrive. The other is how we want it to be represented in memory. Conversion routines are then developed to switch between the two.
The workflow operates thusly ...
Read the data into the raw structure.
Convert to the "raw structure" to the "in memory version"
Operate only on the "in memory version"
When done operating on it, convert the "in memory version" back to the "raw structure" and write it out.
We find this decoupling useful because (but not limited to) ...
All conversions are located in one place only.
Fewer headaches about memory alignment issues when working with the "in memory version".
It makes porting from one arch to another much easier (fewer endian issues).
Hopefully this decoupling can be useful to your application too.
A possible innovative solution would be to use a C interpreter like Ch and force the endian coding to big.
Boost provides endian buffers for this.
For example:
#include <boost/endian/buffers.hpp>
#include <boost/static_assert.hpp>
using namespace boost::endian;
struct header {
big_int32_buf_t file_code;
big_int32_buf_t file_length;
little_int32_buf_t version;
little_int32_buf_t shape_type;
};
BOOST_STATIC_ASSERT(sizeof(h) == 16U);
Maybe not a direct answer, but having a read through this question can hopefully answer some of your concerns.
You could make the structure a class with getters and setters for the data members. The getters and setters are implemented with something like:
int getSomeValue( void ) const {
#if defined( BIG_ENDIAN )
return _value;
#else
return convert_to_little_endian( _value );
#endif
}
void setSomeValue( int newValue) {
#if defined( BIG_ENDIAN )
_value = newValue;
#else
_value = convert_to_big_endian( newValue );
#endif
}
We do this sometimes when we read a structure in from a file - we read it into a struct and use this on both big-endian and little-endian machines to access the data properly.
There is a data representation for this called XDR. Have a look at it.
http://en.wikipedia.org/wiki/External_Data_Representation
Though it might be a little too much for your Embedded System. Try searching for an already implemented library that you can use (check license restrictions!).
XDR is generally used in Network systems, since they need a way to move data in an Endianness independent way. Though nothing says that it cannot be used outside of networks.