I have been having alot of trouble with this stupid struct. I don't see why it is doing this, and I am really not sure how to fix it. The only way I know how to fix it, is by removing the struct and doing it some other way(which I don't want to do).
So I am reading data from a file, and I am reading it in to a struct pointer all at once. It seems like the offset/pointer of my 'long long' gets messed up everytime. View in details below.
So here is my struct:
struct Entry
{
unsigned short type;
unsigned long long identifier;
unsigned int offset_specifier, length;
};
And here is my code for reading all the crap into the struct pointer/array:
Entry *entries = new Entry[SOME_DYNAMIC_AMOUNT];
fread(entries, sizeof(Entry), SOME_DYNAMIC_AMOUNT, openedFile);
As you can see, I write all that into my struct array. Now, I will show you the data I am reading(for the first struct in this example).
So this is the data that is going into the first element in 'entries'. The first item(the short, 'type'), seems to be read fine. After that, when the 'identifier' is read, it seems like the whole struct is shifted X amount of bytes. Here is a picture of the first element(after reversing the endian):
And here is the data in memory(the red square is where it begins):
I know that was a bit confusing, but I tried to explain it as well as possible. Thanks for any help, Hetelek. :)
Structures are padded with extra bytes so that the fields are faster to access. You can prevent this with #pragma pack:
#pragma pack(push, 1)
struct Entry
{
/* ... */
};
#pragma pack(pop)
Note that this might not be 100% portable (I know that at least GCC and MSVC support it for x86).
Reading and writing structs to a file in binary is perilous.
The problem you're running into here is that the compiler inserts padding (needed for alignment) between the type and identifier members of your structure. Apparently whatever program wrote the data (which you haven't told us about) used a different layout that the program that's trying to read the data.
This could happen if the two systems (the one writing the data and the one reading it) have different alignment requirements, and therefore different layouts for the Entry type.
Alignment is not the only potential problem, though; differences in endianness can also be a serious problem. Different systems might have differing sizes for the predefined integer types. You can't assume that struct Entry will have a consistent layout unless all the code that deals with it runs on a single system -- and ideally with the same version of the same compiler.
You might be able to use #pragma pack to work around this, but I don't recommend it. It's not portable, and it can be unsafe. At best, it will work around the problem of padding between members; there are still plenty of ways the layout can vary from one system to another.
It's impossible to give you a definitive solution without knowing where and how the data layout of the file you're reading is defined.
If we assume that the file layout for each record is, for example:
A 2-byte unsigned integer in network byte order (type)
An 8-byte integer in network byte order (identifier)
A 4-byte integer in network byte order (offset_specifier, length)
with no padding between them
then you should either read the data into an unsigned char[] buffer, or into objects of type uint16_t, uint32_t, and uint64_t (defined in <cstdint> or <stdint.h>), and then translate it from network byte order to local byte order.
You can wrap this conversion in a function that reads from the file and converts the data, storing it in an Entry struct.
If you're able to assume that the program will only run on a restricted set of systems, then you can bypass some of this. For example, you might be able to tweak the declaration of struct Entry so it matches the file format, and read and write it directly. Doing so will mean your code isn't portable to some systems. You'll have to decide which price you're willing to pay.
Related
I have read about problems with loading structs from file. There are problems with endianness and different variable sizes. But let us say that there is a structure like this one:
struct Structure
{
uint8_t value1;
uint16_t value2;
uint32_t value3;
uint64_t value;4
};
Let us say that the file is always written in little-endian format, so application reads it in strict way. In such case endianness should not cause any problems. (Let us assume that there is some kind of convertEndinness() function which is clever enough to omit byte order issue). The second thing which I know is neccesery to consider is variable size variety. There is my question. Do fixed size types manage to handle this problem and what else should I consider in order to create multiplatform binary file?
Do fixed size types manage to handle this problem
Not quite.
The fixed-size types have fixed sizes, but their alignment requirements (and therefore padding) may vary between platforms and/or ABI flavours. So, your struct could still have different layout on different platforms even with the same endianness.
You can insist that there should be no padding, and use some compiler-specific and non-standard way to specify this (like #pragma pack or __attribute__((packed))). This can produce worse code for accessing misaligned members directly, though.
what else should I consider in order to create multiplatform binary file?
If you choose an endianness, use fixed-size types and specify the alignment correctly, you're probably fine.
I'd strongly suggested adding a header and/or some framing information, with a version and possibly some metadata about the sizes and alignments you chose. Otherwise you can never change this file format in the future without things breaking in unpleasant ways.
If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.
I'm working on a C++ program that, for good reason(1), requires a binary data format stored on disk. Composing that data are arbitrary struct entries.
My program has both 32-bit and 64-bit versions and it's possible that the binary data file could be written by one and read by another. This means that the fields of the stored structures must be of types with predictable sizes and alignments so that the resulting layout is identical for both natural word sizes.
I'm concerned that a future maintainer might accidentally violate this by adding an int without really thinking or having something like a single uint32_t followed immediately by a uint64_t.
Is there any way to do a compile-time check (i.e. static_assert) that a structure will be laid out identically on both 32-bit and 64-bit systems? What about a run-time check if the former isn't possible?
Conceptually, I think it would be something like this:
for (every field):
static_assert: sizeof_32(field) == sizeof_64(field)
static_assert: offset_of(next_field) == offset_of(field) + sizeof(field)
Or more simply:
static_assert: sizeof_32(struct) == sizeof_64(struct)
Given that the program is being compiled for both bit sizes, it would technically be okay to assert on only one architecture since that would still expose the problem.
It's also okay if the structures being checked are somewhat restricted (such as requiring explicit padding fields) so long as it can be guaranteed correct.
The file is memory-mapped and all reads/writes are random-access
through pointers. Serialization is not an option.
This is the closest thing to "automatic" that I could come up with:
For all structures that are going to be used within this persistent binary data, add an attribute with the expected instance size.
struct MyPersistentBinaryStructure {
// Expected size for 32/64-bit check.
static constexpr size_t kExpectedInstanceSize = 80;
... 80 bytes of fixed size fields and appropriate padding ...
};
Then, in the code that looks up the address of structures within that binary data, check that value:
template <typename T>
T* GetAsObject(Reference ref) {
static_assert(std::is_pod<T>::value, "only simple objects");
static_assert(T::kExpectedInstanceSize == sizeof(T), "inconsistent size");
return reinterpret_cast<T*>(GetPointerFromRef(ref));
}
Any build that compiles the structure to a different size will give a compile-time error. This doesn't future-proof the build because a definition that would be different for width X won't get caught until it is actually built on an architecture of width X, but at least you'll know and maybe be able to adapt the structure without breaking the format (e.g. 32-bit int -> int32_t).
Doing this turned out to be worth the effort as it immediately found three 32/64 incompatibilities within code that I'd manually checked with significant care. Two of those errors would have caused data corruption; the other was just some extra tail padding.
This is probably more of a hack than an answer, but I do believe you could use something like this (which will make things clearer for any future maintainer as well):
#include <limits.h>
#if ULONG_MAX == (0xffffffffffffffffUL) // 64 bit code here
// ...
#elif ULONG_MAX == (0xffffffffUL) // 32 bit code here
// ...
#else
#error unsupported
#endif
P.S.
Having said that... I would avoid directly using structs when writing to files.
There's too much that can go wrong, and that's in addition to file bloating (the structs are padded, meaning you'll get a lot of arbitrary junk data in the files).
Better to use a serialization function that stores and loads each field separately and does so byte by byte (or bit by bit), avoiding such issues as 32/64 bits and endianess.
EDIT:
I saw the comments about using a mapped file for IO... kinda reminiscent of a database implementation.
In this case, I would probably settle for an explicit comment in the code for the struct and have all fields (where possible) be either bit-size explicit or unions. i.e.:
// this type is defined to make sure pointer sizes are the same for
// 64bit and 32bit systems.
typedef union {
void *ptr;
char _space[8];
} pntr_field;
struct {
size_t length : 32; // explicit bit count for 64bit and 32bit compatibility
size_t number : 32;
pntr_field my_ptr; // side-note: I would avoid pointers,
size_t offset : 32; // side-note: offsets are better for persistence.
} my_struct;
However... even in this situation, assuming the file is expected to be transferrable across systems, I would probably use getter / setter functions with "root" offset style pointers.
This would allow both for the data to be compressed (avoid struct padding concerns) and allow for the ever-changing memory address of the mapped file (since every program restart, all the pointers will become invalid and what we really care about will be the offset of the data relative to the root of the file or the object)...
Good Luck!
I have a challenging situation; we will have programs on Mac, PC, iOS and Android receiving files in a legacy format and parsing data from those files. We cannot change how those files are created.
The files are produced by a C++ program filling a struct with numbers and Strings and then writing it out. Here's a sanitized version.
struct MyObject {
String Kfkj(MAXKYS);
String Oern(MAXKYS);
String Vdflj(MAXKYS, 9);
int Muic;
int Tdfkj;
int VdfkAsdk;
int SsdjsdDsldsk;
int Ndsoief;
String TdflsajPdlj;
String TdckjdfPas;
String AdsfakjIdd;
int IdkfjdKasdkj;
int AsadkjaKadkja(MAXKYS);
int Kasldsdkj;
bool Usadl;
String PsadkjOasdj(9);
String PasdkjOsdkj;
};
Primitives and Strings, as you can see.
Then here is how they write it out to a file:
MyInstance MyObject;
FileName = "C:\MyFile.ab2"
ofstream fout (FileName, ios::binary);
fout.write((char*)& MyInstance, sizeof(MyInstance));
There is no option for us to translate it once and then distribute the file to other platforms; we must translate it on each and every different platform, and this is what we have to work with. I'd appreciate any information on how C++ serializes data, so we know how to parse the file.
EDIT: solution
The feedback I received from multiple answers here was VERY helpful. Using that, I did extensive analysis with hex editors and discovered:
the elements come in the file one after another
a "String," in this case, starts with an int describing how many characters follow the int for that String. If the String does not exist, it will still have that int with a value of 0.
integers, for the files and machines I saw, are two bytes, little-endian, and MOSTLY unsigned (there were a few that were signed, just to keep me on my toes)
the boolean was two bytes, with apparently -1 (FF FF) representing "true"
So far we have not ran into issues with different padding or endianness on different devices, but those are very real concerns. The skilled notes and warnings in these answers provides us with more ammunition to try to convince the client to change to a less fragile alternative, such as XML or JSON, for transferring data online across platforms.
As for those of you asking if the developer was fired... well, let's just say their code is very old, but after multiple conversations we're still having trouble convincing them writing out the C++ struct and trying to read that on different platforms is not a good idea.
You're going to run into many problems.
C++ doesn't have a specific format for serializing data per se. It is highly dependent on the computer architecture/processor that you are running on.
The compiler is allowed to add padding to help alignment on systems. When we say alignment we basically are referring to an architecture/processor's affinity for having data lie on specific byte boundaries. For example, some processors vastly prefer floating point numbers to lie at 4 or 8 byte boundaries - if they don't the processor may work much slower or may not work at all.
So, you can't simply know what padding your system is adding magically.
What you can do is use #pragma pack(1) / #pragma pack(0) to stop your compiler from padding your numbers.
PS: you also have to worry about endianness. What if one computer is running on big-endian and one is little endian? They will interpret bytes differently without a conversion.
Simply put, you either have to fix the application generating the files so it uses a proper serialization scheme OR you need to look at it running on a SPECIFIC computer, look at exactly how it writes the files, and write a translator for every target platform (which is just silly).
Interesting Suggestion
If you're really stuck, write an app that monitors the folder where you write files. Have the app pick up the files (since it's on the same PC it'll be able to read their format without issue). Have it write the files back in XML or some other true serialization format and distribute those instead.
Whoa - that's crazy. So String objects don't contain any pointers? Must not- because you claim this is working code.
Anyway, that code isn't doing any serialization. Its just writing the structure out to file exactly the way it is laid out in memory. The only issue you have is that on some platforms padding and sizeof integral types like int may be different.
You'll have to find the size of the integral types, and use that information in reader/writer for newer platforms to make sure they get laid out the same way on the legacy platform.
You're running a real risk with that code though. As it is, a compiler change could suddenly cause the file layout to change.
The format of your data file is entirely down to the compiler that your C++ program is compiled with, and the definition of your String class. You can rely on the fields being in the order they're declared in, and in this case, I think you can rely on there not being any padding at the start, but that's about all. Some tips that might help you out in this case:-
You don't give the definition of the String class you're using. If it's a typedef for std::string, you're completely screwed, because the contents of the string aren't in the memory. I assume your C++ programmers are using some special local buffer, in which case I'll guess you will find the first bytes of the object are the string, and there is some amount of useless padding afterwards. I hope the struct contains an int at the start telling you how much data in it is useful.
You'll probably find the int fields are four bytes long.
You'll probably find the bool field is one byte long, followed by three bytes of useless padding. Only one bit, most likely the bottom bit, will be set.
That's about all the useful guesswork I can offer you. In your target language, make sure to read the whole file in as the closest thing to a byte array available in the language, and only after that, use the language features to convert it into the right kind of thing in your language. Don't try reading it in as integers, as that won't let you byte-swap if you're on a platform with different endianness to the C++ program. I suggest also looking through the file in a text editor to reverse-engineer it and help you find the offset of each field.
Last piece of advice: consider printing P45s (or pink slips, or whatever you have in your country) for whichever programmers or project managers thought this kind of 'serialization' was a good idea. This kind of sloppy work might have been acceptable in a life-or-death situation, but they have seriously screwed you over in a way you're going to find it very hard to recover from. Writing the code to read in these files will not be that hard, if it's only one struct like this, but keeping it reliable will be a world of pain, and they've effectively made it impossible for themselves to change compilers or compiler version safely.
The way it's done, the struct is written in raw form to a file. So basically what you need to know to parse this file is the binary layout of your struct.
Basically, the fields are just one after the other, so to read an int, you just read 4 bytes and cast that to an int, etc.
Strings are a particular case. It's not clear from your code whether this "String" type is an inline array of characters, or a pointer to such an array. In the first case, you need to know how many characters each string contains and simply read that number of characters sequentially. In the second case, you won't be able to get the string back, since it won't have been written to file. The pointer will be useless to you.
One last concern is whether the struct is packed or not. Since you gave no indication to that, by default struct fields are aligned to 4-bytes boundaries, so there may be space for instance after the boolean field that you need to account for. If the struct is packed, then each field comes directly after the previous.
So, to make a long story short, figure out your struct binary layout using its definition and, if all else fails, inspecting the memory at run-time with the debugger, or use a hex editor to study the output file. Then write that specification down somewhere and this will give you what you need to read from the file. It's impossible to tell exactly what that layout is simply by looking at the pseudo-definition you gave.
Writing in an ofstream does not serialize data. This code write the raw memory content of the struct as it was a string of char. Depending of your compiler, its version, its options and the system it is running on the content will be completely different.
Even the number of bits of a char is allowed to change between c++ implementation.
Data referenced by the object of the struct won't be written (forget the content of std::string).
If you cannot change the writer code. You must know the alignment policy, the size of base type and the data representation. You will have to analyze files produced by hand, for example with an hexadecimal editor like this one
http://www.physics.ohio-state.edu/~prewett/hexedit/
, and probably look at your compiler documentation.
If you can change the writer code. Use proper serialization like json, protocol buffer or simply xml.
No one has pointed out something that sticks out to me as particularly problematic (maybe because I've been bit by it). That problem: the data member bool Usadl;. sizeof(bool) varies across platforms, across compilers, and even across releases of the same compiler. Common values for sizeof(bool) are 4 and 1. This will bite you. It's getting hard to find a big endian machine nowadays, very, very hard to find a computer where CHAR_BIT is not 8 or sizeof(int) is not 4. This is not the case for sizeof(bool).
In agreement with everyone else, Chad's team needs to document the structure of the records in the file, and then make sure the program that produces the file writes this structure explicitly, including element sizes, padding, and endianness. Don't depend on class layout to do this for you. That's just asking for trouble.
The best way would probably be to use JSON or if you want a more robust solution go with something like Avro. Avro has a C++ API and a Java API, so it covers most of the cases you're encountering.
I have a disk image which contains a standard image using fuse. The Superblock contains the following, and I have a function read_superblock(*buf) that returns the following raw data:
Bytes 0-3: Magic Number (0xC0000112)
4-7: Block Size (1024)
8-11: Total file system size (in blocks)
12-15: FAT length (in blocks)
16-19: Root Directory (block number)
20-1023: NOT USED
I am very new to C and to get me started on this project I am curious what is a simple way to read this into a structure or some variables and simply print them out to the screen using printf for debugging.
I was initially thinking of doing something like the following thinking I could see the raw data, but I think this is not the case. There is also no structure and I am trying to read it in as a string which also seems terribly wrong. for me to grab data out of. Is there a way for me to specify the structure and define the number of bytes in each variable?
char *buf;
read_superblock(*buf);
printf("%s", buf);
Yes, I think you'd be better off reading this into a structure. The fields containing useful data are all 32-bit integers, so you could define a structure that looks like this (using the types defined in the standard header file stdint.h):
typedef struct SuperBlock_Struct {
uint32_t magic_number;
uint32_t block_size;
uint32_t fs_size;
uint32_t fat_length;
uint32_t root_dir;
} SuperBlock_t;
You can cast the structure to a char* when calling read_superblock, like this:
SuperBlock_t sb;
read_superblock((char*) &sb);
Now to print out your data, you can make a call like the following:
printf("%d %d %d %d\n",
sb.magic_number,
sb.block_size,
sb.fs_size,
sb.fat_length,
sb.root_dir);
Note that you need to be aware of your platform's endianness when using a technique like this, since you're reading integer data (i.e., you may need to swap bytes when reading your data). You should be able to determine that quickly using the magic number in the first field.
Note that it's usually preferable to pass a structure like this without casting it; this allows you to take advantage of the compiler's type-checking and eliminates potential problems that casting may hide. However, that would entail changing your implementation of read_superblock to read data directly into a structure. This is not difficult and can be done using the standard C runtime function fread (assuming your data is in a file, as hinted at in your question), like so:
fread(&sb.magic_number, sizeof(sb.magic_number), 1, fp);
fread(&sb.block_size, sizeof(sb.block_size), 1, fp);
...
Two things to add here:
It's a good idea, when pulling raw data into a struct, to set the struct to have zero padding, even if it's entirely composed of 32-bit unsigned integers. In gcc you do this with #pragma pack(0) before the struct definition and #pragma pack() after it.
For dealing with potential endianness issues, two calls to look at are ntohs() and ntohl(), for 16- and 32-bit values respectively. Note that these swap from network byte order to host byte order; if these are the same (which they aren't on x86-based platforms), they do nothing. You go from host to network byte order with htons() and htonl(). However, since this data is coming from your filesystem and not the network, I don't know if endianness is an issue. It should be easy enough to figure out by comparing the values you expect (e.g. the block size) with the values you get, in hex.
It's not difficult to print the data after you successfully copied data into a structure Emerick proposed. Suppose the instance of the structure you use to hold data is named SuperBlock_t_Instance.
Then you can print its fields like this:
printf("Magic Number:\t%u\nBlock Size:\t%u\n etc",
SuperBlock_t_Instance.magic_number,
SuperBlock_t_Instance.block_size);