I am attempting to read in a binary file. The problem is that the creator of the file took no time to properly align data structures to their natural boundaries and everything is packed tight. This makes it difficult to read the data using C++ structs.
Is there a way to force a struct to be packed tight?
Example:
struct {
short a;
int b;
};
The above structure is 8 bytes: 2 for short a, 2 for padding, 4 for int b. However, on disk, the data is only 6 bytes (not having the 2 bytes of padding for alignment)
Please be aware the actual data structures are thousands of bytes and many fields, including a couple arrays, so I would prefer not to read each field individually.
If you're using GCC, you can do struct __attribute__ ((packed)) { short a; int b; }
On VC++ you can do #pragma pack(1). This option is also supported by GCC.
#pragma pack(push, 1)
struct { short a; int b; };
#pragma pack(pop)
Other compilers may have options to do a tight packing of the structure with no padding.
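To make the pragma approach concrete, here is a minimal sketch, assuming a compiler that honours #pragma pack (GCC, Clang and MSVC do) and a file whose fields are stored in the host's byte order. The names PackedRecord and readRecord are made up for illustration and are not from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#pragma pack(push, 1)
struct PackedRecord {   // hypothetical name; same fields as the question's struct
    short a;
    int b;
};
#pragma pack(pop)

// With 1-byte packing the struct should contain no padding at all.
static_assert(sizeof(PackedRecord) == sizeof(short) + sizeof(int),
              "PackedRecord contains padding");

// Copy one on-disk record out of a raw byte buffer.
// Assumption: the bytes are stored in the host's byte order.
PackedRecord readRecord(const unsigned char* bytes, std::size_t available) {
    assert(available >= sizeof(PackedRecord));
    PackedRecord r;
    std::memcpy(&r, bytes, sizeof r);
    return r;
}
```

The same struct can be filled directly with fread(&record, sizeof record, 1, file); the memcpy form just makes the sketch testable without a file.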
You need to use a compiler-specific, non-standard directive to specify 1-byte packing, such as this one under Windows:
#pragma pack (push, 1)
The problem is that the creator of the file took no time to properly
byte align the data structures and everything is packed tight.
Actually, the designer did the right thing. Padding is something that the Standard says can be applied, but it doesn't say how much padding should be applied in what cases. The Standard doesn't even say how many bits are in a byte. You might assume that, even though these things aren't specified, they will still have the same reasonable values on all modern machines, but that's simply not true. On a 32-bit Windows machine, for example, the padding might be one thing, whereas on the 64-bit version of Windows it might be something else. Maybe it will be the same -- that's not the point. The point is you don't know what the padding will be on different systems.
So by "packing it tight" the developer did the only thing they could: use a packing that they can be reasonably sure every system will be able to understand. In this case, that commonly understood packing is to use no padding in structures saved to disk or sent down a wire.
Related
I've come across a problem with interop between C# and C++ where I'm sharing memory between the two 'sides' of my application via a struct defined in both native and managed code. The struct on the native side is defined as such:
#pragma pack(push, 1)
struct RayTestCollisionDesc {
btVector3 hitPosition;
btRigidBody* hitBody;
RayTestCollisionDesc(btRigidBody* body, btVector3& position)
: hitBody(body), hitPosition(position) { }
};
#pragma pack(pop)
And a similar struct is defined on the managed (C#) side. On C# the struct size is 20 bytes (as I would expect on a 32-bit system). However, despite the pragma pack directive, the struct size on the C++ side is still 32. For clarity, here's the sizeof() from C++ of each of those types:
sizeof(btVector3) : 16
sizeof(btRigidBody*) : 4
sizeof(RayTestCollisionDesc) : 32
Clearly pragma pack is only referring to packing between members of the struct, and not padding at the end of the struct (i.e. alignment). I also tried adding __declspec(align(1)) but that had no effect, and MSDN itself does say "__declspec(align(#)) can only increase alignment restrictions."
And FWIW I'm using the VS2013 compiler (Platform Toolset v120).
Is there a way to 'force' the struct size to 20 bytes?
You are transferring data between two different compilers. In general, it is impossible to make this match. Especially if you transfer data from one computer to another.
First write a spec for the data that you are transferring. The spec CANNOT be a C++ or C# struct. The spec must be something like "four bytes this, four bytes that, two bytes a third thing..." and so on "with a total of 20 bytes", for example.
Then whether you use C++ or C#, or whether a totally different developer uses Objective-C code to read the data, you read an array of 20 bytes, take the 20 bytes, look at them, and fill whatever struct you want with these 20 bytes. And for writing, you do the opposite. Now it does not matter what C++ compiler you use, what weird pragmas you are using, it just works.
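As a rough illustration of "read 20 bytes and fill whatever struct you want", here is a sketch in C++. The wire format is an assumption invented for this example (four little-endian IEEE-754 floats followed by a little-endian 32-bit handle); it is not the asker's actual format, and CollisionRecord, encode and decode are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wire format (an assumption for illustration):
//   bytes  0..15  four little-endian IEEE-754 32-bit floats (x, y, z, w)
//   bytes 16..19  little-endian 32-bit unsigned handle
constexpr std::size_t kWireSize = 20;

struct CollisionRecord {          // in-memory form; its padding is irrelevant
    float position[4];
    std::uint32_t bodyHandle;
};

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

std::uint32_t loadU32LE(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

void storeU32LE(unsigned char* p, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<unsigned char>(v >> (8 * i));
}

// Fill the struct from the 20 wire bytes, one field at a time.
// Assumes floats and integers share the same byte order on this machine.
CollisionRecord decode(const std::array<unsigned char, kWireSize>& wire) {
    CollisionRecord rec;
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits = loadU32LE(wire.data() + 4 * i);
        std::memcpy(&rec.position[i], &bits, sizeof bits);
    }
    rec.bodyHandle = loadU32LE(wire.data() + 16);
    return rec;
}

// The opposite direction: produce the 20 wire bytes from the struct.
std::array<unsigned char, kWireSize> encode(const CollisionRecord& rec) {
    std::array<unsigned char, kWireSize> wire{};
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &rec.position[i], sizeof bits);
        storeU32LE(wire.data() + 4 * i, bits);
    }
    storeU32LE(wire.data() + 16, rec.bodyHandle);
    return wire;
}
```

Because every field is placed by explicit offset, sizeof(CollisionRecord) never has to equal 20 and no packing pragma is involved.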
Or use something portable like JSON.
You can't do interop this way. A C# object and a C++ struct are different. I suggest you use a serialization library like Cap'n Proto or Protobuf.
I have been having a lot of trouble with this stupid struct. I don't see why it is doing this, and I am really not sure how to fix it. The only way I know how to fix it is by removing the struct and doing it some other way (which I don't want to do).
So I am reading data from a file, and I am reading it into a struct pointer all at once. It seems like the offset/pointer of my 'long long' gets messed up every time. See the details below.
So here is my struct:
struct Entry
{
unsigned short type;
unsigned long long identifier;
unsigned int offset_specifier, length;
};
And here is my code for reading all the crap into the struct pointer/array:
Entry *entries = new Entry[SOME_DYNAMIC_AMOUNT];
fread(entries, sizeof(Entry), SOME_DYNAMIC_AMOUNT, openedFile);
As you can see, I write all that into my struct array. Now, I will show you the data I am reading(for the first struct in this example).
So this is the data that is going into the first element in 'entries'. The first item(the short, 'type'), seems to be read fine. After that, when the 'identifier' is read, it seems like the whole struct is shifted X amount of bytes. Here is a picture of the first element(after reversing the endian):
And here is the data in memory(the red square is where it begins):
I know that was a bit confusing, but I tried to explain it as well as possible. Thanks for any help, Hetelek. :)
Structures are padded with extra bytes so that the fields are faster to access. You can prevent this with #pragma pack:
#pragma pack(push, 1)
struct Entry
{
/* ... */
};
#pragma pack(pop)
Note that this might not be 100% portable (I know that at least GCC and MSVC support it for x86).
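To see what the pragma changes, the sketch below compares the question's Entry with a packed twin. PackedEntry and paddingAfterType are names I made up; the exact amount of default padding depends on the ABI, so only the packed offsets are asserted at compile time.

```cpp
#include <cassert>
#include <cstddef>

struct Entry {                 // default layout, as in the question
    unsigned short type;
    unsigned long long identifier;
    unsigned int offset_specifier, length;
};

#pragma pack(push, 1)
struct PackedEntry {           // hypothetical packed twin of Entry
    unsigned short type;
    unsigned long long identifier;
    unsigned int offset_specifier, length;
};
#pragma pack(pop)

// In the packed twin, identifier follows type immediately.
static_assert(offsetof(PackedEntry, identifier) == sizeof(unsigned short),
              "pack(1) did not remove the padding");

// Number of bytes the compiler inserted between type and identifier
// in the default layout (ABI dependent; typically 6 on x86-64).
std::size_t paddingAfterType() {
    assert(offsetof(Entry, identifier) >= sizeof(unsigned short));
    return offsetof(Entry, identifier) - sizeof(unsigned short);
}
```

Those inserted bytes are exactly the "shift" seen after the first field when the packed file data is read straight into the unpacked struct.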
Reading and writing structs to a file in binary is perilous.
The problem you're running into here is that the compiler inserts padding (needed for alignment) between the type and identifier members of your structure. Apparently whatever program wrote the data (which you haven't told us about) used a different layout than the program that's trying to read the data.
This could happen if the two systems (the one writing the data and the one reading it) have different alignment requirements, and therefore different layouts for the Entry type.
Alignment is not the only potential problem, though; differences in endianness can also be a serious problem. Different systems might have differing sizes for the predefined integer types. You can't assume that struct Entry will have a consistent layout unless all the code that deals with it runs on a single system -- and ideally with the same version of the same compiler.
You might be able to use #pragma pack to work around this, but I don't recommend it. It's not portable, and it can be unsafe. At best, it will work around the problem of padding between members; there are still plenty of ways the layout can vary from one system to another.
It's impossible to give you a definitive solution without knowing where and how the data layout of the file you're reading is defined.
If we assume that the file layout for each record is, for example:
A 2-byte unsigned integer in network byte order (type)
An 8-byte integer in network byte order (identifier)
Two 4-byte integers in network byte order (offset_specifier, length)
with no padding between them
then you should either read the data into an unsigned char[] buffer, or into objects of type uint16_t, uint32_t, and uint64_t (defined in <cstdint> or <stdint.h>), and then translate it from network byte order to local byte order.
You can wrap this conversion in a function that reads from the file and converts the data, storing it in an Entry struct.
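A minimal sketch of such a function, under the assumed file layout listed above (2 + 8 + 4 + 4 bytes, network byte order, no padding). The helper names loadBE and decodeEntry are mine, and the record size is only as real as that assumed layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Entry {                  // in-memory form; padding here does not matter
    std::uint16_t type;
    std::uint64_t identifier;
    std::uint32_t offset_specifier, length;
};

// On-disk record size under the assumed layout: 2 + 8 + 4 + 4 bytes.
constexpr std::size_t kRecordSize = 18;

// Read an n-byte big-endian (network order) unsigned integer.
std::uint64_t loadBE(const unsigned char* p, std::size_t n) {
    assert(n <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) v = (v << 8) | p[i];
    return v;
}

// Convert one raw record (kRecordSize bytes) into an Entry.
Entry decodeEntry(const unsigned char* rec) {
    Entry e;
    e.type             = static_cast<std::uint16_t>(loadBE(rec, 2));
    e.identifier       = loadBE(rec + 2, 8);
    e.offset_specifier = static_cast<std::uint32_t>(loadBE(rec + 10, 4));
    e.length           = static_cast<std::uint32_t>(loadBE(rec + 14, 4));
    return e;
}
```

A caller would fread kRecordSize bytes at a time into an unsigned char buffer and pass it to decodeEntry. Since the bytes are assembled with shifts, the result does not depend on the host's endianness or struct padding.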
If you're able to assume that the program will only run on a restricted set of systems, then you can bypass some of this. For example, you might be able to tweak the declaration of struct Entry so it matches the file format, and read and write it directly. Doing so will mean your code isn't portable to some systems. You'll have to decide which price you're willing to pay.
So, I'm coding some packet structures (Ethernet, IP, etc.) and noticed that some of them are followed by __attribute__((packed)), which prevents the GCC compiler from attempting to add padding to them. This makes sense, because these structures are supposed to go onto the wire.
But then, I counted the words:
struct ether_header
{
u_int8_t ether_dhost[ETH_ALEN]; /* destination eth addr */
u_int8_t ether_shost[ETH_ALEN]; /* source ether addr */
u_int16_t ether_type; /* packet type ID field */
} __attribute__ ((packed));
This is copied from a site, but my code also uses 2 uint8_t and 1 uint16_t. This adds up to two words (4 bytes).
Depending on the source, the system prefers that structures be aligned according to multiples of 4, 8, or even 16 bits. So, I don't see why the __attribute__((packed)) is necessary, as afaik this shouldn't get packed.
Also, why the double brackets ((packed)) why not use one pair?
If your structure is already a multiple of the right size, then no, the __attribute__((packed)) isn't strictly necessary, but it's still a good idea, in case your structure size ever changes for any reason. If you add/delete fields, or change ETH_ALEN, you'll still want __attribute__((packed)).
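One way to check this for yourself is to declare the header twice and compare. This is only a sketch: it uses the GCC/Clang attribute, kEthAlen is my stand-in for ETH_ALEN (6 on Ethernet), and the struct and function names are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kEthAlen = 6;   // stand-in for ETH_ALEN

struct PlainEtherHeader {             // no packing requested
    std::uint8_t  dhost[kEthAlen];
    std::uint8_t  shost[kEthAlen];
    std::uint16_t type;
};

struct PackedEtherHeader {
    std::uint8_t  dhost[kEthAlen];
    std::uint8_t  shost[kEthAlen];
    std::uint16_t type;
} __attribute__((packed));            // GCC/Clang extension

// The packed form is guaranteed to match the 14-byte wire layout.
static_assert(sizeof(PackedEtherHeader) == 2 * kEthAlen + 2,
              "packed header must be 14 bytes");

// True when the attribute made no difference on this particular ABI.
constexpr bool packedIsRedundant() {
    return sizeof(PlainEtherHeader) == sizeof(PackedEtherHeader)
        && offsetof(PlainEtherHeader, type) == offsetof(PackedEtherHeader, type);
}
```

On common desktop ABIs packedIsRedundant() should come out true, but I would not assume that for every target, which is the argument for keeping the attribute.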
I believe the double parentheses are needed to make it easy to make your code compatible with non-gcc compilers. By using them, you can then just do this:
#define __attribute__(x)
And then all attributes that you specify will disappear. The extra parentheses mean there is only one argument passed to the macro (instead of one or more), regardless of how many attributes you specify, and your compiler does not need to support variadic macros.
Although your system may prefer some specific alignment, other systems might not. Even if the __attribute__((packed)) has no effect, it's a good touch of paranoia.
As for why it uses double parentheses: this GCC-specific extension requires them. Single parentheses will result in an error.
In Win32, you can do it like this:
#pragma pack(push) //save current status
#pragma pack(4)//set following as 4 aligned
struct test
{
char m1;
double m4;
int m3;
};
#pragma pack(pop) //restore
packed refers to the padding/alignment inside the structure, not the alignment of the structure. For instance
struct {
char x;
int y;
};
Most compilers will allocate y at offset 4 unless you declare the struct as packed (in which case y will get allocated at an offset of 1).
For this structure, even if ETH_ALEN is an odd number, you have two of them, so the uint16 variable will necessarily be at a two or zero byte offset, and the packed attribute won't do anything. Depending on packed is a bad idea for portability, because the mechanisms for packing aren't portable, and if you use them you may have to byte-copy in and out of your member variables to avoid misalignment exceptions on platforms where this matters.
I realize that in general the C and C++ standards give compiler writers a lot of latitude. But in particular they guarantee that POD types like C struct members have to be laid out in memory in the same order that they're listed in the struct's definition, and most compilers provide extensions letting you fix the alignment of members. So if you had a header that defined a struct and manually specified the alignment of its members, then compiled two apps with different compilers using the header, shouldn't one app be able to write an instance of the struct into shared memory and the other app be able to read it without errors?
I am assuming though that the size of the types contained is consistent across two compilers on the same architecture (it has to be the same platform already since we're talking about shared memory). I realize that this is not always true for some types (e.g. long vs. long long in GCC and MSVC 64-bit) but nowadays there are uint16_t, uint32_t, etc. types, and float and double are specified by IEEE standards.
As long as you can guarantee the exact same memory layout, including offsets, and the data types have the same sizes between the 2 compilers then yes this is fine. Because at that point the struct is identical with respect to data access.
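One way to turn that guarantee into something each compiler checks for you is to put static_asserts in the shared header. A minimal sketch, with a made-up SharedSample struct whose members already sit on their natural boundaries:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared-memory record. Every member sits on its natural
// boundary, so no compiler-specific packing directive is needed.
struct SharedSample {
    std::uint32_t sequence;   // expected offset 0
    std::uint16_t channel;    // expected offset 4
    std::uint16_t flags;      // expected offset 6
    double        value;      // expected offset 8
};

// If either compiler disagrees with the intended layout, the build fails
// instead of the two applications silently misreading each other's data.
static_assert(sizeof(double) == 8, "sketch assumes a 64-bit double");
static_assert(offsetof(SharedSample, sequence) == 0, "unexpected offset");
static_assert(offsetof(SharedSample, channel)  == 4, "unexpected offset");
static_assert(offsetof(SharedSample, flags)    == 6, "unexpected offset");
static_assert(offsetof(SharedSample, value)    == 8, "unexpected offset");
static_assert(sizeof(SharedSample) == 16, "unexpected size");
```

This only checks offsets and sizes; it says nothing about byte order or floating-point representation, which both sides still have to share.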
Yes, sure. I've done this many times. The problems and solutions are the same whether mixed code is compiled and linked together, or when transmitting struct-formatted data between machines.
In the bad old days, this frequently occurred when integrating MS C and almost anything else: Borland Turbo C, DEC VAX C, Greenhills C.
The easy part is getting the number of bytes for various data types to agree. For example short on a 32-bit compiler on one side being the same as int on a 16-bit compiler at the other end. Since common source code to declare structures is usually a good thing, a number of to-the-point declarations are helpful:
typedef signed long s32;
typedef signed short s16;
typedef signed char s8;
typedef unsigned long u32;
typedef unsigned short u16;
typedef unsigned char u8;
...
Microsoft C is the most annoying. Its default is to pad members to 16-bit alignment, and maybe more with 64-bit code. Other compilers on x86 don't pad members.
struct {
int count;
char type;
char code;
char data [100];
} variable;
It might seem like the offset of code should be the next byte after type, but there might be a padding byte inserted between. The fix is usually
#ifdef _MSC_VER // if it's any Microsoft compiler
#pragma pack(1) // byte align structure members--that is, no padding
#endif
There is also a compiler command line option to do the same.
The way memory is laid out is important in addition to the datatype size if you need struct from library 1 compiled by compiler 1 to be used in library 2 compiled by compiler 2.
It is indeed possible, you just have to make sure that all compilers involved generate the same data structure from the same code. One way to test this is to write a sample program that creates a struct and writes it to a binary file. Open the resulting files in a hex editor and verify that they are the same. Alternatively, you can cast the struct to an array of uint8_t and dump the individual bytes to the screen.
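The byte-dump check could look roughly like this. hexDump and Sample are hypothetical names; the idea is to build the same object with each compiler and compare the strings (or the files) they produce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <type_traits>

// Render the object representation of a trivially copyable value as
// space-separated hex bytes, in memory order.
template <typename T>
std::string hexDump(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "hexDump needs a trivially copyable type");
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    std::string out;
    char tmp[4];
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        std::snprintf(tmp, sizeof tmp, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += tmp;
    }
    return out;
}

struct Sample {               // hypothetical struct to compare across compilers
    std::uint8_t kind;
    std::uint8_t code;
    std::uint8_t data[2];
};
```

For structs that do contain padding, zero the object first (for example with memset) before filling it in, otherwise the padding bytes may hold arbitrary values and the dumps will differ for reasons unrelated to layout.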
One way to make sure that the data sizes are the same is to use data types like int16_t (from stdint.h) instead of a plain old int which may change sizes between compilers (although this is rare on two compilers running on the same platform).
It's not as difficult as it sounds. There are many pre-compiled libraries out there that can be used with multiple compilers. The key thing is to build a test program that will let you verify that both compilers are treating the structure equally.
Refer to your compiler manuals.
most compilers provide extensions letting you fix the alignment of members
Are you restricting yourself to those compilers and a mutually compatible #pragma align style? If so, the safety is dictated by their specification.
In the interest of portability, you are possibly better off ditching #pragma align and relying on your ABI, which may provide a "reasonable" standard for compliance of all compilers of your platform.
As the C and C++ standards allow any deterministic struct layout methodology, they're essentially irrelevant.
For example, if I declare a long variable, can I assume it will always be aligned on a "sizeof(long)" boundary? Microsoft Visual C++ online help says so, but is it standard behavior?
some more info:
a. It is possible to explicitly create a misaligned integer (*bar):
char foo[5];
int * bar = (int *)(&foo[1]);
b. Apparently, #pragma pack() only affects structures, classes, and unions.
c. MSVC documentation states that POD types are aligned to their respective sizes (but is it always or by default, and is it standard behavior, I don't know)
As others have mentioned, this isn't part of the standard and is left up to the compiler to implement as it sees fit for the processor in question. For example, VC could easily implement different alignment requirements for an ARM processor than it does for x86 processors.
Microsoft VC implements what is basically called natural alignment up to the size specified by the #pragma pack directive or the /Zp command line option. This means that, for example, any POD type with a size smaller or equal to 8 bytes will be aligned based on its size. Anything larger will be aligned on an 8 byte boundary.
If it is important that you control alignment for different processors and different compilers, then you can use a packing size of 1 and pad your structures.
#pragma pack(push)
#pragma pack(1)
struct Example
{
short data1; // offset 0
short padding1; // offset 2
long data2; // offset 4
};
#pragma pack(pop)
In this code, the padding1 variable exists only to make sure that data2 is naturally aligned.
Answer to a:
Yes, that can easily cause misaligned data. On an x86 processor, this doesn't really hurt much at all. On other processors, this can result in a crash or a very slow execution. For example, the Alpha processor would throw a processor exception which would be caught by the OS. The OS would then inspect the instruction and then do the work needed to handle the misaligned data. Then execution continues. The __unaligned keyword can be used in VC to mark unaligned access for non-x86 programs (i.e. for CE).
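If the goal is just to get an int in or out of an arbitrary byte offset, a portable alternative to the cast in (a) is to copy the bytes with memcpy, so no misaligned int pointer is ever formed. A small sketch (the function names are mine):

```cpp
#include <cassert>
#include <cstring>

// Read an int that may start at any address, without creating an int*.
int loadUnalignedInt(const char* p) {
    assert(p != nullptr);
    int v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Write an int to an address that may not be suitably aligned.
void storeUnalignedInt(char* p, int v) {
    assert(p != nullptr);
    std::memcpy(p, &v, sizeof v);
}
```

Compilers generally turn these small fixed-size copies into whatever load or store sequence the target allows, so the cost on x86 is usually negligible.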
By default, yes. However, it can be changed via the pack() #pragma.
I don't believe the C++ Standard make any requirement in this regard, and leaves it up to the implementation.
C and C++ don't mandate any kind of alignment. But natural alignment is strongly preferred by x86 and is required by most other CPU architectures, and compilers generally do their utmost to keep CPUs happy. So in practice you won't see a compiler generate misaligned data unless you really twist its arm.
Yes, all types are always aligned to at least their alignment requirements.
How could it be otherwise?
But note that the sizeof() of a type is not the same as its alignment.
You can use the following macro to determine the alignment requirements of a type:
#define ALIGNMENT_OF( t ) offsetof( struct { char x; t test; }, test )
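That macro defines a struct inside offsetof, which C accepts but C++ compilers may reject. In C++11 and later the alignof operator gives the answer directly. The sketch below also shows the same char-then-member trick written with a named template (AlignProbe and alignmentByOffset are hypothetical helpers, intended for simple standard-layout types):

```cpp
#include <cassert>
#include <cstddef>

// Same idea as the macro: a char followed by the type of interest.
template <typename T>
struct AlignProbe {
    char x;
    T test;
};

// Offset of the second member, which the compiler pushes up to the
// alignment the member needs inside a struct.
template <typename T>
constexpr std::size_t alignmentByOffset() {
    return offsetof(AlignProbe<T>, test);
}

// The standard operator, for comparison.
static_assert(alignof(char) == 1, "char always has alignment 1");
```

For the basic integer types the two usually agree, e.g. alignmentByOffset<int>() and alignof(int), though a few ABIs align some types differently inside structs than on their own.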
Depends on the compiler, the pragmas and the optimisation level. With modern compilers you can also choose time or space optimisation, which could change the alignment of types as well.
Generally it will be because reading/writing to it is faster that way. But almost every compiler has a switch to turn this off. In gcc it's -malign-???. Aggregates are generally aligned and sized based on the alignment requirements of each element within.