How do I organize members in a struct to waste the least space on alignment? - c++

[Not a duplicate of Structure padding and packing. That question is about how and when padding occurs. This one is about how to deal with it.]
I have just realized how much memory is wasted as a result of alignment in C++. Consider the following simple example:
#include <iostream>
using std::cout;

struct X
{
int a;
double b;
int c;
};
int main()
{
cout << "sizeof(int) = " << sizeof(int) << '\n';
cout << "sizeof(double) = " << sizeof(double) << '\n';
cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
cout << "but sizeof(X) = " << sizeof(X) << '\n';
}
When using g++ the program gives the following output:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24
That's 50% memory overhead! In a 3-gigabyte array of 134'217'728 Xs 1 gigabyte would be pure padding.
Fortunately, the solution to the problem is very simple - we simply have to swap double b and int c around:
struct X
{
int a;
int c;
double b;
};
Now the result is much more satisfying:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16
There is however a problem: this isn't cross-compatible. Yes, under g++ an int is 4 bytes and a double is 8 bytes, but that's not necessarily always true (their alignment doesn't have to be the same either), so under a different environment this "fix" could not only be useless, but it could also potentially make things worse by increasing the amount of padding needed.
Is there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)? Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Clarification
Due to misunderstanding and confusion, I'd like to emphasize that I don't want to "pack" my struct. That is, I don't want its members to be unaligned and thus slower to access. Instead, I still want all members to be self-aligned, but in a way that uses the least memory on padding. This could be solved by using, for example, manual rearrangement as described here and in The Lost Art of Packing by Eric Raymond. I am looking for an automated way to do this that is as cross-platform as possible, similar to what is described in proposal P1112 for the upcoming C++20 standard.

(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)
Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.
Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.
Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.
Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.
then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.
Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.
then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.
then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).
ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.
People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.
then short / int16_t
then char / int8_t / bool
(for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)
(For unsigned integer types, find the corresponding signed type in my list.)
A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
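As a rough illustration of the ordering rule above, here is a minimal sketch that declares the same six members twice. The names Unsorted and Sorted are made up for this example. On x86-64 Linux with GCC I'd expect sizes of 40 and 24 bytes, but those numbers are an assumption about that one ABI, so the code only asserts the relative ordering.

```cpp
#include <cstddef>

// Members in an arbitrary order: padding is needed before every member
// whose alignment is larger than the offset reached so far.
struct Unsorted {
    char   c;
    double d;
    short  s;
    void*  p;
    bool   b;
    int    i;
};

// Same members, ordered from largest to smallest expected alignof().
struct Sorted {
    double d;
    void*  p;
    int    i;
    short  s;
    char   c;
    bool   b;
};

// Holds on ABIs where alignment never increases going down the Sorted list.
static_assert(sizeof(Sorted) <= sizeof(Unsorted),
              "reordering should not make the struct larger");
```

If the assertion ever fires on some target, that is a hint the platform's alignments don't follow the usual "bigger type, bigger alignment" pattern and the layout deserves a second look there.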
Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.
And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).
Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.
This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte aligned memory because alignof(max_align_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.
Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.
Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.
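To make the displacement point concrete, here is a small sketch with a hypothetical Connection type (the name and members are invented for illustration). The frequently used members come first and the large buffer last, and offsetof checks that the hot members stay within the range a one-byte x86 displacement can reach.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical node type: small, frequently used members first,
// the large buffer last.
struct Connection {
    Connection*   next;
    std::uint32_t flags;
    std::uint32_t length;
    char          payload[4096];
};

// x86 can encode offsets in [-128, +127] as a single displacement byte.
static_assert(offsetof(Connection, next)   <= 127, "next outside disp8 range");
static_assert(offsetof(Connection, flags)  <= 127, "flags outside disp8 range");
static_assert(offsetof(Connection, length) <= 127, "length outside disp8 range");
```

Had payload been declared first, every access to flags or length would need a 4-byte displacement instead.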
Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.
He also makes another important point:
9. Readability and cache locality
While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.
In a large struct that can easily be split across a cache-line boundary, it makes sense to put 2 things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaligned) integer or SIMD load/store instead of separately loading smaller members.
Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).
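For the false-sharing side of this, a minimal sketch: give each per-thread counter its own cache line with alignas. The constant kCacheLine and the type PerThreadCounter are names I made up, and 64 bytes is an assumption (typical for current x86); with the adjacent-line prefetcher in mind you might choose 128 instead.

```cpp
#include <cstddef>

// Assumed cache-line size; 64 bytes is typical for current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Each counter occupies a whole cache line, so threads writing to
// different counters do not invalidate each other's lines.
struct alignas(kCacheLine) PerThreadCounter {
    unsigned long value = 0;
};

static_assert(alignof(PerThreadCounter) == kCacheLine, "alignment not applied");
static_assert(sizeof(PerThreadCounter) == kCacheLine, "size is rounded up to the alignment");
```

This deliberately wastes space to buy separation, which is the opposite trade from the rest of this answer. Note that heap-allocating over-aligned types with plain new only respects the alignment from C++17 onwards.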
Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.
(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)

gcc has the -Wpadded warning that warns when padding is added to a structure:
https://godbolt.org/z/iwO5Q3:
<source>:4:12: warning: padding struct to align 'X::b' [-Wpadded]
4 | double b;
| ^
<source>:1:8: warning: padding struct size to alignment boundary [-Wpadded]
1 | struct X
| ^
And you can manually rearrange members so that there is less or no padding. But this is not a cross-platform solution, as different types can have different sizes and alignments on different systems (most notably, pointers are 4 or 8 bytes on different architectures). The general rule of thumb is to go from largest to smallest alignment when declaring members, and if you're still worried, compile your code with -Wpadded once (but I wouldn't keep it on generally, because padding is sometimes necessary).
As for why the compiler can't do it automatically: the standard ([class.mem]/19) forbids it. It guarantees that, because this is a simple struct with only public members, &x.a < &x.c (for some X x;), so they can't be rearranged.
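The ordering guarantee can be written down as compile-time checks. This is only a sketch using offsetof on the X from the question; it shows what the compiler must preserve, not how much padding it inserts.

```cpp
#include <cstddef>

struct X {
    int    a;
    double b;
    int    c;
};

// Declaration order is preserved in memory for members with the same
// access control, so later members always get higher offsets.
static_assert(offsetof(X, a) == 0, "first member of a standard-layout struct is at offset 0");
static_assert(offsetof(X, a) < offsetof(X, b), "b must follow a");
static_assert(offsetof(X, b) < offsetof(X, c), "c must follow b");
```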

There really isn't a portable solution in the generic case. Barring the minimal requirements the standard imposes, types can be any size the implementation wants to make them.
To go along with that, the compiler is not allowed to reorder class members to make the layout more efficient. The standard mandates that the objects must be laid out in their declared order (by access modifier), so that's out as well.
You can use fixed width types like
struct foo
{
int64_t a;
int16_t b;
int8_t c;
int8_t d;
};
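and the member sizes and offsets will be the same on all platforms that supply those types (the total size can still differ through tail padding, because alignof(int64_t) varies), but it only works with integer types. Until C++23 added the optional types in <stdfloat>, there were no fixed-width floating-point types, and many standard objects/containers can be different sizes on different platforms.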
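A sketch of how one might pin this down with static_assert. The offsets below assume 8-bit bytes and that no fixed-width type is aligned more strictly than its size, which is true on the mainstream ABIs I know of. The total size is deliberately not fixed to one value: the members occupy 12 bytes, and the struct is then rounded up to a multiple of its alignment.

```cpp
#include <cstddef>
#include <cstdint>

struct foo {
    std::int64_t a;
    std::int16_t b;
    std::int8_t  c;
    std::int8_t  d;
};

// Each member directly follows the previous one, because every offset is
// already a multiple of that member's size.
static_assert(offsetof(foo, a) == 0,  "a moved");
static_assert(offsetof(foo, b) == 8,  "b moved");
static_assert(offsetof(foo, c) == 10, "c moved");
static_assert(offsetof(foo, d) == 11, "d moved");

// The total size may include tail padding up to alignof(foo).
static_assert(sizeof(foo) >= 12, "members alone need 12 bytes");
static_assert(sizeof(foo) % alignof(foo) == 0, "size is a multiple of alignment");
```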

Mate, if you have 3 GB of data, you should probably approach the issue in some other way than swapping data members.
Instead of using 'array of struct', 'struct of arrays' could be used.
So say
struct X
{
int a;
double b;
int c;
};
constexpr size_t ArraySize = 1'000'000;
X my_data[ArraySize];
would become
constexpr size_t ArraySize = 1'000'000;
struct X
{
int a[ArraySize];
double b[ArraySize];
int c[ArraySize];
};
X my_data;
Each element is still easily accessible: my_data.a[i] = 5; my_data.b[i] = 1.5; ...
There is no padding (except possibly a few bytes between arrays). The memory layout is cache-friendly. The prefetcher handles reading sequential memory blocks from a few separate memory regions.
That's not as unorthodox as it might look at first glance. That approach is widely used for SIMD and GPU programming.
Array of Structures (AoS), Structure of Arrays
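Here is a variation on the same idea as a minimal sketch. It swaps the fixed-size member arrays for std::vector, so the element count can be chosen at run time and the data lives on the heap; XArrays, kAosBytes and kSoaBytes are names invented for this example.

```cpp
#include <cstddef>
#include <vector>

struct X { int a; double b; int c; };   // array-of-structures element

// Structure of arrays: one contiguous array per field.
struct XArrays {
    std::vector<int>    a;
    std::vector<double> b;
    std::vector<int>    c;

    explicit XArrays(std::size_t n) : a(n), b(n), c(n) {}

    std::size_t size() const { return a.size(); }
};

// Bytes stored per element in each layout.
constexpr std::size_t kAosBytes = sizeof(X);
constexpr std::size_t kSoaBytes = 2 * sizeof(int) + sizeof(double);
static_assert(kSoaBytes <= kAosBytes, "SoA stores no per-element padding");
```

Usage stays close to the array version: construct XArrays data(1'000'000); and then write data.a[i] = 5; data.b[i] = 1.5;.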

This is a textbook memory-vs-speed problem. The padding is to trade memory for speed. You can't say:
I don't want to "pack" my struct.
because pragma pack is the tool invented exactly to make this trade the other way: speed for memory.
Is there a reliable cross-platform way
No, there can't be any. Alignment is strictly platform-dependent issue. Sizeof different types is a platform-dependent issue. Avoiding padding by reorganizing is platform-dependent squared.
Speed, memory, and cross-platform - you can have only two.
Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Because the C++ specifications specifically guarantee that the compiler won't mess up your meticulously organized structs. Imagine you have four floats in a row. Sometimes you use them by name, and sometimes you pass them to a method that takes a float[3] parameter.
You're proposing that the compiler should shuffle them around, potentially breaking all the code since the 1970s. And for what reason? Can you guarantee that every programmer ever will actually want to save your 8 bytes per struct? I, for one, am sure that if I have a 3 GB array, I have bigger problems than a GB more or less.

Although the Standard grants implementations broad discretion to insert arbitrary amounts of space between structure members, that's because the authors didn't want to try to guess all the situations where padding might be useful, and the principle "don't waste space for no reason" was considered self-evident.
In practice, almost every commonplace implementation for commonplace hardware will use primitive objects whose size is a power of two, and whose required alignment is a power of two that is no larger than the size. Further, almost every such implementation will place each member of a struct at the first available multiple of its alignment that completely follows the previous member.
Some pedants will squawk that code which exploits that behavior is "non-portable". To them I would reply
C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine specific code is one of the strengths of C.
As a slight extension to that principle, the ability of code which need only run on 90% of machines to exploit features common to that 90% of machines--even though such code wouldn't exactly be "machine-specific"--is one of the strengths of C. The notion that C programmers shouldn't be expected to bend over backward to accommodate limitations of architectures which for decades have only been used in museums should be self-evident, but apparently isn't.

You can use #pragma pack(1), but the compiler adds padding for a reason: it is an optimization. Accessing a naturally aligned variable with a full-width load is faster than accessing one placed at an arbitrary byte offset.
Specific packing is only useful for serialization and intercompiler compatibility, etc.
As NathanOliver correctly added, this might even fail on some platforms.
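For comparison, a small sketch of what #pragma pack does to a layout. The pragma is a compiler extension (GCC, Clang and MSVC accept this push/pop form), and Natural and Packed are names made up for the example.

```cpp
#include <cstdint>

struct Natural {
    char          tag;
    std::uint32_t value;   // typically preceded by 3 padding bytes
};

#pragma pack(push, 1)      // compiler extension, not standard C++
struct Packed {
    char          tag;
    std::uint32_t value;   // at offset 1, so usually misaligned
};
#pragma pack(pop)

static_assert(sizeof(Packed) == sizeof(char) + sizeof(std::uint32_t),
              "packed struct has no padding");
static_assert(sizeof(Natural) >= sizeof(Packed),
              "natural layout is never smaller");
```

Reading p.value through the struct is fine because the compiler knows the member may be misaligned; taking its address and passing it around as a plain std::uint32_t* is where things can fail on stricter platforms.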

Related

How to ensure certain struct layout across compilations?

The C++ standard says nothing about packing and padding of structs, because it is implementation defined.
If it is implementation defined, then for example, why it is safe to pass a struct to a DLL, if this DLL could have been compiled with a different compiler, which could have different methods for struct padding?
Is the struct padding method enforced by the OS's ABI (for example, the padding will be the same on all Windows platforms)?
Or, is there standard method for padding when compiling for a PC (x64 or x86_64 systems) that is used in every modern compiler?
If there is nothing that can guarantee the layout of variables, then is it safe to assume that each basic type in C++ (char, all numeric variables and pointers) must be aligned to an address that is a multiple of its size, and because of that, padding inside a struct can be done by hand without performance problems or UB?
From what I have checked, g++ compiles structs in such a way, that it inserts minimum amount of padding, just to ensure alignment of the next variable.
For example:
struct foo
{
char a;
// char _padding1[3]; <- inserted by compiler
uint32_t b;
};
There are 3 bytes of padding after a because that is the minimum amount that will give us a suitably aligned address for b.
Can we take for granted that compilers will do this that way? Or, can we force this kind of padding by hand without UB or performance issues?
By hand, I mean:
#pragma pack(1)
struct foo
{
char a;
char _padding1[3]; //<- manually adding padding bytes
uint32_t b;
};
#pragma pack()
Just to be clear: I am asking about behavior of compilers only on PC platforms : Windows, Linux distros, and maybe MacOS.
Sorry if my question is in category of "you dig into this too much". I just couldn't find a satisfying answer on the Internet. Some people say that it is not guaranteed. Others say that compiling with different compilers on systems that use the same ABI guarantee that the same struct will have the same layout. Others show how to reduce struct padding assuming that compilers pack structs the way that I described above (it is with minimum required padding to align variables).
If it is implementation defined, then for example, why it is safe to pass struct to dll
Because the dll and the caller follow the same Application binary interface (ABI) that defines the layout.
By the way, DLLs are a language extension and not part of standard C++.
if this dll could have been compiled with different compiler, which could have different method for struct padding?
If the library and the dependent don't follow an intercompatible ABI, then they cannot work together.
Is the struct padding method enforced by the OS's ABI
Yes, class layout (structs are classes) is defined by the ABI.
For example padding will be the same on all Windows platforms
Not quite, since Windows on ARM has a different ABI for example. But within the same CPU architecture, the layout would be the same in Windows.
Or is there standard method for padding when compiling for PC (x64 or x86_64 systems) that is used in every modern compiler?
No, there is no universal class layout followed by OS, even within x86_64 architecture.
From what I checked, g++ compiles structs in such way, that it inserts minimum amount of padding, just to ensure alignment of next variable.
All objects in C++ must be aligned as per the alignment requirement of the type of the object. This guarantee isn't compiler specific. However alignment requirements of types - and even the sizes of types - vary across different ABIs.
Bonus info: Compilers have language extensions that remove such guarantee.
There are 3 bytes of padding after a because it is minimum amount that will give us suitably aligned address for b. Can we take for granted that compilers will do this that way?
In general no. On some systems, alignof(std::uint32_t) == 1 in which case there wouldn't be need for any padding.
Within a single ABI, you can take for granted that the layout is the same, but across multiple systems - which might not follow the same ABI - you cannot take it for granted.
When dealing with binary layout across systems (for example, when reading from a file or network), the standard compliant way is to treat the data as an array of bytes [1], and to copy each sequence of bytes [2] from pre-determined offsets onto fixed width [3] fundamental objects (not classes whose layout may differ). In practice, you don't need to care about sign representation although that used to be a problem historically.
If the optimiser does its job, there ideally shouldn't be any performance penalty if the layout of input data matches the native layout. In case it doesn't match, then there may be a cost (compared to a matching layout) that cannot be optimised away.
[1] This isn't sufficient when byte size differs across systems, but you don't need to worry about that since you care about x86_64 only.
[2] In order to support systems with varying byte endianness, you must interpret the bytes in order of their significance rather than memory order, but you don't need to worry about that since you care about x86_64 only.
[3] I.e. not int, short, long etc., but rather std::int32_t etc.
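A minimal sketch of that byte-wise approach, applied to the foo from the question. The wire format here (one byte for a, then four bytes for b, least significant byte first) is a made-up assumption for illustration, as are the helper names load_u32_le and decode_foo.

```cpp
#include <cstddef>
#include <cstdint>

// In-memory representation; its padding is irrelevant to the wire format.
struct foo {
    char          a;
    std::uint32_t b;
};

// Hypothetical wire format: 1 byte for a, then 4 bytes for b, little-endian.
constexpr std::size_t kWireSize = 5;

// Assemble a 32-bit value from bytes in order of significance, so the
// result does not depend on the host's endianness.
inline std::uint32_t load_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode one record from kWireSize bytes at pre-determined offsets.
inline foo decode_foo(const unsigned char* bytes) {
    foo out;
    out.a = static_cast<char>(bytes[0]);
    out.b = load_u32_le(bytes + 1);
    return out;
}
```

Because nothing here depends on offsetof or sizeof of foo, the struct can be padded however the ABI likes without affecting the stored format.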
The C and C++ standards were written to describe existing languages. In situations where 99+% of implementations would do things a certain way, and it was obvious that implementations should do things that way absent a compelling reason for doing otherwise, the standards would generally leave open the possibility of implementations doing something unusual.
Consider, for example, given something like:
struct foo {int i; char a,b[4],c,d,e;}; // Assume sizeof (int) is 4
struct foo myFoo;
On most platforms, making foo be a three-word type which contains all of the individual bytes packed together may be more efficient than doing anything else. On the other hand, on a platform that uses word-addressed storage, but includes instructions to load or store bytes at a specified byte offset from a specified word address, word-aligning the start of b may allow a construct like myFoo.b[i] to be processed by directly using the value of i as an offset from the word-aligned address of myFoo.b.
The standards were designed to let people designing compilers for such platforms weigh the pros and cons of following normal practice versus deviating from it to better fit the target architecture.
Machines that use word addresses but allow byte-based loads and stores are of course exceptionally rare, and there is very little code, other than code deliberately written for such machines, for which compatibility with them would offer any added value whatsoever.
The committees weren't willing to say that such machines should be viewed as archaic and not worth supporting, but that doesn't mean they didn't expect and intend that programs written for commonplace implementations could exploit aspects of behavior that were shared by all commonplace implementations, even if not by some obscure ones.

How stable is C/C++ structure padding under the AAPCS (ARM ABI)?

Question
The C99 standard tells us:
There may be unnamed padding within a structure object, but not at its beginning.
and
There may be unnamed padding at the end of a structure or union.
I am assuming this applies to any of the C++ standards too, but I have not checked them.
Let's assume a C/C++ application (i.e. both languages are used in the application) running on an ARM Cortex-M would store some persistent data on a local medium (a serial NOR-flash chip for instance), and read it back after power cycling, possibly after an upgrade of the application itself in the future. The upgraded application may have been compiled with an upgraded compiler (we assume gcc).
Let's further assume that the developer is lazy (that's not me, of course), and directly streams some plain C or C++ structs to flash, instead of first serializing them as any paranoid experienced developer would do.
In fact, the developer in question is lazy, but not totally ignorant, since he has read the AAPCS (Procedure Call Standard for the Arm Architecture).
His rationale, besides laziness, is the following:
He does not want to pack the structs to avoid misalignment problems in the rest of the application.
The AAPCS specifies a fixed alignment for every single fundamental data type.
The only rational motivation for padding is to achieve proper alignment.
Therefore, he thinks, padding (and therefore member offsetof and total sizeof) is fully determined for any C or C++ struct by the AAPCS.
Therefore, he further reasons, there is no way my application would not be able to interpret some read back data that an earlier version of the same application would have written (assuming, of course, that the offset of the data in flash memory has not changed between writing and reading).
However, the developer has a conscience and he is a little worried:
The C standard does not mention any reason for padding. Achieving proper alignment may be the only rational reason for padding, but compilers are free to pad as much as they want, according to the standard.
How can he be sure that his compiler really follows the AAPCS?
Could his assumptions suddenly be broken by some apparently unrelated compiler flag that he would start using, or by a compiler upgrade?
My question is: how dangerously does that lazy developer live? In other words, how stable is padding in C/C++ structs under the assumptions above?
Conclusion
Two weeks after this question was asked, the only answer that has been received does not really answer the asked question. I have also asked the exact same question on an ARM community forum, but got no answer at all.
I however choose to accept 3246135 as the answer because:
I take the absence of proper answer as very relevant information for this case. The correctness of solutions to software problems should be obvious. The assumptions made in my question may be true, but I cannot easily prove it. Additionally, if the assumptions are incorrect, the consequences, in the general case, could be catastrophic.
Compared to the risk, the burden on the developer when using the strategy exposed in the answer seems very reasonable. Assuming a constant endianness (which is quite easy to enforce), it is a hundred percent-safe (any deviation will generate an error at compile-time) and it is much lighter than a full-blown serialization. Basically, the strategy exposed in the answer is a mandatory minimum price to pay in order to make one's C/C++ structs persistent independently of any ABI.
If you are a developer asking yourself the question above, please do not be lazy, and use instead the strategy exposed in the accepted answer, or an alternative strategy that guarantees a constant padding across software releases.
You can never be 100% sure that the compiler won't introduce padding in some capacity. However, you can mitigate the risks by following a few rules:
Use fixed size types for all members, i.e. uint32_t, int64_t, etc.
Start each member at an offset that is a multiple of the member's size (or if the member is an array / struct, the size of the largest member).
Avoid bitfields
Note that doing this will likely introduce some explicit padding fields to satisfy alignment.
For example:
struct orig {
int a;
char b;
int c[10];
short d;
char e[15];
long f;
int g;
};
The size of this struct's members, assuming sizeof(short) == 2, sizeof(int) == 4, and sizeof(long) == 8, would be 74. If you take into account likely padding:
struct orig_padded {
int a;
char b;
char pad1[3];
int c[10];
short d;
char e[15];
char pad2[7];
long f;
int g;
char pad3[4];
};
You have a struct size of 88.
With some rearranging we can eliminate the padding between members. The members still total 74 bytes, and the struct size drops to 80 once the tail padding required by the 8-byte alignment of int64_t is counted:
struct reordered {
int64_t f;
int32_t a;
int32_t c[10];
int32_t g;
int16_t d;
char b;
char e[15];
};
By ordering the fields in descending order of size, we remove the padding between the fields and leave only potential padding at the end (here 6 bytes, assuming alignof(int64_t) == 8). Note also the use of fixed-size types to avoid some surprises. Then as a safeguard, we add:
static_assert(sizeof(struct reordered) == 80, "unexpected padding in struct reordered");
So if the compiled size of the struct ever changes, you'll know at compile time.
For more details, take a look at The Lost Art of Structure Packing.
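If C++17 is available on the C++ side, one further safeguard is worth sketching: std::has_unique_object_representations is false for a struct that contains padding bytes, so it can catch hidden padding rather than just a changed size. Record and its members are hypothetical names, and the 16-byte size assumes the fixed-width types have the usual sizes and alignments.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical persisted record; the tail padding is spelled out as a
// named member so that no byte of the object is left unspecified.
struct Record {
    std::int64_t  timestamp;
    std::int32_t  value;
    std::int16_t  kind;
    std::uint8_t  flags;
    std::uint8_t  reserved;   // explicit padding, always written as 0
};

static_assert(std::is_trivially_copyable<Record>::value,
              "Record must be copyable as raw bytes");
static_assert(sizeof(Record) == 16, "Record layout changed");
// C++17: fails if the compiler inserted any padding into Record.
static_assert(std::has_unique_object_representations_v<Record>,
              "Record contains hidden padding");
```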

Is explicit alignment necessary?

After some readings, I understand that compiler has done the padding for structs or classes such that each member can be accessed on its natural aligned boundary. So under what circumstance is it necessary for coders to make explicit alignment to achieve better performance? My question arises from here:
Intel 64 and IA-32 Architectures Optimization Reference Manual:
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned 4-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
So suppose I have a struct:
struct A
{
int a;
int b;
int c;
};
// size = 12;
// aligned on boundary of: 4
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
Is it because of cache line splits? Assuming the cache line is 64 bytes, the 6th object in the array occupies bytes 60 to 71 and so straddles two cache lines. Does that slow down the program?
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
Let me answer your question directly: No, there is no need to explicitly align data in C++ for performance.
Any decent compiler will properly align the data for the underlying system.
The problem would come (variation on above) if you had:
struct B
{
int w ;
char x ;
int y ;
char z ;
};
This illustrates the two common structure alignment problems.
It is likely that a compiler would insert 3 alignment bytes after both x and z. (1) If there is no padding after x, y is unaligned. (2) If there is no padding after z, w and y will be unaligned in arrays.
The instructions you are reading in the manual are targeted towards assembly language programmers and compiler writers.
When data is unaligned, on some systems (not Intel) it causes an exception, and on others it takes multiple processor cycles to fetch and write the data.
The only time I can think of when you want explicit alignment is when you are directly copying/casting data between your struct and a char* for serialization in some type of binary protocol.
Here unexpected padding may cause problems with a remote user of your protocol.
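For example (using #pragma pack, a compiler extension supported by GCC, Clang, and MSVC):
#include <cstddef>
#include <cstring>
#pragma pack(push, 1)
struct Data
{
char code[3];
int val;
};
#pragma pack(pop)
void send(const char* buf, std::size_t len); // stand-in for the real transport call
void transmit()
{
Data data = { "AB", 24 };
char buf[20];
std::memcpy(buf, &data, sizeof(data)); // copies exactly 7 bytes when int is 4 bytes
send(buf, sizeof(data));
}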
Now if our protocol expects 3 octets of code followed by a 4-octet integer value for val, we will run into problems if the struct above is not packed, since padding would be inserted between code and val. The only way to get this to work is for the struct above to be packed (alignment 1).
There is indeed a facility in the language (it's not a macro, and it's not from the standard library) to tell you the alignment of an object or type. It's alignof (see also: std::alignment_of).
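A short sketch of both spellings, applied to the A from the question. Note that alignof is evaluated at compile time for the target the code is built for, not queried on the running machine; kAlignA is just a name chosen for this example.

```cpp
#include <cstddef>
#include <type_traits>

struct A {
    int a;
    int b;
    int c;
};

// alignof is an operator evaluated at compile time; it yields a std::size_t.
constexpr std::size_t kAlignA = alignof(A);

// On the usual ABIs a struct is exactly as aligned as its most-aligned member.
static_assert(kAlignA == alignof(int), "unexpected alignment for A");
static_assert(std::alignment_of<A>::value == alignof(A), "trait and operator agree");
static_assert(sizeof(A) % alignof(A) == 0, "array elements stay aligned");
```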
To answer your question: In general you should not be concerned with alignment. The compiler will take care of it for you, and in general/most cases it knows much, much better than you do how to align your data.
The only case where you'd need to fiddle with alignment (see alignas specifier) is when you're writing some code which allows some possibly less aligned data type to be the backing store for some possibly more aligned data type.
Examples of things that do this under the hood are std::experimental::optional and boost::variant. There are also facilities in the standard library explicitly for creating such a backing store, namely std::aligned_storage and std::aligned_union (both deprecated as of C++23).
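To show what such a backing store looks like, here is a minimal sketch: a byte buffer declared with alignas(T) so that a T can be constructed in it with placement new. Slot, emplace and destroy are illustrative names, not the internals of any particular library.

```cpp
#include <new>
#include <utility>

// A raw byte buffer that is suitably sized and aligned to hold one T.
template <typename T>
struct Slot {
    alignas(T) unsigned char bytes[sizeof(T)];

    // Construct a T inside the buffer with placement new.
    template <typename... Args>
    T* emplace(Args&&... args) {
        return ::new (static_cast<void*>(bytes)) T(std::forward<Args>(args)...);
    }

    // The caller is responsible for ending the object's lifetime.
    void destroy(T* p) { p->~T(); }
};
```

Without the alignas, the unsigned char array would only be guaranteed byte alignment, and constructing a double or a SIMD type in it could produce a misaligned object.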
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
The ABI only describes how to use the data elements it defines. The guideline doesn't apply to your struct.
Is it because of cache line splits? Assuming the cache line is 64 bytes, the 6th object in the array occupies bytes 60 to 71 and so straddles two cache lines. Does that slow down the program?
The cache question could go either way. If your algorithm randomly accesses the array and touches all of a, b, and c then alignment of the entire structure to a 16-byte boundary would improve performance, because fetching any of a, b, or c from memory would always fetch the other two. However if only linear access is used or random accesses only touch one of the members, 16-byte alignment would waste cache capacity and memory bandwidth, decreasing performance.
Exhaustive analysis isn't really necessary. You can just try and see what alignas does for performance. (Or add a dummy member, pre-C++11.)
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
C++11 (and C11) have an alignof operator.

Are float arrays always aligned to 16 byte boundaries?

My understanding is that you have to explicitly specify the alignment of an array, if you want it aligned.
However, the float arrays I declare always seem to be aligned to 16 bytes.
float *ptr1 = new float[1];
cout<<"ptr1: "<<ptr1<<endl;
float *ptr2 = new float[3];
cout<<"ptr2: "<<ptr2<<endl;
float arr1[7];
cout<<"arr1: "<<arr1<<endl;
float arr2[9] __attribute__((aligned(2)));
cout<<"arr2: "<<arr2<<endl;
Here is the output
ptr1: 0x13dc010
ptr2: 0x13dc030
arr1: 0x7fff874885c0
arr2: 0x7fff87488590
Is there a reason for this? I am using gcc 4.6.3
However, if it's a pointer to a float location or statically allocated, I don't see it:
static float arr3[9] __attribute__((aligned(2)));
cout<<"arr3: "<<arr3<<endl;
float *x;
cout<<"x: "<<x<<endl;
Output:
arr3: 0x4030b2
x: 0x7fff8c7dd9e8
This code was run on x64.
Alignment requirements are determined by each compiler, influenced by hardware requirements and any relevant ABI.
The C and C++ languages discuss alignment for types, but they don't impose any specific requirements (except that, for example, a structure's alignment is at least the alignment of any of its members). A valid implementation could permit all data types to be byte aligned, or it could require each scalar type to be aligned to its own size (the latter is more common). Intermediate alignments are possible, such as aligning 8-byte types on 4-byte boundaries.
On the x86 in particular, aligning scalars to their size makes for more efficient access, but misaligned accesses still work correctly, though a little more slowly.
The required alignment for an array of float is the same as the required alignment for a single float object. If float is 4 bytes, then that alignment can be no greater than 4 bytes, since arrays do not have gaps between their elements.
A particular compiler might choose to impose stricter alignment on array objects, as you've (probably) seen, if it makes access to those objects a little more efficient.
And if the new operator is implemented by calling malloc, then all new-allocated objects will have an alignment strict enough for any type.
If float arrays are always aligned to 16-byte boundaries, it's because your compiler chooses to allocate them that way, not because it's required by the language. On the other hand, if you use aliasing to force a float array to a 4-byte alignment (assuming sizeof (float) == 4), then accesses to that array and to its elements should still work correctly.
Incidentally, when I run your code (after wrapping it in a main program) on my x86_64 system, I get results similar to yours. When I run it on an x86 system, I get:
ptr1: 0x9e34008
ptr2: 0x9e34018
arr1: 0xbfefa160
arr2: 0xbfefa17c
I'm using gcc under Linux on both systems.
So the straightforward answer to your question is no, float arrays are not always aligned to 16-byte boundaries.
But in most cases there's no particular reason for you to care. Unless you play aliasing tricks (treating an object of some declared type as if it were of another type), the compiler will always give each object at least the alignment it needs for correct access.
I'm not sure I fully understand the question, but if you want a larger size I suggest using the type "double" instead of "float". "Float" generally has a size limit of 4 bytes (about 7 digits) whereas "double" usually has a size limit of 8 bytes (about 15 digits).
EDIT: Sorry, I misunderstood the question
That depends on the strictest fundamental alignment requirement of the platform, such as that of long double, which requires 16-byte alignment on my 64-bit Debian system.
Stack frame alignment may be affected by this too.

Is there a relation between integer and register sizes?

Recently, I was challenged in an interview with a string manipulation problem and asked to optimize for performance. I had to use an iterator to move back and forth between TCHAR characters (with UNICODE support, 2 bytes each).
Not really thinking of the array length, I made a crucial mistake by using an int rather than size_t to iterate. I understand this is not compliant and not secure.
int i, size = _tcslen(str);
for(i=0; i<size; i++){
// code here
}
But, the maximum memory we can allocate is limited. And if there is a relation between int and register sizes, it may be safe to use an integer.
E.g.: Without any virtual mapping tools, we can only map 2^register-size bytes. Since TCHAR is 2 bytes long, half of that number. For any system that has int as 32 bits, this is not going to be a problem even if you don't use an unsigned version of int. People with an embedded background used to think of int as 16 bits, but memory size will be restricted on such a device. So I wonder if there is an architectural fine-tuning decision between integer and register sizes.
The C++ standard doesn't specify the size of an int. (It says that sizeof(char) == 1, and sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long).)
So there doesn't have to be a relation to register size. A fully conforming C++ implementation could give you 256 byte integers on your PC with 32-bit registers. But it'd be inefficient.
So yes, in practice, the size of the int datatype is generally equal to the size of the CPU's general-purpose registers, since that is by far the most efficient option.
If an int was bigger than a register, then simple arithmetic operations would require more than one instruction, which would be costly. If they were smaller than a register, then loading and storing the values of a register would require the program to mask out the unused bits, to avoid overwriting other data. (That is why the int datatype is typically more efficient than short.)
(Some languages simply require an int to be 32-bit, in which case there is obviously no relation to register size --- other than that 32-bit is chosen because it is a common register size)
Going strictly by the standard, there is no guarantee as to how big or small an int is, much less any relation to the register size. Also, some architectures have different sizes of registers (i.e., not all registers on the CPU are the same size) and memory isn't always accessed using just one register (like DOS with its Segment:Offset addressing).
With all that said, however, in most cases int is the same size as the "regular" registers since it's supposed to be the most commonly used basic type and that's what CPUs are optimized to operate on.
AFAIK, there is no direct link between register size and the size of int.
However, since you know for which platform you're compiling the application, you can define your own type alias with the sizes you need:
Example
#ifdef WIN32 // Types for Win32 target
#define Int16 short
#define Int32 int
// .. etc.
#elif defined // for another target
Then, use the declared aliases.
I am not sure I understand this correctly, since several different problems (memory sizes, allocation, register sizes, performance?) are mixed here.
What I can say (going just by the headline) is that on most current processors you should use integers that match the register size for maximum speed. The reason is that smaller integers need less memory but, for example on the x86 architecture, require an additional instruction for conversion. Also, on Intel, accesses to unaligned memory (mostly relative to register-sized boundaries) incur a penalty. Of course, on today's processors things are even more complex, since CPUs can process instructions in parallel. So you end up fine-tuning for a particular architecture.
So the best guess speed-wise, without knowing the architecture, is to use register-sized ints, as long as you can afford the memory.
I don't have a copy of the standard, but my old copy of The C Programming Language says (section 2.2) int refers to "an integer, typically reflecting the natural size of integers on the host machine." My copy of The C++ Programming Language says (section 4.6) "the int type is supposed to be chosen to be the most suitable for holding and manipulating integers on a given computer."
You're not the only person to say "I'll admit that this is technically a flaw, but it's not really exploitable."
There are different kinds of registers with different sizes. What's important are the address registers, not the general purpose ones. If the machine is 64-bit, then the address registers (or some combination of them) must be 64-bits, even if the general-purpose registers are 32-bit. In this case, the compiler may have to do some extra work to actually compute 64-bit addresses using multiple general purpose registers.
If you don't think that hardware manufacturers ever make odd design choices for their registers, then you probably never had to deal with the original 8086 "real mode" addressing.