Strange size of class containing Eigen vectors

A C++ class containing two Eigen vectors has a strange size. I have an MWE of my problem here:
#include <iostream>
#include "Eigen/Core"
class test0 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,4,1> T;
};
class test1 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,3,1> T;
};
class test2 {
Eigen::Matrix<double,4,1> R;
Eigen::Matrix<double,2,1> T;
};
class test3 {
Eigen::Matrix<double,7,1> T;
};
class test4 {
Eigen::Matrix<double,3,1> T;
};
int main(int argc, char *argv[])
{
std::cout << sizeof(test0) << ", " << sizeof(test1) << ", " << sizeof(test2) << ", " << sizeof(test3) << ", " << sizeof(test4) << std::endl;
return 0;
}
The output I get on my system (MacBook Pro, Xcode Clang++ compiler) is:
64, 64, 48, 56, 24
The class "test1" has some bizarre extra padding - I would have expected it to have size 56. I don't understand the reason for it, especially given that none of the other classes have any padding. Can anyone explain, or is this an error?

This happens because of how the Eigen library is implemented, and it is not related to compiler tricks. The backing storage for Eigen::Matrix<double, 4, 1> has the EIGEN_ALIGN_TO_BOUNDARY(16) tag on it, which has compiler-specific definitions that ask the type to be aligned on a 16-byte boundary. To ensure this, the compiler has to add 8 bytes of padding at the end of the structure, since otherwise the first matrix field would not be aligned on a 16-byte boundary if you had an array of test1.
Eigen simply does not try to impose similar requirements on the backing storage of Eigen::Matrix<double, 7, 1>.
This happens in Eigen/src/Core/DenseStorage.
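The effect can be reproduced without Eigen. The following is a minimal sketch, assuming 8-byte doubles; Aligned4, Plain3 and Plain7 are hypothetical stand-ins for Eigen's storage types, with alignas(16) playing the role of the alignment macro. Eigen's real definitions are more involved and may differ between versions and build flags.

```cpp
#include <cstddef>

// Hypothetical stand-ins for Eigen's fixed-size storage (not Eigen's real types).
struct alignas(16) Aligned4 { double v[4]; };  // 32 bytes, 16-byte aligned
struct Plain3 { double v[3]; };                // 24 bytes, default alignment
struct Plain7 { double v[7]; };                // 56 bytes, default alignment

// Mirrors test1 from the question: the members need 32 + 24 = 56 bytes.
struct Test1Like { Aligned4 R; Plain3 T; };
// Mirrors test3 from the question: no over-aligned member at all.
struct Test3Like { Plain7 T; };

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");
// The struct inherits the strictest member alignment...
static_assert(alignof(Test1Like) == 16, "alignment comes from Aligned4");
// ...and its size must be a multiple of that alignment, so 56 becomes 64.
static_assert(sizeof(Test1Like) == 64, "56 rounded up to a multiple of 16");
static_assert(sizeof(Test3Like) == 56, "no rounding needed");
```

The size of a type is always a multiple of its alignment, because consecutive array elements must each be correctly aligned. That is the whole source of the extra 8 bytes.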

Padding requirements aren't mandated by the language, they're actually mandated by your processor architecture. Your class is being padded so it's 64 bytes wide. You can override this of course, but it's done so that structures sit neatly in memory and can be read efficiently, aligning to cache lines.
In what circumstances a structure is padded is a complex question, but generally speaking "memory is cheap, cycles are not". Modern computers have loads of memory, and performance gains are becoming harder to find as we approach the limits of copper, so trading some memory for performance is usually a good idea.
Some additional reading is available here.
Following up on the discussion in the comments, it's worth noting that the compiler isn't your god. Not every optimisation is a good idea, and even trivial changes to your code can have vast implications for some optimisations. If you don't like what your toolchain is producing and think you can do better, then do it! Take some benchmarks, make your changes and then measure again. As you do all of that, take note of how long you spend on it and then ask yourself: was that a good use of your or your employer's time? :)

Related

How to explain the value of sizeof(std::vector<int>)?

In order to understand the memory consumption of std::vector<int> I wrote:
std::cout << sizeof(std::vector<int>) << std::endl;
This yields 32. I tried to understand where this value comes from. A look at the source code revealed that std::vector stores the pointers _MyFirst, _MyLast and _MyEnd, which explains 24 bytes of memory consumption (on my 64-bit system).
What about the last 8 bytes? As I understand it, the stored allocator does not use any memory. Also, this might be implementation-defined (is it?), so maybe this helps: I am working with MSVC 2017.5. I cannot guarantee that I found all the members by looking at the code; it looks very obfuscated to me.
Everything seems to be nicely aligned, but might the answer be the following: Why isn't sizeof for a struct equal to the sum of sizeof of each member? I tested that with a simple struct Test { int *a, *b, *c; };, which satisfies sizeof(Test) == 24.
Some background
In my program, I will have a lot of vectors and it seems that most of them will be empty. This means that the critical memory consumption comes from their empty state, i.e. the heap-allocated memory is not so important.
A simple "just for this usecase"-vector is implemented pretty quickly, so I wondered if I am missing anything and I will need 32 bytes of memory anyway, even with my own implementation (note: I will most probably not implement my own, this is just curiosity).
Update
I tested it again with the following struct:
struct Test
{
int *a, *b, *c;
std::allocator<int> alloc;
};
which now gave sizeof(Test) == 32. It seems that even though std::allocator has no memory-consuming members (I think), its presence raises Test's size to 32 bytes.
I noticed that sizeof(std::allocator<int>) yields 1, but I thought this is how a compiler deals with empty structs and that this is optimized away when it is used as a member. But this seems to be a problem with my compiler.

The compiler cannot optimise away an empty member. It is explicitly forbidden by the standard.
Complete objects and member subobjects of an empty class type shall have nonzero size
An empty base class subobject, on the other hand, may have zero size. This is exactly how GCC/libstdc++ copes with the problem: it makes the vector implementation inherit the allocator.
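The difference between an empty member and an empty base can be checked directly. This is a minimal sketch with hypothetical type names (Empty, AsMember, AsBase); it is not how any particular standard library lays out std::vector.

```cpp
struct Empty {};

// As a member, the empty object must have nonzero size, and padding
// then rounds the struct up to the next multiple of the pointer alignment.
struct AsMember {
    int *a, *b, *c;
    Empty e;
};

// As a base class, the empty object may share its address with the first
// member, so it does not have to add anything to the size.
struct AsBase : Empty {
    int *a, *b, *c;
};

static_assert(sizeof(Empty) >= 1, "complete objects have nonzero size");
static_assert(sizeof(AsBase) == 3 * sizeof(int*), "empty base takes no space");
static_assert(sizeof(AsMember) > sizeof(AsBase), "empty member costs space");
```

On a typical 64-bit target this gives 32 bytes for AsMember and 24 bytes for AsBase, which matches the numbers in the question.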

There doesn't seem to be anything standardized about the data members of std::vector, thus you can assume it's implementation-defined.
You mention the three pointers, thus you can check the size of a class (or a struct) with three pointers as its data members.
I tried running this:
std::cout << sizeof(classWith3PtrsOnly) << " " << sizeof(std::vector<int>) << std::endl;
on Wandbox, and got:
24 24
which pretty much implies that the extra 8 bytes come from "padding added to satisfy alignment constraints".

I ran into the same question recently. Although I still have not figured out how std::vector does this optimization, I found a way around it with C++20.
C++ attribute: no_unique_address (since C++20)
struct Empty {};
struct NonEmpty {
int* p;
};
template<typename MayEmpty>
struct Test {
int* a;
[[no_unique_address]] MayEmpty mayEmpty;
};
static_assert(sizeof(Empty) == 1);
static_assert(sizeof(NonEmpty) == 8);
static_assert(sizeof(Test<Empty>) == 8);
static_assert(sizeof(Test<NonEmpty>) == 16);

If you ran the above test on Windows in a DEBUG build, then be aware that the "vector" implementation inherits from "_Vector_val", which has an additional pointer member in its _Container_base class (in addition to Myfirst, Mylast, Myend):
_Container_proxy* _Myproxy
It increases the vector class size from 24 to 32 bytes in DEBUG build only (where _ITERATOR_DEBUG_LEVEL == 2)

Type casting struct to integer and vice versa in C++

So, I've seen this thread Type casting struct to integer c++ about how to cast between integers and structs (bitfields), and undoubtedly, writing a proper conversion function or overloading the relevant casting operators is the way to go for any case where there is an operating system involved.
However, when writing firmware for a small embedded system where only one flash image is run, the case might be different, insofar as security isn't so much of a concern while performance is.
Since I can test whether the code works properly (meaning the bits of a bitfield are arranged the way I would expect them to be) each time when compiling my code, the answer might be different here.
So, my question is, whether there is a 'proper' way to convert between bitfield and unsigned int that does compile to no operations in g++ (maybe shifts will get optimised away when the compiler knows the bits are arranged correctly in memory).
This is an excerpt from the original question:
struct {
int part1 : 10;
int part2 : 6;
int part3 : 16;
} word;
I can then set part2 to be equal to whatever value is requested, and set the other parts as 0.
word.part1 = 0;
word.part2 = 9;
word.part3 = 0;
I now want to take that struct, and convert it into a single 32 bit integer. I do have it compiling by forcing the casting, but it does not seem like a very elegant or secure way of converting the data.
int x = *reinterpret_cast<int*>(&word);
EDIT:
Now, quite some time later, I have learned some things:
1) Type punning (changing the interpretation of data) by means of pointer casting has been undefined behaviour since C99 and C++98. These language versions introduced strict aliasing rules (they allow the compiler to assume that data is only accessed through pointers of a compatible type) to allow for better optimisations. In effect, the compiler does not need to preserve the ordering between accesses (or perform the off-type access at all). In most cases this does not seem to present an [immediate] problem, but at higher optimisation settings (for GCC, -O2 and above, which enable -fstrict-aliasing) it will become one.
For examples see https://blog.regehr.org/archives/959
2) Using unions for type punning also seems to involve undefined behaviour in C++ but not C (See https://stackoverflow.com/a/25672839/4360539), however GCC (and probably others) does explicitly allow it: (See https://gcc.gnu.org/bugs/#nonbugs).
3) The only really reliable way of doing type punning in C++ seems to be using memcpy to copy the data to a new location and perform whatever is to be done and then to use another memcpy to return the changes. I did read somewhere on SO, that GCC (or most compilers probably) should be able to optimise the memcpy to a mere register copy for register-sized data types, but I cannot find it again.
So probably the best thing to do here is to use the union, provided you can be sure the code is compiled by a compiler that supports type punning through a union. For other cases, further investigation would be needed into how the compiler treats bigger data structures and memcpy; if this really involves copying back and forth, sticking with bitwise operations is probably the best idea.
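The memcpy approach from point 3 and the bitwise alternative can be sketched as follows. The names Word, to_u32, from_u32, pack and self_check are hypothetical, and the bit-fields are declared unsigned here (unlike the int fields in the question) purely to keep the example free of sign issues. Which bits the compiler assigns to each bit-field is implementation-defined, so only the shift-based version has a layout you control.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: unsigned bit-fields, and a compiler that packs them into 4 bytes.
struct Word {
    std::uint32_t part1 : 10;
    std::uint32_t part2 : 6;
    std::uint32_t part3 : 16;
};
static_assert(sizeof(Word) == sizeof(std::uint32_t), "Word must fit in 32 bits");

// Well-defined type punning: copy the object representation.
// Optimising compilers commonly reduce this to a plain register move.
inline std::uint32_t to_u32(const Word& w) {
    std::uint32_t raw;
    std::memcpy(&raw, &w, sizeof raw);
    return raw;
}

inline Word from_u32(std::uint32_t raw) {
    Word w;
    std::memcpy(&w, &raw, sizeof w);
    return w;
}

// Portable alternative: the layout is spelled out with masks and shifts
// (part1 in bits 0-9, part2 in bits 10-15, part3 in bits 16-31).
constexpr std::uint32_t pack(std::uint32_t p1, std::uint32_t p2, std::uint32_t p3) {
    return (p1 & 0x3FFu) | ((p2 & 0x3Fu) << 10) | ((p3 & 0xFFFFu) << 16);
}

// Round trip check: the value survives struct -> integer -> struct.
inline void self_check() {
    Word w{};
    w.part2 = 9;
    assert(from_u32(to_u32(w)).part2 == 9);
}
```

Whether the memcpy calls really compile to nothing is worth confirming in the generated assembly for your own compiler and flags.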

union {
struct {
int part1: 10;
int part2: 6;
int part3: 16;
} parts;
int whole;
} word;
Then just use word.whole.

I had the same problem. I am guessing this is not very relevant today. But this is how I solved it:
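#include <cstring>
#include <iostream>
struct PACKED{
int x:10;
int y:10;
int z:12;
PACKED& operator=(int num )
{
std::memcpy(this, &num, sizeof num); // copy the bytes instead of casting this to int*, which breaks strict aliasing
return *this;
}
operator int() const
{
int value;
std::memcpy(&value, this, sizeof value); // same idea in the other direction
return value;
}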
} __attribute__((packed));
int main(void) {
std::cout << "size: " << sizeof(PACKED) << std::endl;
PACKED bf;
bf = 0xFFF00000;
std::cout << "Values ( x, y, z ) = " << bf.x << " " << bf.y << " " << bf.z << std::endl;
int testint;
testint = bf;
std::cout << "As integer: " << testint << std::endl;
return 0;
}
This now fits in an int, and is assignable from standard ints. However, I do not know how portable this solution is. The output of this is then:
size: 4
Values ( x, y, z ) = 0 0 -1
As integer: -1048576

How to use alignas to replace pragma pack?

I am trying to understand how alignas should be used, and I wonder if it can be a replacement for pragma pack. I have tried hard to verify it but with no luck. Using gcc 4.8.1 (http://ideone.com/04mxpI) I always get 8 bytes for the STestAlignas below, while with pragma pack it is 5 bytes. What I would like to achieve is to make sizeof(STestAlignas) return 5. I tried running this code on clang 3.3 (http://gcc.godbolt.org/) but I got an error:
!!error: requested alignment is less than minimum alignment of 8 for type 'long' - just below alignas usage.
So maybe there is a minimum alignment value for alignas?
Below is my test code:
#include <iostream>
#include <cstddef>
using namespace std;
#pragma pack(1)
struct STestPragmaPack {
char c;
long d;
} datasPP;
#pragma pack()
struct STestAttributPacked {
char c;
long d;
} __attribute__((packed)) datasAP;
struct STestAlignas {
char c;
alignas(char) long d;
} datasA;
int main() {
cout << "pragma pack = " << sizeof(datasPP) << endl;
cout << "attribute packed = " << sizeof(datasAP) << endl;
cout << "alignas = " << sizeof(datasA) << endl;
}
results for gcc 4.8.1:
pragma pack = 5
attribute packed = 5
alignas = 8
[26.08.2019]
It appears there is some standardisation movement in this topic. The p1112 proposal - Language support for class layout control - suggests adding (among others) a [[layout(smallest)]] attribute which shall reorder class members so as to make the alignment cost as small as possible (which is a common technique among programmers, but it often kills class definition readability). But this is not equal to what #pragma pack does!

alignas cannot replace #pragma pack.
GCC accepts the alignas declaration, but still keeps the member properly aligned: satisfying the strictest alignment requirement (in this case, the alignment of long) also satisfies the requirement you specified.
However, GCC is too lenient as the standard actually explicitly forbids this in §7.6.2, paragraph 5:
The combined effect of all alignment-specifiers in a declaration shall not specify an alignment that is less strict than the alignment that would be required for the entity being declared if all alignment-specifiers were omitted (including those in other declarations).
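The distinction can be sketched like this: alignas may only make alignment stricter, while packing is a separate, compiler-specific mechanism. The struct names are hypothetical, #pragma pack is an extension (supported by GCC, Clang and MSVC, among others), and the last assertion assumes a target where long is aligned to its own size.

```cpp
#include <cstddef>

// Allowed: 16 is stricter than the natural alignment of char.
struct alignas(16) OverAligned { char c; };

// Compiler extension, not standard C++: removes padding between members.
#pragma pack(push, 1)
struct Packed { char c; long d; };
#pragma pack(pop)

// Default layout for comparison: d is placed at its natural alignment.
struct Natural { char c; long d; };

static_assert(alignof(OverAligned) == 16, "alignas raised the alignment");
static_assert(sizeof(OverAligned) == 16, "size is a multiple of alignment");
static_assert(sizeof(Packed) == 1 + sizeof(long), "no padding after c");
static_assert(alignof(Packed) == 1, "packing lowers the alignment to 1");
static_assert(sizeof(Natural) == 2 * sizeof(long), "padding inserted after c");
```

Writing alignas(char) on the long member, as in the question, is the one combination the quoted paragraph rules out, so it is deliberately not part of this sketch.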

I suppose you know that working with unaligned or misaligned data has risks and costs.
For instance, retrieving a misaligned 5-byte data structure is more time-expensive than retrieving an 8-byte aligned one. This is because, if your 5 "... byte data does not start on one of those 4 byte boundaries, the computer must read the memory twice, and then assemble the 4 bytes to a single register internally" (1).
Working with unaligned data requires more operations and results in more time (and power) consumed by the ECU.
Please consider that both C and C++ are conceived to be "hardware friendly" languages, which means not only "minimum memory usage" languages, but principally languages focused on efficiency and fast processing. Data alignment (when it is not strictly required for "what I need to store") is a concept that implies another one: "many times, software and hardware are similar to life: you require sacrifices to reach better results!".
Please also consider asking yourself whether you are making a wrong assumption. Something like: "smaller structures => faster processing". If this were the case, you might be (totally) wrong.
But if we suppose that your point is something like this: you do not care at all about the efficiency, power consumption and speed of your software, and you are just obsessed (because of your hardware limitations or just because of theoretical interest) with "minimum memory usage", then perhaps you might find the following readings useful:
(1) Declare, manipulate and access unaligned memory in C++
(2) C Avoiding Alignment Issues
BUT, please, be sure to read the following ones:
(3) What does the standard say about unaligned memory access?
Which redirects to this snippet from the Standard:
(4) http://eel.is/c++draft/basic.life#1
(5) Unaligned memory access: is it defined behavior or not? [Which is duplicated but, maybe, with some extra information].

Unfortunately, alignment is not guaranteed in either C++11 or C++14.
But it is effectively guaranteed in C++17.
Please, check this excellent work from Bartlomiej Filipek:
https://www.bfilipek.com/2019/08/newnew-align.html

Data member offset in C++

"Inside the C++ Object Model" says that the offset of a data member in a class is always 1 more than the actual offset in order to distinguish between the pointer to 0 and the pointer to the first data member, here is the example:
class Point3d {
public:
virtual ~Point3d();
public:
static Point3d origin;
float x, y, z;
};
//to be used after, ignore it for the first question
int main(void) {
/*cout << "&Point3d::x = " << &Point3d::x << endl;
cout << "&Point3d::y = " << &Point3d::y << endl;
cout << "&Point3d::z = " << &Point3d::z << endl;*/
printf("&Point3d::x = %p\n", &Point3d::x);
printf("&Point3d::y = %p\n", &Point3d::y);
printf("&Point3d::z = %p\n", &Point3d::z);
getchar();
}
So in order to distinguish the two pointers below, the offset of a data member is always 1 more.
float Point3d::*p1 = 0;
float Point3d::*p2 = &Point3d::x;
The main function above is an attempt to get the offsets of the members to verify this argument; it is supposed to output 5, 9, 13 (considering the 4-byte vptr at the beginning). In MS Visual Studio 2012, however, the output is:
&Point3d::x = 00000004
&Point3d::y = 00000008
&Point3d::z = 0000000C
Question: So did the MS C++ compiler perform some optimization or something to prevent this mechanism?

tl;dr
Inside the C++ Object Model is a very old book, and most of its contents are implementation details of a particular compiler anyway. Don't worry about comparing your compiler to some ancient compiler.
Full version
An answer to the question linked to in a comment on this question addresses this quite well.
The offset of something is how many units it is from the start. The first thing is at the start so its offset is zero.
[...]
Note that the ISO standard doesn't specify where the items are laid out in memory. Padding bytes to create correct alignment are certainly possible. In a hypothetical environment where ints were only two bytes but their required alignment was 256 bytes, they wouldn't be at 0, 2 and 4 but rather at 0, 256 and 512.
And, if that book you're taking the excerpt from is really Inside the C++ Object Model, it's getting a little long in the tooth.
The fact that it's from '96 and discusses the internals underneath C++ (waxing lyrical about how good it is to know where the vptr is, missing the whole point that that's working at the wrong abstraction level and you should never care) dates it quite a bit.
[...]
The author apparently led the cfront 2.1 and 3 teams and, while this books seems of historical interest, I don't think it's relevant to the modern C++ language (and implementation), at least those bits I've read.

The language doesn't specify how member-pointers are represented, so anything you read in a book will just be an example of how they might be represented.
In this case, as you say, it sounds like the vptr occupies the first four bytes of the object; again, this is not something specified by the language. If that is the case, no accessible members would have an offset of zero, so there's no need to adjust the offsets to avoid zero; a member-pointer could simply be represented by the member's offset, with zero reserved for "null". It sounds like that is what your compiler does.
You may find that the offsets for non-polymorphic types are adjusted as you describe; or you may find that the representation of "null" is not zero. Either would be valid.
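What the language does guarantee can be checked without looking at the representation at all. This is a small sketch with a hypothetical non-polymorphic Point type: a null member pointer compares unequal to a pointer to the first member, even though that member sits at offset 0. How the compiler encodes the two values internally stays unspecified.

```cpp
#include <cassert>
#include <cstddef>

// Non-polymorphic and standard-layout, so x really is at offset 0.
struct Point { float x, y, z; };

static_assert(offsetof(Point, x) == 0, "first member starts the object");

// Returns true if the null member pointer is distinguishable from &Point::x.
inline bool null_differs_from_first_member() {
    float Point::*none = nullptr;
    float Point::*first = &Point::x;
    return none != first && first == &Point::x;
}

// Example use: the comparison must hold on every conforming compiler.
inline void check_member_pointers() {
    assert(null_differs_from_first_member());
}
```

The check passes whether the implementation stores offset plus one, reserves a special value such as -1 for null, or does something else entirely.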

class Point3d {
public:
virtual ~Point3d();
public:
static Point3d origin;
float x, y, z;
};
Since your class contains a virtual destructor, and (most of) the compiler(s) typically puts a pointer to the virtual function table as the first element in the object, it makes sense that the first of your data is at offset 4 (I'm guessing your compiler is a 32-bit compiler).
Note however that the C++ standard does not stipulate how data members should be stored inside the class, and even less how much space, if any, the virtual function table should take up.
[And yes, it's invalid (undefined behaviour) to take the address of an element that is not part of a "real" object, but I don't think this is causing an issue in this particular example - it may with a different compiler or on a different processor architecture, etc]

Unless you specify a different alignment, your expectation of the offsets being 5, ... would be wrong anyway. Normally the addresses of elements bigger than char are aligned on even addresses, and I guess even to the next 4-byte boundary. The reason is the efficiency of memory access in the CPU.
On some architectures, accessing an odd address could cause an exception (e.g. Motorola 68000), depending on the member, or at least a performance slowdown.

While it's true that a null pointer of type "pointer to member of a given type" must be different from any non-null value of that type, offsetting non-null pointers by one is not the only way that a compiler can ensure this. For example, my compiler uses a non-zero representation of null pointer-to-members.
namespace {
struct a {
int x, y;
};
}
#include <iostream>
int main() {
int a::*p = &a::x, a::*q = &a::y, a::*r = nullptr;
std::cout << "sizeof(int a::*) = " << sizeof(int a::*)
<< ", sizeof(unsigned long) = " << sizeof(long);
std::cout << "\n&a::x = " << *reinterpret_cast<long*>(&p)
<< "\n&a::y = " << *reinterpret_cast<long*>(&q)
<< "\nnullptr = " << *reinterpret_cast<long*>(&r)
<< '\n';
}
Produces the following output:
sizeof(int a::*) = 8, sizeof(unsigned long) = 8
&a::x = 0
&a::y = 4
nullptr = -1
Your compiler is probably doing something similar, if not identical. This scheme is probably more efficient for most 'normal' use cases for the implementation because it won't have to do an extra "subtract 1" every time you use a non-null pointer-to-member.

That book (available at this link) should make it much clearer that it is just describing a particular implementation of a C++ compiler. Details like the one you mention are not part of the C++ language specification -- it's just how Stanley B. Lippman and his colleagues decided to implement a particular feature. Other compilers are free to do things a different way.

Bit Aligning for Space and Performance Boosts

In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies on the fact that you gain performance when member variables are memory aligned. This is an obvious potential optimization that compilers would take advantage of, but by making sure each variable is aligned they end up bloating the size of the data structure.
Or that was his claim at least.
The real performance increase, he states, comes from using your brain and ensuring that your structure is properly designed to take advantage of speed increases while preventing the compiler bloat. He provides the following code snippet:
#pragma pack( push, 1 )
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
Using the above struct objects in an unspecified test he reports a performance increase of 15.6% (222ms compared to 192ms) and a smaller size for the FastStruct. This all makes sense on paper to me, but it fails to hold up under my testing:
Same time results and size (counting for the char unused[ 2 ])!
Now if the #pragma pack( push, 1 ) is isolated only to FastStruct (or removed completely) we do see a difference:
So, finally, here lies the question: Do modern compilers (VS2010 specifically) already optimize for the bit alignment, hence the lack of performance increase (but increase the structure size as a side effect, like Mike McShaffry stated)? Or is my test not intensive enough/inconclusive to return any significant results?
For the tests I did a variety of tasks from math operations, column-major multi-dimensional array traversing/checking, matrix operations, etc. on the unaligned __int64 member. None of which produced different results for either structure.
In the end, even if there was no performance increase, this is still a useful tidbit to keep in mind for keeping memory usage to a minimum. But I would love it if there was a performance boost (no matter how minor) that I am just not seeing.

It is highly dependent on the hardware.
Let me demonstrate:
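#include <cstring>   // memset
#include <ctime>     // clock, CLOCKS_PER_SEC
#include <iostream>
using namespace std;
#pragma pack( push, 1 )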
struct SlowStruct
{
char c;
__int64 a;
int b;
char d;
};
struct FastStruct
{
__int64 a;
int b;
char c;
char d;
char unused[ 2 ]; // fill to 8-byte boundary for array use
};
#pragma pack( pop )
int main (void){
int x = 1000;
int iterations = 10000000;
SlowStruct *slow = new SlowStruct[x];
FastStruct *fast = new FastStruct[x];
// Warm the cache.
memset(slow,0,x * sizeof(SlowStruct));
clock_t time0 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
slow[i].a += c;
}
}
clock_t time1 = clock();
cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
// Warm the cache.
memset(fast,0,x * sizeof(FastStruct));
time1 = clock();
for (int c = 0; c < iterations; c++){
for (int i = 0; i < x; i++){
fast[i].a += c;
}
}
clock_t time2 = clock();
cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;
// Print to avoid Dead Code Elimination
__int64 sum = 0;
for (int c = 0; c < x; c++){
sum += slow[c].a;
sum += fast[c].a;
}
cout << "sum = " << sum << endl;
return 0;
}
Core i7 920 @ 3.5 GHz
slow = 4.578
fast = 4.434
sum = 99999990000000000
Okay, not much difference. But it's still consistent over multiple runs. So the alignment makes a small difference on Nehalem Core i7.
Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 - generation Xeon)
slow = 22.803
fast = 3.669
sum = 99999990000000000
Now take a look...
6.2x faster!!!
Conclusion:
You see the results. You decide whether or not it's worth your time to do these optimizations.
EDIT :
Same benchmarks but without the #pragma pack:
Core i7 920 @ 3.5 GHz
slow = 4.49
fast = 4.442
sum = 99999990000000000
Intel Xeon X5482 Harpertown @ 3.2 GHz
slow = 3.684
fast = 3.717
sum = 99999990000000000
The Core i7 numbers didn't change. Apparently it can handle misalignment without trouble for this benchmark.
The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.
Taken from my comment:
If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.
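The misalignment the benchmark measures is visible in the member offsets. The sketch below uses std::int64_t in place of the MSVC-specific __int64 and hypothetical struct names, and it assumes a 4-byte int. Under pack(1) the 64-bit member of the "slow" layout starts at offset 1, so it is never 8-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
// Same member order as SlowStruct: a is pushed to an odd offset.
struct SlowPacked { char c; std::int64_t a; int b; char d; };
// Same member order as FastStruct: a starts the struct.
struct FastPacked { std::int64_t a; int b; char c; char d; char unused[2]; };
#pragma pack(pop)

// Without the pragma the compiler pads after c to keep a aligned.
struct SlowNatural { char c; std::int64_t a; int b; char d; };

static_assert(sizeof(int) == 4, "this sketch assumes 4-byte int");
static_assert(offsetof(SlowPacked, a) == 1, "a is misaligned under pack(1)");
static_assert(offsetof(FastPacked, a) == 0, "a stays aligned");
static_assert(sizeof(SlowPacked) == 14, "1 + 8 + 4 + 1, no padding");
static_assert(sizeof(FastPacked) == 16, "8 + 4 + 1 + 1 + 2");
static_assert(offsetof(SlowNatural, a) == alignof(std::int64_t),
              "default layout pads a up to its natural alignment");
```

With a 14-byte stride, an array of SlowPacked also moves the 64-bit member to a different misaligned position in each element, whereas every FastPacked element keeps it on a 16-byte stride.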

Such hand-optimizations are generally long dead. Alignment is only a serious consideration if you're packing for space, or if you have an enforced-alignment type like SSE types. The compiler's default alignment and packing rules are intentionally designed to maximize performance, obviously, and whilst hand-tuning them can be beneficial, it's not generally worth it.
Probably, in your test program, the compiler never stored any structure on the stack and just kept the members in registers, which do not have alignment, which means that it's fairly irrelevant what the structure size or alignment is.
Here's the thing: There can be aliasing and other nasties with sub-word accessing, and it's no slower to access a whole word than to access a sub-word. So in general, it's no more efficient, in time, to pack more tightly than word size if you're only accessing, say, one member.

Visual Studio is a great compiler when it comes to optimization. However, bear in mind that the current "Optimization War" in game development is not on the PC arena. While such optimizations may quite well be dead on the PC, on the console platforms it's a completely different pair of shoes.
That said, you might want to repost this question on the specialized gamedev stackexchange site, you might get some answers directly from "the field".
Finally, your results are exactly the same up to the microsecond which is dead impossible on a modern multithreaded system -- I'm pretty sure you either use a very low resolution timer, or your timing code is broken.

Modern compilers align members on different byte boundaries depending on the size of the member. See the bottom of this.
Normally you really shouldn't care about structure padding, but if you have an object that is going to have 1000000 instances or something, the rule of thumb is simply to order your members from biggest to smallest. I wouldn't recommend messing with the padding with #pragma directives.

The compiler is going to optimize for either size or speed, and unless you explicitly tell it, you won't know which you get. But if you follow the advice of that book you will win on most compilers either way. Put the biggest, aligned things first in your struct, then the half-size stuff, then single-byte stuff if any, and add some dummy variables to align. Using bytes for things that don't have to be bytes can be a performance hit anyway; as a compromise, use ints for everything (you have to know the pros and cons of doing that).
The x86 has made for a lot of bad programmers and compilers because it allows unaligned accesses, making it hard for many folks to move to other platforms (that are taking over). Although unaligned accesses work on an x86, you take a serious performance hit, which is why it is important to know how compilers work, both in general and the particular one you are using.
With caches, and with modern computer platforms relying on caches to get any kind of performance, you want to be both aligned and packed. The simple rule being taught gives you both... in general. It is very good advice. Adding compiler-specific pragmas is not nearly as good: it makes the code non-portable, and it doesn't take much searching through SO or googling to find out how often the compiler ignores the pragma or doesn't do what you really wanted.

On some platforms the compiler doesn't have an option: objects of types bigger than char often have strict requirements to be at a suitably aligned address. Typically the alignment requirements are identical to the size of the object, up to the size of the biggest word supported natively by the CPU. That is, short typically needs to be at an even address, long typically needs to be at an address divisible by 4, double at an address divisible by 8, and e.g. SIMD vectors at an address divisible by 16.
Since C and C++ require ordering of members in the order they are declared, the size of structures will differ quite a bit on the corresponding platforms. Since bigger structures effectively cause more cache misses, page misses, etc., there will be a substantial performance degradation when creating bigger structures.
Since I saw a claim that it doesn't matter: it matters on most (if not all) systems I'm using. Here is a simple example showing different sizes. How much this affects the performance obviously depends on how the structures are to be used.
#include <iostream>
struct A
{
char a;
double b;
char c;
double d;
};
struct B
{
double b;
double d;
char a;
char c;
};
int main()
{
std::cout << "sizeof(A) = " << sizeof(A) << "\n";
std::cout << "sizeof(B) = " << sizeof(B) << "\n";
}
./alignment.tsk
sizeof(A) = 32
sizeof(B) = 24

The C standard specifies that fields within a struct must be allocated at increasing addresses. A struct which has eight variables of type 'int8' and seven variables of type 'int64', stored in that order, will take 64 bytes (pretty much regardless of a machine's alignment requirements). If the fields were ordered 'int8', 'int64', 'int8', ... 'int64', 'int8', the struct would take 120 bytes on a platform where 'int64' fields are aligned on 8-byte boundaries. Reordering the fields yourself will allow them to be packed more tightly. Compilers, however, will not reorder fields within a struct absent explicit permission to do so, since doing so could change program semantics.
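The two layouts described above can be written out and checked. This is a sketch with hypothetical struct names, using std::int8_t and std::int64_t for the 'int8' and 'int64' fields; the 120-byte figure only applies where 64-bit integers are 8-byte aligned, so that assertion is conditional.

```cpp
#include <cstdint>

// Eight small fields first, then seven large ones: 8 + 7 * 8 = 64 bytes.
struct Grouped {
    std::int8_t  s0, s1, s2, s3, s4, s5, s6, s7;
    std::int64_t b0, b1, b2, b3, b4, b5, b6;
};

// Alternating fields: each int8 is followed by 7 bytes of padding.
struct Interleaved {
    std::int8_t s0; std::int64_t b0;
    std::int8_t s1; std::int64_t b1;
    std::int8_t s2; std::int64_t b2;
    std::int8_t s3; std::int64_t b3;
    std::int8_t s4; std::int64_t b4;
    std::int8_t s5; std::int64_t b5;
    std::int8_t s6; std::int64_t b6;
    std::int8_t s7;                   // followed by 7 bytes of tail padding
};

static_assert(sizeof(Grouped) == 64, "no padding needed in this order");
// 7 pairs * 16 bytes + 1 byte = 113, rounded up to a multiple of 8.
static_assert(alignof(std::int64_t) != 8 || sizeof(Interleaved) == 120,
              "interleaving costs 56 bytes of padding");
```

Both structs hold exactly the same fields, so the 56-byte difference is purely a consequence of declaration order.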
