Can __attribute__((packed)) affect the performance of a program?

I have a structure called log that has 13 chars in it. After doing a sizeof(log), I see that the size is not 13 but 16. I can use __attribute__((packed)) to get it down to the actual size of 13, but I wonder if this will affect the performance of the program. It is a structure that is used quite frequently.
I would like to be able to read the size of the structure (13, not 16). I could use a macro, but if this structure is ever changed, i.e. fields are added or removed, I would like the new size to be updated without changing a macro, because I think that is error-prone. Any suggestions?

Yes, it will affect the performance of the program. Adding the padding means the compiler can use integer load instructions to read things from memory. Without the padding, the compiler must load things separately and do bit shifting to get the entire value. (Even if it's x86 and this is done by the hardware, it still has to be done).
Consider this: Why would compilers insert random, unused space if it was not for performance reasons?

Don't use __attribute__((packed)). If your data structure is in-memory, allow it to occupy its natural size as determined by the compiler. If it's for reading/writing to/from disk, write serialization and deserialization functions; do not simply store cpu-native binary structures on disk. "Packed" structures really have no legitimate uses (or very few; see the comments on this answer for possible disagreeing viewpoints).

Yes, it can affect the performance. In this case, if you allocate an array of such structures with the ((packed)) attribute, most of them must end up unaligned (whereas if you use the default packing, they can all be aligned on 16 byte boundaries). Copying such structures around can be faster if they are aligned.
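To make the array argument concrete, here is a minimal sketch. The field names are made up (the question does not show the real struct), and __attribute__((packed)) is a GCC/Clang extension. It assumes uint32_t is 4-byte aligned, which is typical but not guaranteed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 13-byte log record; field names are illustrative only.
struct LogDefault {
    std::uint32_t timestamp;
    std::uint32_t source;
    std::uint32_t code;
    char          level;
};

// Same fields, packed with the GCC/Clang extension from the question.
struct __attribute__((packed)) LogPacked {
    std::uint32_t timestamp;
    std::uint32_t source;
    std::uint32_t code;
    char          level;
};

static_assert(sizeof(LogPacked) == 13, "packed: no padding at all");
static_assert(alignof(LogPacked) == 1, "packed: byte alignment");
static_assert(sizeof(LogDefault) % alignof(LogDefault) == 0,
              "default layout: size is a multiple of the alignment");

// Byte offset of element i from the start of an array of T.
template <typename T>
constexpr std::size_t element_offset(std::size_t i) {
    return i * sizeof(T);
}
```

With these definitions, element 1 of a LogPacked array starts at byte 13, so its 4-byte fields are misaligned, while element 1 of a LogDefault array starts at byte 16.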

Yes, it can affect performance. How depends on what it is and how you use it.
An unaligned variable can possibly straddle two cache lines. For example, if you have 64-byte cache lines, and you read a 4-byte variable from an array of 13-byte structures, there is a 3 in 64 (4.7%) chance that it will be spread across two lines. The penalty of an extra cache access is pretty small. If everything your program did was pound on that one variable, 4.7% would be the upper bound of the performance hit. If logging represents 20% of the program's workload, and reading/writing to that structure is 50% of logging, then you're already at a small fraction of a percent.
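The 3-in-64 figure can be checked by brute force. This is only a sketch under stated assumptions: 64-byte lines, an array that starts on a line boundary, and the 4-byte field sitting at offset 0 of each 13-byte record. The names are mine, not from the question.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine  = 64;  // assumed line size
constexpr std::size_t kRecordSize = 13;  // packed record from the question
constexpr std::size_t kFieldSize  = 4;   // the 4-byte variable being read

// True if a kFieldSize-byte field starting at byte `start` crosses a
// cache-line boundary.
constexpr bool straddles(std::size_t start) {
    return start / kCacheLine != (start + kFieldSize - 1) / kCacheLine;
}

// Counts how many of the first `n` records have a straddling field.
constexpr std::size_t count_straddles(std::size_t n) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (straddles(i * kRecordSize)) ++hits;
    }
    return hits;
}

static_assert(count_straddles(64) == 3, "3 of every 64 records straddle");
```

Because 13 and 64 share no common factor, 64 consecutive records hit every possible starting offset within a line exactly once, and only offsets 61, 62 and 63 make a 4-byte field cross into the next line.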
On the other hand, presuming that the log needs to be saved, shrinking each record by 3 bytes is saving you 19%, which translates to a lot of memory or disk space. Main memory and especially the disk are slow, so you will probably be better off packing the log to reduce its size.
As for reading the size of the structure without worrying about the structure changing, use sizeof. However you like to do numerical constants, be it const int, enum, or #define, just add sizeof.
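Something along these lines, as a sketch. LogRecord is a stand-in for the real struct, and the constant names are invented for the example.

```cpp
#include <cstddef>

// Stand-in for the real structure: 13 one-byte fields.
struct LogRecord {
    char bytes[13];
};

// Both constants follow the struct automatically when fields change,
// so there is no hand-maintained number to forget.
constexpr std::size_t kLogRecordSize = sizeof(LogRecord);
enum { LOG_RECORD_SIZE = sizeof(LogRecord) };
```

If some other piece of code really depends on the number staying 13 (a file format, say), a static_assert(sizeof(LogRecord) == 13, "...") next to the struct will turn a silent size change into a compile error.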

As with all other performance optimizations, you'll need to profile your code to find the right answer. The right answer will vary by architecture and by how you use your structure.
If you're creating gigantic arrays the space savings from packing might mean the difference between fitting and not fitting in cache. Or your data might already fit into your cache, in which case it will make no difference. If you're allocating large numbers of the structures in an STL associative container that allocates the storage for your struct with operator new it might not matter at all --- operator new might round your storage up to something that's aligned anyway.
If most of your structures live on the stack the extra storage might already be optimized away anyway.
For a change this simple to test, I suggest building a timing rig and then trying things both ways. For further optimizations I suggest using a profiler to identify your bottlenecks and go from there.
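A timing rig for this could look roughly like the sketch below. The record layout and function names are invented for illustration, __attribute__((packed)) is GCC/Clang-specific, and a real measurement would need warm-up runs, repetition, and your actual access pattern.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical records: same fields, default versus packed layout.
struct RecDefault { std::uint32_t a, b, c; char tag; };
struct __attribute__((packed)) RecPacked { std::uint32_t a, b, c; char tag; };

// Sums one field across the array; the result is returned so the loop
// cannot be discarded as dead code.
template <typename Rec>
std::uint64_t sum_field(const std::vector<Rec>& v) {
    std::uint64_t total = 0;
    for (const Rec& r : v) total += r.b;
    return total;
}

// Runs `reps` passes over `n` records, adds the sums to `checksum`,
// and returns the elapsed wall-clock time in seconds.
template <typename Rec>
double time_passes(std::size_t n, int reps, std::uint64_t& checksum) {
    assert(n > 0 && reps > 0);
    std::vector<Rec> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i].b = static_cast<std::uint32_t>(i);

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) checksum += sum_field(v);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Call time_passes<RecDefault>(...) and time_passes<RecPacked>(...) with the same arguments, check that both checksums match, and compare the returned times across several runs and array sizes (one that fits in cache, one that does not).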

Related

is using an integer to store many bool worth the effort?

I was considering ways to reduce memory footprint, and it is constantly mentioned that a bool takes up more memory than it logically needs to, as a byproduct of processor design.
it is also sometimes mentioned that one could store several bool within an int.
I am wondering if this would actually be more memory efficient?
If we have a use case where we can use a significant portion of 32 (or 64) bools, and we decide to store all of them in a single int, then on the surface we have saved
7 (bits) * 32 (size of int) = 224 (bits) or 28 (bytes).
But in order to get each of those bits from the int, we need some method of masking, such as:
bit shifting the int in both directions, (int<<x)>>y; here we need to load and store x and y, which are probably ints, though you could make them smaller depending on the use case
masking the int, int & int2; here we also store an additional int, which is stored and loaded
Even if these aren't stored as variables and are defined statically within the code, they still use additional memory, because they increase the memory footprint of the instructions, as do the instructions for the masking steps.
is there any way to do this that isn't actually worse for memory usage than just taking the hit on 7 wasted bits?

You are describing a text book example of a trade-off.
Yes, several bools in one int is hugely more memory efficient - in itself.
Yes, you need to spend code to use that.
Yes, for only a few bools (for different values of "few"), the code might take more space than you save.
However, you could look at the kind of memory which is used. In some environments, RAM (which is saved by your idea) is much more expensive than ROM (which has to be paid for your idea).
Also, the price to pay is mostly paid once for implementation and only paid a fraction for using, especially when the using code is reused, e.g. in loops.
Altogether, in case of many bools, you can save more than you pay.
The point of actually saving needs to be determined for the special case.
On the other hand, you have missed one "currency" on the price tag for the idea. You not only pay in memory, you also pay in execution time. You focused your question on memory, so I won't elaborate here. But for anything time-critical, you should take the longer execution time into consideration. You might find that saving memory is quite achievable with your idea, but the whole thing gets unbearably slow.
Again from the other side, as Eric Postpischil points out in a comment, execution speed can also improve due to cache effects from better memory footprint.
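For reference, the "code you need to spend" is small. A minimal sketch of the masking helpers, with names I made up for the example, could be:

```cpp
#include <cstdint>

// 32 flags stored in one 32-bit word; `index` must be in [0, 31].
constexpr std::uint32_t set_flag(std::uint32_t word, unsigned index) {
    return word | (std::uint32_t{1} << index);
}

constexpr std::uint32_t clear_flag(std::uint32_t word, unsigned index) {
    return word & ~(std::uint32_t{1} << index);
}

constexpr bool test_flag(std::uint32_t word, unsigned index) {
    return (word >> index) & std::uint32_t{1};
}

// Storage saved for 32 flags compared with 32 separate bools
// (28 bytes where sizeof(bool) == 1, which is typical).
constexpr unsigned long bytes_saved = 32 * sizeof(bool) - sizeof(std::uint32_t);
```

Each helper typically compiles to a shift plus one AND or OR, and that code is shared by every flag that uses it, which is why the cost is mostly paid once.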

I am wondering if this would actually be more memory efficient?
Potentially yes. Storing multiple bools inside a single object may use less storage compared to having distinct bool object for each, if the number of bools is great enough to offset the cost in memory use of the instructions.
Also consider that there are more considerations than space efficiency. Most typically, people are concerned about time efficiency as well. In this regard, compacting bools may be more or less efficient depending on the details of the use case.
is using an integer to store many bool worth the effort?
It can be worth the effort. It can also be counter productive. The difference can be minuscule or significant. Both in terms of time and space efficiency. Most accurate way to find out is to measure it.
It's not necessary to implement this yourself though, since there are solutions in the standard library. std::vector<bool> and std::bitset both implement compact storage of bools. Using bitfields may also be an option (just remember to not rely on the internal representation).
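A small sketch of the std::bitset route, for comparison with 32 separate bools. The type and helper names are invented for the example, and the exact sizeof of a bitset is implementation-defined (commonly 4 or 8 bytes for 32 bits).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// 32 separate bools versus one std::bitset<32>.
struct SeparateFlags { bool flag[32]; };
using PackedFlags = std::bitset<32>;

// Sets flag `i`, checking the index in debug builds.
inline void set_checked(PackedFlags& f, std::size_t i) {
    assert(i < f.size());
    f.set(i);
}

// Number of flags currently set.
inline std::size_t count_enabled(const PackedFlags& f) {
    return f.count();
}
```

On mainstream implementations sizeof(PackedFlags) is well below sizeof(SeparateFlags), and test(), set(), reset() and count() replace the hand-written masks.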

Is there an order you should declare variables in C++? [duplicate]

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Two issues here:
Whether and when keeping certain fields together is an optimization.
How to actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
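One common way to keep such fields apart is to give each its own cache line. This is only a layout sketch: the 64-byte line size is an assumption about the target CPU and the struct names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed; check your target CPU

// Two counters that will typically end up in the same cache line, so
// threads updating `a` and `b` contend with each other.
struct SharedLine {
    std::atomic<long> a;
    std::atomic<long> b;
};

// Each counter is aligned to its own cache line, trading memory for
// independence between cores.
struct SeparateLines {
    alignas(kCacheLine) std::atomic<long> a;
    alignas(kCacheLine) std::atomic<long> b;
};

static_assert(alignof(SeparateLines) == kCacheLine, "struct starts on a line boundary");
static_assert(sizeof(SeparateLines) == 2 * kCacheLine, "one full line per counter");
```

Whether this actually helps depends on the contention pattern, so it is worth measuring both layouts under the real thread workload.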
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
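The two layouts can be checked directly. This sketch uses int32_t instead of int to pin the size to 4 bytes, and assumes 4-byte alignment for it, as in the example above; the struct names are mine.

```cpp
#include <cstddef>
#include <cstdint>

// int, char, int, char: padding after each char.
struct Interleaved {
    std::int32_t a;
    char         b;
    std::int32_t c;
    char         d;
};

// int, int, char, char: only tail padding.
struct Grouped {
    std::int32_t a;
    std::int32_t c;
    char         b;
    char         d;
};

static_assert(sizeof(Interleaved) == 16, "3 bytes before c, 3 at the end");
static_assert(sizeof(Grouped) == 12, "2 bytes of tail padding");
```

offsetof(Interleaved, c) is 8 here, which shows the 3 padding bytes sitting between b and c.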
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next smallest, and so on down to members with no alignment requirement. For example if I'm trying to write portable code I might come up with this:
struct some_stuff {
    double d;   // I expect double is 64-bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.

Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutter's articles on the subject.
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottleneck instead of arbitrarily changing stuff in your code base.

It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.

We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
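The three ranges above follow from one rule: the 5-bit immediate is scaled by the access size. A sketch of that check (the function name is made up; it models only the encoding limit described here, not any compiler's actual code generation):

```cpp
#include <cstddef>

// True if a field of `size` bytes (1, 2 or 4) at byte `offset` from the
// base register can be reached with a 5-bit immediate scaled by `size`.
constexpr bool fits_imm5(std::size_t offset, std::size_t size) {
    return offset % size == 0 && offset / size <= 31;
}

static_assert(fits_imm5(31, 1) && !fits_imm5(32, 1), "bytes: this+0..this+31");
static_assert(fits_imm5(62, 2) && !fits_imm5(64, 2), "halfwords: this+0..this+62");
static_assert(fits_imm5(124, 4) && !fits_imm5(128, 4), "words: this+0..this+124");
```

This is why "smallest first" helps on this target: byte-sized members run out of single-instruction range four times sooner than word-sized ones.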
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)

While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.

Well the first member doesn't need an offset added to the pointer to access it.

In C#, the order of the members is determined by the compiler unless you apply the [StructLayout(LayoutKind.Sequential)] or [StructLayout(LayoutKind.Explicit)] attribute, which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize padding while aligning the data types on their natural boundaries (i.e. 4-byte ints start on 4-byte addresses).

I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
Big mess-up? Without alignment switches, low-memory options, et al., we're going to have one unsigned char using a 64-bit word on your DDR3 DIMM, another 64-bit word for the other, and the unavoidable one for the long.
So, that's one fetch per variable.
However, packing or reordering them will cause one fetch and one AND mask to be able to use the unsigned chars.
So speed-wise, on a current machine with 64-bit memory words, aligning, reordering, etc. are no-nos. I do microcontroller stuff, and there the differences between packed and non-packed are really noticeable (talking about <10 MIPS processors with 8-bit memory words).
On the side, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm instructs you to do and what the compiler is able to optimize, often results in burning rubber with no real effect. That, and a write-only piece of syntactically dubious code.
The last step-forward in optimization I saw (in uPs, don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (much more general view of speed/pointer resolution/memory packing, etc), and have the linker trash non-called library functions, methods, etc.
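Since the layout depends on the compiler and ABI, it is easy to check rather than guess. A sketch for the case where the three variables above are members of one struct (the struct name is mine; locals on the stack may be placed differently):

```cpp
#include <cstddef>

// The three variables from the example, as members of one struct.
struct Vars {
    unsigned char a;
    unsigned char b;
    long          c;
};

// On mainstream ABIs the two chars sit in adjacent bytes, and only the
// long is pushed up to its own alignment boundary.
static_assert(offsetof(Vars, a) == 0, "a is first");
static_assert(offsetof(Vars, b) == 1, "b is the very next byte");
static_assert(offsetof(Vars, c) == alignof(long), "c starts at the next long boundary");
```

On a typical 64-bit Linux target this gives sizeof(Vars) == 16: two chars, six bytes of padding, then the 8-byte long.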

In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.

I highly doubt that would have any bearing on CPU improvements - maybe readability. You can optimize the executable code if the commonly executed basic blocks that are executed within a given frame are in the same set of pages. This is the same idea, but I would not know how to create basic blocks within the code. My guess is the compiler puts the functions in the order it sees them with no optimization here, so you could try and place common functionality together.
Try and run a profiler/optimizer. First you compile with some profiling option then run your program. Once the profiled exe is complete it will dump some profiled information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years but not much has changed how they work.

What is meaning of locality of data structure?

I was reading following article,
What Every Programmer Should Know About Compiler Optimizations
There are other important optimizations that are currently beyond the
capabilities of any compiler—for example, replacing an inefficient
algorithm with an efficient one, or changing the layout of a data
structure to improve its locality.
Does that mean if I change sequence (layout) of data members in class, it can affect performance?
So,
class One
{
int data0;
abstract-data-type data1;
};
Differs in performance from,
class One
{
abstract-data-type data0;
int data1;
};
If this is true, what is rule of thumb while defining classes or data structure?

Locality in this sense is speaking mostly to cache locality. Writing data structures and algorithms to operate mostly out of cache makes the algorithm run as fast as it possibly can. Cache locality is one of the reasons quick sort is quick.
For a data structure, you want to keep the parts of your data structure that refer to each other relatively close to each other, to avoid flushing out useful cache lines.
Also, you can rearrange your data structure so that the compiler will use the minimum amount of memory required to hold all the members and still efficiently access them. This helps make sure your data structure consumes the minimum number of cache lines.
A single cache line on a current x86-64 architecture (core i7) is 64 bytes.

I am not an expert on data structure locality, but it has to do with how you organize your data so that the CPU does not have to cache bits of memory from all over the place, slowing down your program by constantly waiting for memory fetches.
For example, a linked list can be scattered all over your memory. However, if you changed this into an array of "elements", they would all be in contiguous memory; this would save memory access time if you needed to traverse the array all at once (it's just one example).
Additionally:
Also be careful with some of the STL containers. Again, I am not 100% sure which are the best, but some of them (e.g. list) are quite bad in terms of locality.
Another, perhaps more common, example is an array of pointers, where the pointed-to elements can be scattered around memory.
Of course, you cannot always avoid this easily, because you sometimes need to be able to dynamically add/move/insert/delete elements...
Summary:
It basically means: take care how you lay out your data with regard to memory access.
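As a sketch of the list-versus-array point: the two traversals below compute the same result, but std::vector keeps its elements contiguous while std::list allocates each node separately. The helper names are made up, and any speed difference has to be measured on your own data.

```cpp
#include <cassert>
#include <list>
#include <numeric>
#include <vector>

// Builds a container holding 1..n.
template <typename Container>
Container make_sequence(int n) {
    assert(n >= 0);
    Container c;
    for (int i = 1; i <= n; ++i) c.push_back(i);
    return c;
}

// Traverses nodes that may be scattered across the heap.
inline long sum_list(const std::list<int>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0L);
}

// Traverses one contiguous block, so neighbouring elements share cache lines.
inline long sum_vector(const std::vector<int>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0L);
}
```

A std::vector of pointers to separately allocated objects behaves more like the list than like the vector of values, which is the "array of pointers" case mentioned above.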

Sort class members by how frequently you will be accessing them. This maximizes the "hotness" of the cache line that contains the head of your class, increasing the likelihood of it remaining cached. Another factor that you care about is packing - due to alignment, rearranging the order in which members are declared could lead to a reduction in the size of your class which would in turn reduce cache pressure.
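A sketch of the "hot fields first" idea, with an invented Particle type and an assumed 64-byte cache line. The static_assert only checks that the frequently used fields fit in the first line of the object; it says nothing about where the object itself is placed.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// Hypothetical class: fields touched every frame come first,
// rarely used ones go last.
struct Particle {
    // hot
    float x, y, z;
    float vx, vy, vz;
    std::uint32_t flags;
    // cold
    char debug_name[64];
    std::uint64_t created_at;
};

// Number of bytes from the start of the object to the end of the hot fields.
constexpr std::size_t kHotBytes = offsetof(Particle, flags) + sizeof(std::uint32_t);

static_assert(kHotBytes <= kCacheLine, "hot fields fit in the first cache line");
```

If someone later inserts a large member in front of the hot fields, the static_assert fails at compile time instead of the slowdown showing up in a profile.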
(None of them are definitive, of course. These rules of thumb aren't a substitute for profiling.)

Optimal Struct size for modern systems

I've read that the ideal size of a structure for performance, that's going to be used in a large collection, is 32 bytes. Is this true and why? Does this affect 64-bit processors or is it not applicable?
This is in context of modern (2008+) home Intel-based systems.

The ideal size of a struct is enough to hold the information it needs to contain.

The optimal size for a struct is usually the minimum size needed to store whatever data it's supposed to contain without requiring any hacks like bit twiddling/misaligned accesses to make it fit.

The ideal size of a structure is likely to be one cache line (or a sub-multiple thereof). Level one cache lines are typically 32 or 64 bytes. Splitting an element of a data structure across a cache line boundary will require two main memory accesses to read or write it instead of one.
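A sketch of what "a sub-multiple of the cache line" buys you, assuming 64-byte lines and a made-up 32-byte element type: when the element size divides the line size and the array is aligned to the element size, no element crosses a line boundary.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// Hypothetical 32-byte element, aligned to its own size.
struct alignas(32) Cell {
    float v[8];
};

static_assert(sizeof(Cell) == 32, "eight 4-byte floats");
static_assert(kCacheLine % sizeof(Cell) == 0, "size divides the line size");

// True if element `index` of a line-aligned Cell array spans two lines.
constexpr bool crosses_line(std::size_t index) {
    return (index * sizeof(Cell)) / kCacheLine !=
           (index * sizeof(Cell) + sizeof(Cell) - 1) / kCacheLine;
}

static_assert(!crosses_line(0) && !crosses_line(1) && !crosses_line(2),
              "no element straddles a boundary");
```

A 24-byte or 40-byte element would not have this property, although the smaller one might still win overall by fitting more elements into the cache.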

I don't think there is a reasonable answer to your question. Without any information on the context of the application, the "ideal size of a structure" is way, way underspecified.
As an aside, 32 bits is the space of one modern integer -- it isn't large enough for a "struct" except of a couple of characters or bitfields.
