Relationship between physical registers and Intel SIMD variables? - c++

What is the relationship between physical processor registers and the variables used in Intel intrinsics (e.g. __m128)?
A diagram explaining SIMD typically shows two registers, but references to "register pressure" on the Intel forums and to "register coloring" in this question suggest there is more going on.
Can any number of variables representing registers be declared? How can this be when they're closely tied to a finite physical resource? What should one be aware of regarding how physical registers are chosen? What happens if more registers are declared than exist?
Can multiple pairs of registers be active at the same time?
Are there different types of physical registers?

The variable types like __m128, __m128i, __m128d, ... are there mainly to protect you. They ensure that you don't apply standard operators like +, -, &, |, ==, ... to them, and that the compiler throws an error if you try to assign the wrong type. They tell the compiler to load these values into the appropriate class of register (XMM* in this case), while still leaving it free to choose which one, or to spill them to the stack if all suitable registers are taken. They also ensure that any time such a value is stored on the stack it keeps the correct alignment (16-byte alignment in this case), so that intrinsic instructions that rely on alignment don't cause faults.
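As a minimal sketch of how the types steer you toward the matching intrinsics (the function names here are just illustrative wrappers):
#include <emmintrin.h>  // SSE2 intrinsics and the __m128i/__m128d types

__m128i add_ints(__m128i a, __m128i b)
{
    // __m128i pairs with the integer intrinsics...
    return _mm_add_epi32(a, b);   // packed 32-bit integer add
}

__m128d add_doubles(__m128d a, __m128d b)
{
    // ...and __m128d with the packed-double ones; passing the wrong type
    // to either intrinsic is a compile-time error.
    return _mm_add_pd(a, b);      // packed double add
}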
You may tie one of these variables tightly to a specific physical register if you like, using GCC's explicit register variable syntax:
register __m128i myXMM1 asm("xmm1");
But it's better to just let the compiler do its magic and choose the registers for you to allow better optimization.
Any number of these variables can be declared, and even overbooking your XMM register store might not result in using stack space, as long as your working set of registers remains small. Compiler scoping will usually realize when a value is no longer used and allow the optimizer to not store it back to the stack. Sometimes you can help the compiler out by creating your own scoped stack frame:
__m128i storedVar;
{
    __m128i tempVar1, tempVar2, tempVar3;
    // do some operations with tempVar1 -> 3
    storedVar = tempVar1;
}
{
    __m128i tempVar4, tempVar5, tempVar6, tempVar7, tempVar8;
    // do some operations with tempVar4 -> 8
    storedVar = tempVar4;
}
return storedVar;
Since the variables go out of scope at the closing curly brace, the compiler sees that the registers which used to contain those values are now free, so it doesn't need to exceed the total number of available XMM registers.
If you do overbook your register store, and all values need to be maintained, then the compiler will allocate the appropriate size on the stack and ensure that it is properly aligned, and the value of the XMM register will be swapped out to the stack to make room for a new value. Keep in mind that the stack space is well cached, so writes and reads there are not as harmful as you might have expected. The real hit that you take is the necessity of the extra move operations to swap them in and out.
There are different kinds of physical registers by width - MMX (64-bit), XMM (128-bit), YMM (256-bit), and ZMM (512-bit) - each associated with the corresponding C/C++ intrinsic data type. The different "flavors" of a given width (__m128, __m128i, __m128d, ...) can all reside in any register of that width. The type merely forces you to use the matching intrinsic (_mm_and_si128 vs. _mm_and_pd, for example), which in turn generates the appropriate version of the instruction.
Something like the "and" is a good example because the resulting operation will be identical regardless of type - a bitwise "and". But using the wrong type can incur latency according to what I've read in the Intel docs. The integer instructions and floating point instructions have separate execution queues, and whenever data has to move from one execution queue to the other, there is a penalty. So generally it is good practice to choose the appropriate data type so that the appropriate instructions can be generated, and stay within the realm of that data type.

What is the difference between data types aside from length in memory?

I've been trying to wrap my head around how C/C++ code is represented in machine code and I'm having trouble understanding what data types actually are apart from a designation of memory length.
Besides their length in memory, types are also associated with:
the set of values that variables of that type can represent;
the layout in memory of that type (e.g. the meaning, if any, attached to each bit or byte representing a variable);
the set of operations that can act on a variable;
the behaviour of those operations.
Types are not necessarily represented directly in machine code. A compiler emits a set of instructions and data (in a way that varies between target platforms) that manipulate memory and machine registers. The type of each variable, in C source, gives information to the compiler about what memory to allocate for it, and the compiler makes decisions for mapping between expressions (in C statements) and usage of registers and machine instructions to give the required effects.
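Here's a small illustrative example (the bit pattern is arbitrary): the same four bytes yield entirely different values depending on the type the compiler interprets them as:
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint32_t bits = 0x40490FDBu;   // just a 32-bit pattern

    float f;
    std::memcpy(&f, &bits, sizeof f);   // same bytes, viewed as a float

    // As an unsigned integer the pattern means 1078530011;
    // interpreted as an IEEE-754 float it is approximately pi.
    std::printf("as uint32_t: %u\n", static_cast<unsigned>(bits));
    std::printf("as float:    %f\n", f);
    return 0;
}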

Understanding how the intrinsic functions for SSE use memory

Before I ask my question, just a little background information.
In C-like languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.
int a = rand(); // conceptually, you created variable a and assigned it a value in RAM
In assembly language, to do the same thing you essentially need the result of rand() in a register and a pointer to a; you would then do a store instruction to get the register contents into RAM.
When you program in C++, for example, and assign and manipulate value-type objects, you usually don't even have to think about their addresses, or how and when they will be stored in registers.
Using SSE intrinsics is strange because they appear to sit somewhere in between coding in C and coding in assembly, in terms of the conceptual memory model.
You can call load/store functions, and they return objects. A math operation like _mm_add will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call _mm_store.
Consider the following example:
inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);
    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    // store the results
    _mm_storeu_ps(y, result);
}
There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntactic sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store at the end, you just kept the result - would the result then be more than syntactic sugar, and actually hold data?
TL;DR: How am I supposed to think about memory when using SSE intrinsics?
An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have: 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not that much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.
Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.
Vector variables aren't special. They will be spilled to memory and re-loaded when needed later if the compiler runs out of registers when optimizing a loop (or across a call to a function the compiler can't "see" into, and so can't know doesn't touch the vector regs).
gcc -O0 actually does tend to store to RAM when you set them, instead of keeping __m128i variables only in registers, IIRC.
You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)
Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.
The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.
Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.
Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.
This came up in comments on How to specify alignment with _mm_mul_ps.
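Here's a minimal sketch of the load vs. loadu distinction described above, assuming SSE-only code generation (no AVX); whether the compiler actually folds the aligned load into a memory operand depends on the compiler and optimization level:
#include <xmmintrin.h>

__m128 accumulate_aligned(const float* p, __m128 acc)
{
    // p is promised to be 16-byte aligned, so the compiler may fold this
    // into a single "addps xmm, [mem]" memory operand.
    return _mm_add_ps(acc, _mm_load_ps(p));
}

__m128 accumulate_unaligned(const float* p, __m128 acc)
{
    // No alignment promise: without AVX the compiler has to emit a separate
    // movups first, because a folded SSE memory operand would fault on an
    // unaligned address.
    return _mm_add_ps(acc, _mm_loadu_ps(p));
}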
The store intrinsics apparently have two purposes:
To tell the compiler whether it should use the aligned or unaligned asm instruction.
To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).
Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Alignment requirement of function pointers

We know that the value of a pointer to data should be properly aligned. For example, the value of a pointer to double should be a multiple of 8. So I'm wondering whether a pointer to function has similar requirements.
Alignment of both data and code is highly machine dependent.
On many processors, reading, for example, a double at an unaligned address will cause a fault (hardware exception, trap, or whatever you want to call it). This is either handled in software [slow, often 10-1000x slower than aligned access] or causes the application performing the operation to fail (similar to accessing an invalid memory location in a modern OS). On x86, for example, it will be slower but typically not fail, because the processor will, at least in some cases, have to do two smaller read operations and combine them before it gets the value of the double.
Code may have alignment requirements as well. Most RISC processors have fixed-size code words - 4 bytes being a common size - and they should be aligned to that size. ARM in "thumb" mode uses a 2-byte instruction size, with some instructions carrying extra data in another word after.
On the other hand, x86 has a "single byte" alignment requirement, and the 68K, for example, only requires code to be aligned at 2 bytes. So in that respect the alignment needs vary. Beyond that, there are efficiency reasons to prefer a certain alignment - for example, starting functions/branches at 8-, 16- or 32-byte boundaries is often beneficial, and I know that some older x86 processors had limits on "how many branch predictions there could be for a given N bytes of code" - meaning that if you have many different branches in a short piece of code, some would have to go without branch prediction because the "slots" for that location were already full.
So, compilers will (sometimes) pad code to align functions for performance reasons. However, this is not ALWAYS a win - it wastes cache-space with "padding", and it really depends on how the code is used. Compilers typically know this, at least if you use feedback/profile based optimisations (where the code is run with instrumentation to count how the code is used, and the optimisation is based on the results of this).
As a rule, however, function pointers can point anywhere that is a legal address for "code", so the fundamental requirement is typically 1, 2 or 4 bytes, based on the architecture of the processor itself.
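For a quick, non-portable sanity check (converting a function pointer to an integer is implementation-defined, and the 16-byte figure is just a common compiler choice, not a requirement), something like this prints how function entry points happen to be aligned:
#include <cstdint>
#include <cstdio>

static void f() {}
static void g() {}

int main()
{
    // The low bits of a function's address show its effective alignment;
    // on x86 a function pointer only has to point at valid code, but
    // compilers often align entry points anyway.
    std::printf("f entry %% 16 = %u\n",
                static_cast<unsigned>(reinterpret_cast<std::uintptr_t>(&f) % 16));
    std::printf("g entry %% 16 = %u\n",
                static_cast<unsigned>(reinterpret_cast<std::uintptr_t>(&g) % 16));
    return 0;
}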

Optimizing member variable order in C++

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of CPU time saved by this trick would be minimal - it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and to just start coding this way by default.
Two issues here:
Whether and when keeping certain fields together is an optimization.
How to actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
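A minimal sketch of that padding effect; the expected sizes assume a typical ABI where int is 4 bytes and 4-aligned, so treat the numbers as illustrative rather than guaranteed:
#include <cstdio>

struct AsDeclared { int a; char b; int c; char d; };  // int, char, int, char
struct Reordered  { int a; int c; char b; char d; };  // int, int, char, char

int main()
{
    // Typically prints 16 and 12: the first layout needs 3 bytes of padding
    // before the second int and 3 more at the end; the second needs only 2
    // at the end, to round the size up to a multiple of 4.
    std::printf("AsDeclared: %zu bytes\n", sizeof(AsDeclared));
    std::printf("Reordered:  %zu bytes\n", sizeof(Reordered));
    return 0;
}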
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the most strictly aligned types first, then the next most strictly aligned, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with this:
struct some_stuff {
    double d;    // I expect double is 64-bit IEEE, it might not be
    uint64_t l;  // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i;  // 4 bytes, usually 4-aligned
    int32_t j;   // same
    short s;     // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];   // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;      // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
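If your compiler is recent enough, you don't have to guess: C++11's alignof reports the requirement directly. A short sketch (the printed values vary by platform):
#include <cstdint>
#include <cstdio>

int main()
{
    // For the fundamental types, alignof(T) usually equals sizeof(T),
    // which is exactly the "best guess" rule of thumb described above.
    std::printf("alignof(char)     = %zu\n", alignof(char));
    std::printf("alignof(short)    = %zu\n", alignof(short));
    std::printf("alignof(int)      = %zu\n", alignof(int));
    std::printf("alignof(double)   = %zu\n", alignof(double));
    std::printf("alignof(uint64_t) = %zu\n", alignof(std::uint64_t));
    return 0;
}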
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutter's articles on the subject here.
I've said it before and I'll keep saying it: the only real way to get a real performance increase is to measure your code and use tools to identify the real bottleneck, instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
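A hedged sketch of that trade-off, using the GCC/Clang packed attribute (a compiler extension, not standard C or C++); on a strict-alignment target the packed read is typically compiled into several byte loads plus shifts, while the padded read is a single aligned load:
#include <cstdint>

// Natural layout: 'value' lands at offset 4, aligned, at the cost of padding.
struct PaddedRec {
    std::uint8_t  tag;    // 3 bytes of padding follow
    std::uint32_t value;
};

// Packed layout: no padding, so 'value' lands at offset 1, unaligned.
struct __attribute__((packed)) PackedRec {
    std::uint8_t  tag;
    std::uint32_t value;
};

std::uint32_t read_padded(const PaddedRec* r) { return r->value; }
std::uint32_t read_packed(const PackedRec* r) { return r->value; }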
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
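A hypothetical sketch of that guideline (the field names are invented for illustration): the frequently used fields sit within the single-instruction offset range, and the bulky, rarely used ones go at the end:
#include <cstdint>

struct DeviceState {
    // Hot fields: all within the first 124 bytes, so on Thumb a word-sized
    // field here is reachable from 'this' with a single LDR.
    volatile std::uint32_t status;
    volatile std::uint32_t irq_mask;
    std::uint32_t rx_head;
    std::uint32_t rx_tail;

    // Cold fields: it matters much less that these land beyond the
    // single-instruction offset range.
    char          name[64];
    std::uint32_t error_counts[32];
};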
Well the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the members is determined by the compiler unless you apply the [StructLayout(LayoutKind.Sequential)] (or LayoutKind.Explicit) attribute, which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize padding while aligning the data types on their natural boundaries (i.e. a 4-byte int starts on a 4-byte address).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
A big mess-up? Without alignment switches, low-memory options, and so on, we're going to have one unsigned char occupying a 64-bit word on your DDR3 DIMM, another 64-bit word for the other, and yet another, unavoidable, one for the long.
So, that's a fetch for each variable.
However, packing them, or re-ordering them, means one fetch and one AND mask to extract the unsigned chars.
So, speed-wise, on a current machine with 64-bit memory words, alignment tweaks, reorderings, etc. are no-nos. I do microcontroller stuff, and there the differences between packed and non-packed are really noticeable (talking about <10 MIPS processors with 8-bit word memories).
As an aside, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm gives you and what the compiler is able to optimize, often results in burning rubber with no real effect - that, and a write-only piece of syntactically dubious code.
The last step forward in optimization I saw (for microcontrollers; I don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (a much more global view of speed, pointer resolution, memory packing, etc.), and have the linker discard uncalled library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing on CPU performance - maybe readability. You can optimize the executable code if the commonly executed basic blocks within a given frame are in the same set of pages. This is the same idea, but I wouldn't know how to create basic blocks within the code. My guess is the compiler puts the functions in the order it sees them, with no optimization here, so you could try to place common functionality together.
Try running a profiler/optimizer. First you compile with some profiling option, then run your program. Once the profiled exe is complete it will dump the profiling information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years, but not much has changed in how they work.