Understanding how the intrinsic functions for SSE use memory - c++

Before I ask my question, just a little background information.
In C languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.
int a = rand(); //conceptually, you created and assigned variable A in ram
In assembly language, to do the same thing, you essentially need the result of rand() stored in a register, and a pointer to "a". You would then do a store instruction to get the register content into ram.
When you program in C++, for example, and you assign and manipulate value-type objects, you usually don't even have to think about their addresses or how or when they will be stored in registers.
Using SSE intrinsics is strange because they appear to sit somewhere in between coding in C and assembly, in terms of the conceptual memory model.
You can call load/store functions, and they return objects. A math operation like _mm_add will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call _mm_store.
Consider the following example:
inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);
    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    // store the results
    _mm_storeu_ps(y, result);
}
There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntax sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store command at the end, you just kept the result: would the result then be more than syntax sugar, and actually hold data?
TL;DR: How am I supposed to think about memory when using SSE intrinsics?

An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e., 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not that much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.
Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.
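To make the "just values" model concrete, here is a minimal sketch based on the question's block() function. The names saxpy4 and saxpy4_value are made up for illustration, and the scale factor a is passed as a parameter instead of being a class member. The second function never calls a store intrinsic, yet the __m128 it returns holds real data like any other value:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical free-function version of the question's block():
// computes y[0..3] += a * x[0..3].
inline void saxpy4(float* y, const float* x, float a) {
    __m128 X = _mm_loadu_ps(x);  // a value holding 4 floats copied from x
    __m128 Y = _mm_loadu_ps(y);  // a value holding 4 floats copied from y
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    _mm_storeu_ps(y, result);    // only here is the memory at y written
}

// Same arithmetic, but the result is returned by value and nothing is
// written through y. The compiler decides whether the returned value
// lives in an XMM register or in memory.
inline __m128 saxpy4_value(const float* y, const float* x, float a) {
    return _mm_add_ps(_mm_loadu_ps(y),
                      _mm_mul_ps(_mm_loadu_ps(x), _mm_set1_ps(a)));
}
```

Whether X, Y and result ever touch the stack is up to the optimizer. With optimization enabled they will most likely stay in registers for code this small.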

Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).
gcc -O0 actually does tend to store to RAM when you set them, instead of keeping __m128i variables only in registers, IIRC.
You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)
Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.
The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as an m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx, by contrast, lists its memory operand as m64 or narrower, so it has no such alignment requirement.
Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.
Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.
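As a small illustration of "load intrinsics communicate alignment", here is a sketch with two made-up helper names, add_aligned and add_unaligned. Both compute the same sum; the only difference is the promise made to the compiler about the pointer:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// The caller promises that p is 16-byte aligned, so a non-AVX build is
// free to fold the load into a memory operand of addps.
inline __m128 add_aligned(__m128 acc, const float* p) {
    return _mm_add_ps(acc, _mm_load_ps(p));
}

// No alignment promise: without AVX the compiler has to keep a separate
// movups unless it can prove the alignment some other way.
inline __m128 add_unaligned(__m128 acc, const float* p) {
    return _mm_add_ps(acc, _mm_loadu_ps(p));
}
```

Passing a misaligned pointer to add_aligned is a bug that may fault at run time, so the aligned version is only appropriate when the alignment is actually guaranteed (for example by alignas(16) on the buffer).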
This came up in comments on How to specify alignment with _mm_mul_ps.
The store intrinsics apparently have two purposes:
To tell the compiler whether it should use the aligned or unaligned asm instruction.
To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).
Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Related

Objective difference between register and pointer in AVX instructions

Scenario: You are writing a complex algorithm using SIMD. A handful of constants and/or infrequently changing values are used. Ultimately, the algorithm ends up using more than 16 ymm, resulting in the use of stack pointers (e.g. opcode contains vaddps ymm0,ymm1,ymmword ptr [...] instead of vaddps ymm0,ymm1,ymm7).
In order to make the algorithm fit into the available registers, the constants can be "inlined". For example:
const auto pi256{ _mm256_set1_ps(PI) };
for (outer condition)
{
    ...
    const auto radius_squared{ _mm256_mul_ps(radius, radius) };
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(radius_squared, pi256) };
        ...
    }
}
... becomes ...
for (outer condition)
{
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(_mm256_mul_ps(radius, radius), _mm256_set1_ps(PI)) };
        ...
    }
}
Whether the disposable variable in question is a constant, or is infrequently calculated (calculated outer loop), how can one determine which approach achieves the best throughput? Is it a matter of some concept like "ptr adds 2 extra latency"? Or is it nondeterministic such that it differs on a case-by-case basis and can only be fully optimized through trial-and-error + profiling?
A good optimizing compiler should generate the same machine code for both versions. Just define your vector constants as locals, or use them anonymously for maximum readability; let the compiler worry about register allocation and pick the least expensive way to deal with running out of registers if that happens.
Your best bet for helping the compiler is to use fewer different constants if possible. e.g. instead of _mm_and_si128 with both set1_epi16(0x00FF) and 0xFF00, use _mm_andn_si128 to mask the other way. You usually can't do anything to influence which things it chooses to keep in registers vs. not, but fortunately compilers are pretty good at this because it's also essential for scalar code.
A compiler will hoist constants out of the loop (even inlining a helper function containing constants), or if only used in one side of a branch, bring the setup into that side of the branch.
The source code computes exactly the same thing with no difference in visible side-effects, so the as-if rule allows the compiler the freedom to do this.
I think compilers normally do register allocation and choose what to spill/reload (or just use a read-only vector constant) after doing CSE (common subexpression elimination) and identifying loop invariants and constants that can be hoisted.
When it finds it doesn't have enough registers to keep all variables and constants in regs inside the loop, the first choice for something to not keep in a register would normally be a loop-invariant vector, either a compile-time constant or something computed before the loop.
An extra load that hits in L1d cache is cheaper than storing (aka spilling) / reloading a variable inside the loop. Thus, compilers will choose to load constants from memory regardless of where you put the definition in the source code.
Part of the point of writing in C++ is that you have a compiler to make this decision for you. Since it's allowed to do the same thing for both sources, doing different things would be a missed-optimization for at least one of the cases. (The best thing to do in any particular case depends on surrounding code, but normally using vector constants as memory source operands is fine when the compiler runs low on regs.)
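The two source styles can be compared directly with a small sketch. To keep it buildable without AVX flags, it uses 128-bit SSE instead of the question's 256-bit vectors. The function names and the single flat loop are assumptions for illustration, and n is assumed to be a multiple of 4:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Style 1: the constant and the intermediate are named locals.
void areas_named(float* area, const float* r, std::size_t n) {
    const __m128 pi = _mm_set1_ps(3.14159265f);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 radius = _mm_loadu_ps(r + i);
        const __m128 radius_squared = _mm_mul_ps(radius, radius);
        _mm_storeu_ps(area + i, _mm_mul_ps(radius_squared, pi));
    }
}

// Style 2: everything is written inline inside the loop body.
void areas_inline(float* area, const float* r, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 radius = _mm_loadu_ps(r + i);
        _mm_storeu_ps(area + i,
                      _mm_mul_ps(_mm_mul_ps(radius, radius),
                                 _mm_set1_ps(3.14159265f)));
    }
}
```

Both functions perform the same multiplications in the same order, so the results are bit-identical. Comparing the optimized assembly of the two (for example on godbolt) is the simplest way to check that your compiler really does treat them the same.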
Is it a matter of some concept like "ptr adds 2 extra latency"?
Micro-fusion of a memory source operand doesn't lengthen the critical path from the non-constant input to the output. The load uop can start as soon as the address is ready, and for vector constants it's usually either a RIP-relative or [rsp+constant] addressing mode. So usually the load is ready to execute as soon as it's issued into the out-of-order part of the core. Assuming an L1d cache hit (since it will stay hot in cache if loaded every loop iteration), this is only ~5 cycles, so it will easily be ready in time if there's a dependency-chain bottleneck on the vector register input.
It doesn't even hurt front-end throughput. Unless you're bottlenecked on load-port throughput (2 loads per clock on modern x86 CPUs), it typically makes no difference. (Even with highly accurate measurement techniques.)

How to safely implement "Using Uninitialized Memory For Fun And Profit"?

I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.
How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrind or ASAN?
Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?
I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.
Ignore the standard, implement it in the usual way
This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.
The standard makes most uses of uninitialized values undefined, and the few loopholes it does allow (e.g., copying one undefined value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, anything that results is also indeterminate, and indeed the value may appear to change from access to access).
So you just implement it anyway, with the knowledge that most or all current compilers will just compile it to the expected code, and know that your code is not standards compliant.
At least in my test all of gcc, clang and icc didn't take advantage of the illegal access to do anything crazy. Of course, the test is not comprehensive and even if you could construct one, the behavior could change in a new version of the compiler.
You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTCG, i.e., link-time code generation) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.
Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.
If you take this approach, tools like valgrind are fairly likely to report an uninitialized read error.
Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report an uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.
More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.
Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call-site. You could avoid this by making sure the function is not inlined.
Use assembly
What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.
Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful of methods, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example (see footnote 1), looks like:
sparse_array::is_member(unsigned long):
    mov rax, QWORD PTR [rdi+16]
    mov rdx, QWORD PTR [rax+rsi*8]
    xor eax, eax
    cmp rdx, QWORD PTR [rdi]
    jnb .L1
    mov rax, QWORD PTR [rdi+8]
    cmp QWORD PTR [rax+rdx*8], rsi
    sete al
Rely on calloc magic
If you use calloc (see footnote 2), you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), to avoid zeroing out the memory again.
As a practical matter, for large allocations, this is typically satisfied by implementing a call like anonymous mmap by mapping a special zero page of all zeros. Only when (if ever) the memory is written does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be free, since the OS already needs to zero the pages.
In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.
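A minimal sketch of the calloc-backed variant might look like the following. The struct layout and method names are assumptions for illustration (they are not the linked article's or LLVM's code), and error handling for a failed allocation is left out. Only sparse needs to be zeroed: dense is only ever read at indices below n, and those entries have always been written by insert.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical minimal sparse set over the range [0, max_elem).
struct sparse_set {
    std::size_t n = 0;    // number of members
    std::size_t* dense;   // dense[0..n) lists the members
    std::size_t* sparse;  // sparse[i] is the position of i in dense

    explicit sparse_set(std::size_t max_elem)
        : dense(static_cast<std::size_t*>(
              std::malloc(max_elem * sizeof(std::size_t)))),
          // calloc gives defined (zero) contents, possibly at no extra cost
          sparse(static_cast<std::size_t*>(
              std::calloc(max_elem, sizeof(std::size_t)))) {}

    ~sparse_set() {
        std::free(dense);
        std::free(sparse);
    }
    sparse_set(const sparse_set&) = delete;
    sparse_set& operator=(const sparse_set&) = delete;

    bool is_member(std::size_t i) const {
        return sparse[i] < n && dense[sparse[i]] == i;
    }
    void insert(std::size_t i) {
        if (is_member(i)) return;
        dense[n] = i;
        sparse[i] = n++;
    }
    void clear() { n = 0; }  // O(1): stale sparse entries are harmless
};
```

Every read in is_member now touches memory with a defined value, so neither the language rules nor tools like Valgrind have anything to object to.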
Calloc Caveats
You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for optimized calloc is something like this:
calloc(N)
    if (can satisfy a request for N bytes from allocated-then-freed memory)
        memset those bytes to zero and return them
    else
        ask the OS for memory, return it directly because it is zeroed
Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocate from there first. This pool is composed of memory from the last block request from the OS but not handed out (e.g., because the user requested 32 bytes but the allocator asks for chunks from the OS in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.
On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have long-lived sparse_set objects which take up a lot of memory, they likely were allocated by asking the OS and you got the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.
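As a rough sketch of the manual mapping, assuming a POSIX system that provides /dev/zero (the helper names map_zeroed and unmap_zeroed are made up for this example):

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <unistd.h>    // close
#include <cstddef>

// Hypothetical helper: returns `bytes` of zero-filled, writable memory
// obtained directly from the OS, or nullptr on failure.
inline void* map_zeroed(std::size_t bytes) {
    const int fd = open("/dev/zero", O_RDWR);
    if (fd < 0) return nullptr;
    // A private mapping of /dev/zero reads as zeros; pages are only
    // materialized when they are first touched.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : p;
}

// Hypothetical counterpart that releases a region from map_zeroed().
inline void unmap_zeroed(void* p, std::size_t bytes) {
    if (p != nullptr) munmap(p, bytes);
}
```

On systems that support it, an anonymous mapping (MAP_ANONYMOUS with no file descriptor) is a common alternative with the same zero-fill behavior.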
Use Lazy Initialization
For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.
First you choose some granule at which you will track initialization, and use a bitmap where each bit tracks the initialization state of that granule of the sparse array.
For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte = 128 bits, which is an overhead of less than 1% (see footnote 3) in allocated memory.
Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member function now looks like:
bool sparse_set::is_member(size_t i){
    bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
    return init && sparse[i] < n && dense[sparse[i]] == i;
}
The generated assembly on x86 (gcc) now starts with:
    mov rax, QWORD PTR [rdi+24]
    mov rdx, rsi
    shr rdx, 8
    mov rdx, QWORD PTR [rax+rdx*8]
    xor eax, eax
    bt rdx, rsi
    jnc .L2
    ...
.L2:
    ret
That's all associated with the bitmap check. It's all going to be pretty quick (and often off the critical path since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set, and what functions you are calling - is_member is about the worst case for this approach since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and only do it once every INIT_COVERAGE elements (meaning the overhead would be again ~1% for the example values).
Sometimes this approach will be faster than the approach suggested in the OP's link, especially when handling elements not in the set - in this case, the is_init check will fail and often shortcut the remaining code, and in this case you have a working set that is much smaller (256 times using the example granule size) than the size of the sparse array, so you may greatly reduce misses to DRAM or outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.
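Putting the lazy-initialization pieces together, a self-contained sketch could look like this. The class name, the GRANULE constant and the bit layout (one bit per granule, 64 granules per bitmap word) are assumptions chosen for clarity, not the exact layout behind the assembly shown earlier:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical lazily-initialized sparse set over [0, max_elem).
class lazy_sparse_set {
    static constexpr std::size_t GRANULE = 4;  // elements tracked per bit

    std::size_t max_;
    std::size_t n_ = 0;
    std::unique_ptr<std::size_t[]> dense_;   // left uninitialized by new[]
    std::unique_ptr<std::size_t[]> sparse_;  // left uninitialized by new[]
    std::vector<std::uint64_t> is_init_;     // bitmap, starts as all zeros

    bool granule_ready(std::size_t i) const {
        const std::size_t g = i / GRANULE;
        return (is_init_[g / 64] >> (g % 64)) & 1u;
    }

    // Zero the whole granule containing i the first time it is touched.
    void make_ready(std::size_t i) {
        if (granule_ready(i)) return;
        const std::size_t g = i / GRANULE;
        const std::size_t first = g * GRANULE;
        const std::size_t last = std::min(first + GRANULE, max_);
        for (std::size_t k = first; k < last; ++k) sparse_[k] = 0;
        is_init_[g / 64] |= std::uint64_t{1} << (g % 64);
    }

public:
    explicit lazy_sparse_set(std::size_t max_elem)
        : max_(max_elem),
          dense_(new std::size_t[max_elem]),
          sparse_(new std::size_t[max_elem]),
          is_init_((max_elem / GRANULE) / 64 + 1, 0) {}

    bool is_member(std::size_t i) const {
        // sparse_[i] is only read once its granule has been written.
        return granule_ready(i) && sparse_[i] < n_ && dense_[sparse_[i]] == i;
    }
    void insert(std::size_t i) {
        if (is_member(i)) return;
        make_ready(i);
        dense_[n_] = i;
        sparse_[i] = n_++;
    }
    void clear() { n_ = 0; }  // the bitmap stays valid across clear()
    std::size_t size() const { return n_; }
};
```

No element of sparse_ or dense_ is read before it has been written, so this version stays within the language rules while still skipping the up-front initialization of the big arrays.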
Really Lazy Solution
It is worth noting that there is a similarity between the calloc and lazy init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution tracks this implicitly in hardware through MMU magic with page tables and TLB entries, while the lazy init solution does it in software, with a bitmap explicitly tracking which granules have been allocated.
The hardware approach has the advantage of being nearly free (for the "hit" case, anyway) since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size etc.
You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an un-allocated page in a sparse array. You catch the fault and allocate the page in the sparse array with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.
The downsides are:
Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.
1 This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.
2 Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!
3 Precisely, you need 1 extra bit to track every 128 bits of the sparse array.
If we reword your question:
What code reads from uninitialized memory without tripping tools designed to catch reads from uninitialized memory?
Then the answer becomes clear -- it is not possible. Any way of doing this that you could find represents a bug in Valgrind that would be fixed.
Maybe it's possible to get the same performance without UB, but the restrictions you put on your question "I would like to... use the trick... allowing itself to read uninitialized memory" guarantee UB. Any competing method avoiding UB will not be using the trick that you so love.
Valgrind does not complain if you just read uninitialised memory.
Valgrind will complain if you use this data in a way that influences the visible behaviour of the program, e.g. using this data as input in a syscall, or doing a jump based on this data, or using this data to compute another address. See http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.uninitvals for more info.
So, it might very well be that you will have no problem with Valgrind.
If valgrind still complains but your algorithm is correct even when using this uninit data, then you can use client requests to declare this memory as initialised.

Relationship between physical registers and Intel SIMD variables?

What is the relationship between physical processor registers and the variables used in Intel intrinsics (e.g. __m128)?
A diagram explaining SIMD typically shows 2 registers but references on the Intel forums to "register pressure" and in this question to "register coloring" suggest there is more going on.
Can any number of variables representing registers be declared? How can this be when they're closely tied to a finite physical resource? What should one be aware of regarding how physical registers are chosen? What happens if more registers are declared than exist?
Can multiple pairs of registers be active at the same time?
Are there different types of physical registers?
The variable types like __m128, __m128i, __m128d, ... are there mainly to protect you. They ensure that you are not attempting to use standard operators like +, -, &, |, ==, ... and ensure that the compiler will throw an error if you are attempting to assign the wrong types. These types force the compiler to load themselves into the appropriate register (XMM* in this case), but still give the compiler the freedom to choose which one, or store them locally on the stack if all appropriate registers are taken. They also ensure that any time they are stored on the stack, they maintain the correct alignment (16 byte alignment in this case), so that intrinsic instructions that rely on alignment don't cause GPFs.
You may tightly tie one of these variables to a physical register if you like, using the asm constructs (this is GCC's local register variable extension, which requires the register keyword):
register __m128i myXMM1 asm( "xmm1" );
But it's better to just let the compiler do its magic and choose the registers for you to allow better optimization.
Any number of these variables can be declared, and even overbooking your XMM register store might not result in using stack space, as long as your working set of registers remains small. Compiler scoping will usually realize when a value is no longer used and allow the optimizer to not store it back to the stack. Sometimes you can help the compiler out by creating your own scoped stack frame:
__m128i storedVar;
{
    __m128i tempVar1, tempVar2, tempVar3;
    // do some operations with tempVar1 -> 3
    storedVar = tempVar1;
}
{
    __m128i tempVar4, tempVar5, tempVar6, tempVar7, tempVar8;
    // do some operations with tempVar4 -> 8
    storedVar = tempVar4;
}
return storedVar;
Since the variables go out of scope at the closed curly brace, the compiler sees that the registers which used to contain those values are now freed up, so it doesn't need to exceed the total number of available XMM registers.
If you do overbook your register store, and all values need to be maintained, then the compiler will allocate the appropriate size on the stack and ensure that it is properly aligned, and the value of the XMM register will be swapped out to the stack to make room for a new value. Keep in mind that the stack space is well cached, so writes and reads there are not as harmful as you might have expected. The real hit that you take is the necessity of the extra move operations to swap them in and out.
There are different types of physical registers by width (64-bit, 128-bit, 256-bit, 512-bit), obviously associated with the corresponding C/C++ intrinsic data type. The different "flavors" for a given width ("__m128i", "__m128d", ...) can actually all reside in any of the registers of the given width. The type forces you to use the appropriate intrinsic type (_mm_and_si128 vs. _mm_and_pd, for example), which in turn generates an appropriate version of the instruction.
Something like the "and" is a good example because the resulting operation will be identical regardless of type - a bitwise "and". But using the wrong type can incur latency according to what I've read in the Intel docs. The integer instructions and floating point instructions have separate execution queues, and whenever data has to move from one execution queue to the other, there is a penalty. So generally it is good practice to choose the appropriate data type so that the appropriate instructions can be generated, and stay within the realm of that data type.

strict aliasing and memory alignment

I have performance-critical code, and there is a huge function that allocates something like 40 arrays of different sizes on the stack at the beginning of the function. Most of these arrays must have a certain alignment, because they are accessed somewhere else down the chain using CPU instructions that require memory alignment (for Intel and ARM CPUs).
Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like that to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).
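The round-up arithmetic used for history above can be wrapped in a small helper. This is only a sketch of the pointer math: align_up is a made-up name, the alignment is assumed to be a power of two, and it does nothing about the aliasing questions discussed below:

```cpp
#include <cstdint>

// Hypothetical helper: rounds p up to the next multiple of `alignment`
// (which must be a power of two). The caller must have over-allocated
// the buffer by at least alignment - 1 bytes.
template <typename T>
inline T* align_up(T* p, std::uintptr_t alignment) {
    const std::uintptr_t mask = alignment - 1;
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<T*>((addr + mask) & ~mask);
}
```

With this, the declaration above becomes short * history = align_up(history_, 32);. Rounding up skips at most 31 bytes, so 31 bytes of padding are enough; the + 32 elements in the snippet above are more than strictly needed for an array of short.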
So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:
struct tmp
{
    short history[HIST_SIZE];
    short history2[2*HIST_SIZE];
    ...
    int energy[320];
    ...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced really good result and manual inspection of generated assembly proves that generated code is more or less adequate and acceptable. Build system was updated to use newer GCC and suddenly we started to have some artifacts in generated data (e.g. output from validation test suite is not bit exact anymore even in pure C build with disabled asm code). It took long time to debug the issue and it appeared to be related to aliasing rules and newer versions of GCC.
So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?
Edit:
I used this code on multiple OSes/compilers, but started to have issues when I switched to a newer NDK, which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d).
I mention 32-byte alignment. If it hurts your eyes, substitute any other number that you like, for example 666 if it helps. There is absolutely no point in even mentioning that most architectures don't need that alignment. If I align 8 KB of local arrays on the stack, I lose 15 bytes for 16-byte alignment and 31 bytes for 32-byte alignment. I hope it's clear what I mean.
I say that there are like 40 arrays on the stack in performance critical code. I probably also need to say that it's a third party old code that has been working well and I don't want to mess with it. No need to say if it's good or bad, no point for that.
This code/function has well tested and defined behavior. We have exact numbers for the requirements of that code, e.g. it allocates X KB of RAM, uses Y KB of static tables, and consumes up to Z KB of stack space, and that cannot change, since the code won't be changed.
By saying that the "alignment mess confuses the optimizer" I mean that if I try to align each array separately, the optimizer allocates extra registers for the alignment code, and the performance critical parts of the code suddenly don't have enough registers and start spilling to the stack instead, which slows the code down. This behavior was observed on ARM CPUs (I'm not worried about Intel at all, by the way).
By artifacts I meant that the output becomes non-bitexact, there is some noise added. Either because of this type aliasing issue or there is some bug in the compiler that results eventually in wrong output from the function.
In short, the point of the question: how can I allocate an arbitrary amount of stack space (using char arrays or alloca), then align a pointer to that stack space and reinterpret this chunk of memory as a structure with a well-defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly? I've tried to cast the memory using all kinds of approaches, and I moved the big stack allocation to a separate function, yet I still get bad output and stack corruption. I'm starting to think more and more that this huge function hits some kind of bug in GCC. It's quite strange that I can't get this cast to work no matter what I try. By the way, I disabled all optimizations that require any alignment; it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack-corruption crashes). The simple fix that fixes it all is to write, instead of:
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
this code:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
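For comparison, here is a minimal sketch of one way to keep the working `tmp buf;` form and still get the alignment: put the alignment request on the object itself, so no char-buffer cast is needed. It assumes a compiler that honors over-aligned automatic variables (C++11 `alignas`; GCC's `__attribute__((aligned(32)))` is the older spelling). `HIST_SIZE` and the cut-down `tmp` are placeholders, since the real definitions are not shown, and whether a given NDK toolchain realigns the stack for this is something to verify, not something this sketch proves.

```cpp
#include <cstdint>

// Placeholder value; the real HIST_SIZE is not given in the question.
constexpr int HIST_SIZE = 48;

// Cut-down stand-in for the question's struct. Assuming 2-byte short and
// 4-byte int, every member size is a multiple of 32 bytes, so each array
// starts on a 32-byte boundary whenever the struct itself does.
struct tmp {
    short history[HIST_SIZE];       // 96 bytes
    short history2[2 * HIST_SIZE];  // 192 bytes
    int   energy[320];              // 1280 bytes
};

// True if p is aligned to `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// The object really is a `tmp`, so the pointer is not the result of
// reinterpreting a char array.
inline bool aligned_struct_on_stack() {
    alignas(32) tmp buf;
    tmp* X = &buf;
    X->history[0] = 1;  // touch the object so it is genuinely used
    return is_aligned(X, 32) && is_aligned(X->history2, 32) &&
           is_aligned(X->energy, 32) && X->history[0] == 1;
}
```

The alignment arithmetic happens once, in the function prologue, instead of once per array, which is the property the jumbo-struct approach was after.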
Interesting observation:
I mentioned that this approach works well and produces expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then I get the same bad result! WOW, what else can I say? Perhaps, based on the strict-aliasing rule, GCC assumes that tmp * X isn't related to tmp buf and removes tmp buf as an unused variable right after the return from to_struct_tmp? Or it does something strange that produces an unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
First I'd like to say I'm definitely with you when you ask not to buzz about "standard violation", "implementation-dependent" and etc. Your question is absolutely legitimate IMHO.
Your approach to pack all the arrays within one struct also makes sense, that's what I'd do.
It's unclear from the question which "artifacts" you observe. Is there any unneeded code generated? Or data misalignment? If the latter is the case, you may (hopefully) use things like STATIC_ASSERT to ensure at compile time that things are aligned properly. Or at least have some run-time ASSERT in the debug build.
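A rough sketch of what such checks could look like, using C++11 `static_assert` with `offsetof` for the compile-time part and a small helper for the run-time part. The struct is a cut-down placeholder for the one in the question, and `HIST_SIZE` is an assumed value.

```cpp
#include <cstddef>
#include <cstdint>

// Placeholder value; the real HIST_SIZE is not given in the question.
constexpr int HIST_SIZE = 48;

// Cut-down stand-in for the question's struct.
struct tmp {
    short history[HIST_SIZE];
    short history2[2 * HIST_SIZE];
    int   energy[320];
};

// Compile-time checks: every array must sit at a 32-byte multiple inside the
// struct, and the struct size must keep that property for whatever follows.
static_assert(offsetof(tmp, history)  % 32 == 0, "history is not on a 32-byte offset");
static_assert(offsetof(tmp, history2) % 32 == 0, "history2 is not on a 32-byte offset");
static_assert(offsetof(tmp, energy)   % 32 == 0, "energy is not on a 32-byte offset");
static_assert(sizeof(tmp) % 32 == 0, "sizeof(tmp) is not a multiple of 32");

// Run-time check for the base pointer, intended for a debug-build ASSERT.
inline bool is_aligned_32(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 31u) == 0;
}
```

The compile-time part catches padding mistakes in the struct; the run-time part catches a base pointer that was not rounded up correctly.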
As Eric Postpischil proposed, you may consider declaring this structure as global (if this is applicable for the case, I mean multi-threading and recursion are not an option).
One more point that I'd like to mention is so-called stack probes. When you allocate a lot of memory from the stack in a single function (more than 1 page, to be exact), on some platforms (such as Win32) the compiler adds extra initialization code, known as stack probes. This may also have some performance impact (though it is likely to be minor).
Also, if you don't need all the 40 arrays simultaneously you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs will be grouped into union.
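As an illustration of the union idea, here is a small sketch. The phase names and array sizes are invented for the example; the only assumption is that the two groups of arrays are never needed at the same time.

```cpp
#include <cstddef>

// Hypothetical scratch arrays used only during a first phase.
struct AnalysisScratch {
    int   energy[320];
    short window[128];
};

// Hypothetical scratch arrays used only during a second phase.
struct SynthesisScratch {
    int spectrum[256];
};

// One big struct: arrays that live for the whole call stay as plain members,
// while phase-local arrays overlap in a union.
struct Scratch {
    short history[64];  // needed throughout
    union {
        AnalysisScratch  analysis;   // valid only during phase 1
        SynthesisScratch synthesis;  // valid only during phase 2
    } phase;
};

// The union costs roughly the larger of its members, not the sum of both.
static_assert(sizeof(Scratch) <
                  sizeof(short[64]) + sizeof(AnalysisScratch) + sizeof(SynthesisScratch),
              "overlapping the phases should shrink the footprint");
```

The price is discipline: data written through one union member must not be expected to survive once the other member is in use.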
There are a number of issues here.
Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)
Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.
Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.
Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.
Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?
In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.
If you have other concerns about optimization and register use, you should be more specific about them.
Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.
Just disable alias-based optimization and call it a day
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)
32-byte alignment sounds as if you are pushing things too far. No CPU instruction should require an alignment as large as that. Basically, an alignment as wide as the largest data type of your architecture should suffice.
C11 has the concept of max_align_t, which is a dummy type of maximum alignment for the architecture. If your compiler doesn't have it yet, you can easily simulate it with something like
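union maxalign0 {
long double a;
long long b;
... perhaps a 128 integer type here ...
};
typedef union maxalign1 maxalign1;
union maxalign1 {
unsigned char bytes[sizeof(union maxalign0)]; /* first member, so it is the one that gets initialized */
union maxalign0 align; /* named member; present only to force maximal alignment */
};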
Now you have a data type that has the maximal alignment of your platform and, because bytes is its first member, has all bytes set to 0 whenever it is zero-initialized.
maxalign1 history_[someSize];
short * history = (short *)history_[0].bytes;
This avoids the awful address computations that you currently do; you'd only have to adapt someSize to take into account that you always allocate multiples of sizeof(maxalign1).
Also be assured that this has no aliasing problems. First of all, unions in C are made for this, and then character pointers (of any version) are always allowed to alias any other pointer.

Optimizing member variable order in C++

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.
Two issues here:
Whether and when keeping certain fields together is an optimization.
How to actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
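To make the gotcha concrete, here is a layout-only sketch (no threads) that contrasts two atomic counters packed next to each other with the same counters forced onto separate cache lines. The 64-byte line size is an assumption that holds on many current CPUs but not all, and the type and function names are made up for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; check the target CPU before relying on it.
constexpr std::size_t kCacheLine = 64;

// Two counters side by side: both normally land on the same cache line, so
// threads updating a and b separately would still contend for that line.
struct PackedCounters {
    std::atomic<int> a{0};
    std::atomic<int> b{0};
};

// The same counters, each forced onto its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<int> a{0};
    alignas(kCacheLine) std::atomic<int> b{0};
};

static_assert(sizeof(PaddedCounters) == 2 * kCacheLine,
              "each padded counter should occupy a full cache line");

// True if both addresses fall into the same kCacheLine-sized block.
inline bool share_cache_line(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}
```

The padded version trades memory (128 bytes instead of 8 here) for independence between cores, which is the opposite of the "pack hot fields together" advice and only pays off when different threads really do hammer different fields.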
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
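The effect is easy to check with `sizeof` and `offsetof`. The sketch below declares both orderings; the struct names are invented for the example, and the 16 and 12 byte figures in the comments assume a 4-byte, 4-aligned int, as in the text.

```cpp
#include <cstddef>

// int, char, int, char: padding after each char (typically 16 bytes in total).
struct Interleaved {
    int  a;
    char b;
    int  c;
    char d;
};

// int, int, char, char: only tail padding (typically 12 bytes in total).
struct Grouped {
    int  a;
    int  c;
    char b;
    char d;
};

// Bytes of actual data in either struct.
constexpr std::size_t kPayload = 2 * sizeof(int) + 2 * sizeof(char);

// Padding is whatever sizeof reports beyond the payload.
constexpr std::size_t kInterleavedPadding = sizeof(Interleaved) - kPayload;
constexpr std::size_t kGroupedPadding     = sizeof(Grouped) - kPayload;

static_assert(sizeof(Grouped) <= sizeof(Interleaved),
              "grouping by alignment should never make the struct bigger");
```

Printing `sizeof` for both, or `kInterleavedPadding` and `kGroupedPadding`, on the actual target is the quickest way to confirm the numbers instead of guessing.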
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with this:
struct some_stuff {
double d; // I expect double is 64bit IEEE, it might not be
uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
uint32_t i; // 4 bytes, usually 4-aligned
int32_t j; // same
short s; // usually 2 bytes, could be 2-aligned or unaligned, I don't know
char c[4]; // array 4 chars, 4 bytes big but "never" needs 4-alignment
char e; // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutter's articles on the subject.
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottleneck instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31;
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
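The three ranges above all follow from one piece of arithmetic: the 5-bit immediate holds 0 to 31 and is scaled by the access size. A small sketch of that calculation (the function names are made up for the example):

```cpp
#include <cstddef>

// Largest byte offset a single load can encode when the 5-bit immediate
// (0..31) is scaled by the access size: 1 for bytes, 2 for halfwords,
// 4 for words.
constexpr std::size_t max_thumb_offset(std::size_t access_size) {
    return 31 * access_size;
}

// True if a field at `offset` bytes from "this" can be loaded with one such
// instruction: the offset must be a multiple of the access size and in range.
constexpr bool reachable_in_one_load(std::size_t offset, std::size_t access_size) {
    return offset % access_size == 0 && offset <= max_thumb_offset(access_size);
}

static_assert(max_thumb_offset(1) == 31,  "byte loads reach this+31");
static_assert(max_thumb_offset(2) == 62,  "halfword loads reach this+62");
static_assert(max_thumb_offset(4) == 124, "word loads reach this+124");
```

This also shows why "smallest first" helps: a byte member placed at offset 40 is already out of single-instruction range, while a word member at the same offset is not.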
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
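One way to keep such a layout from silently regressing is to assert the offsets of the hot fields at compile time. In the sketch below the struct and its members are hypothetical, and the 124-byte limit is the Thumb word-load figure quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Limit for single-instruction word loads, taken from the text above.
constexpr std::size_t kMaxWordOffset = 124;

// Hypothetical structure: frequently used fields first, bulky and rarely
// used ones after them.
struct Connection {
    // hot: touched on every packet
    std::uint32_t state;
    std::uint32_t bytes_pending;
    std::uint8_t* cursor;
    // cold: touched only on setup and teardown
    char          peer_name[256];
    std::uint32_t created_at;
};

// If someone later inserts a large member in front of the hot fields, the
// build fails instead of the code quietly getting slower.
static_assert(offsetof(Connection, state)         <= kMaxWordOffset, "state moved out of range");
static_assert(offsetof(Connection, bytes_pending) <= kMaxWordOffset, "bytes_pending moved out of range");
static_assert(offsetof(Connection, cursor)        <= kMaxWordOffset, "cursor moved out of range");
```

The cold `created_at` field sits past the 256-byte name buffer, so it is the one that pays for the extra address computation.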
Well the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the members is determined by the compiler unless you apply the [StructLayout(LayoutKind.Sequential)] or [StructLayout(LayoutKind.Explicit)] attribute, which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize padding while aligning the data types on their natural boundaries (i.e. 4-byte ints start on 4-byte addresses).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
Big mess-up? Without alignment switches, low-memory options and the like, we're going to have one unsigned char occupying a 64-bit word on your DDR3 DIMM, another 64-bit word for the other one, and then the unavoidable one for the long.
So, that's a fetch for each variable.
However, packing or re-ordering them means one fetch plus one AND mask to get at each unsigned char.
So speed-wise, on a current machine with 64-bit memory words, alignment switches, reorderings, etc. are no-nos. I do microcontroller stuff, and there the differences between packed and non-packed are really noticeable (talking about <10 MIPS processors with 8-bit memory words).
On the side, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm dictates and what the compiler is able to optimize, often results in burning rubber with no real effect. That, and a write-only piece of syntactically dubious code.
The last step-forward in optimization I saw (in uPs, don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (much more general view of speed/pointer resolution/memory packing, etc), and have the linker trash non-called library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing on CPU performance; maybe on readability. You can optimize the executable code if the commonly executed basic blocks that are executed within a given frame are in the same set of pages. This is the same idea, but I would not know how to create basic blocks within the code. My guess is that the compiler puts the functions in the order it sees them, with no optimization here, so you could try to place common functionality together.
Try and run a profiler/optimizer. First you compile with some profiling option then run your program. Once the profiled exe is complete it will dump some profiled information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years, but not much has changed in how they work.