I am writing some code on Linux, in C++ where I create a large char array for byte processing. After doing some reading I was wondering whether I should align the array on a 16 byte boundary, apparently this can allow the CPU to take advantage of SSE?
If so, how can I tell the GCC compiler where I wish the array to be aligned?
Memory alignment doesn't directly cause GCC to generate SSE code. If you really want GCC to generate SSE code, you should use at least one of the following:
GCC machine options such as -msse and -msse2 (note that -mtune alone does not enable SSE code generation).
Assembly, or Inline Assembly
GCC Vector Extensions
With option 1, whether SSE instructions are generated still depends on the compiler, while with options 2 and 3, SSE instructions are guaranteed to be generated.
Since SSE operates on the 128-bit XMM registers, many SSE instructions require their memory operands to be 16-byte aligned. You can use the GCC type attribute __attribute__ ((aligned (N))) on your type definition to ensure that.
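For instance (a minimal sketch; the buffer size is arbitrary):

// Variable attribute: a 16-byte-aligned array, safe for aligned SSE loads/stores
char buffer[1024] __attribute__ ((aligned (16)));

// Or on a type definition, so every instance gets the alignment
struct pixel_block {
    unsigned char data[64];
} __attribute__ ((aligned (16)));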
NOTE: Memory alignment pays off not only through the potential use of SSE instructions, but also through atomic instructions and efficient cache operation. On many platforms, an instruction is atomic only when it accesses memory aligned to its operand size. Meanwhile, the cache is organized in fixed-size lines mapped onto memory, and an access that crosses a cache-line boundary costs one extra cache access.
ALSO NOTE: malloc only guarantees a pointer that is suitably aligned for any built-in type (see the malloc man page). For statics and locals of your own struct types, use the GCC type attribute __attribute__ ((aligned (N))) mentioned above; for heap allocations with stricter alignment, you need an aligned allocator.
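On Linux, posix_memalign is one option for that (a sketch; the 4096-byte size is arbitrary):

#include <stdlib.h>

void *p = NULL;
// 16-byte alignment; returns non-zero on failure instead of setting errno
if (posix_memalign(&p, 16, 4096) == 0) {
    // ... use p for SSE-friendly byte processing ...
    free(p);
}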
As a general rule you should use a vector instead of an array. This would also solve the issue of alignment.
Related
I have a C++ program I'm compiling for AMD64. Of course, different processors, despite being AMD64, support different features and instructions because they implement different microarchitectures. An easy way to optimise the program for one's own machine is to just use -march=native in Clang or GCC, but this isn't very portable for distribution's sake. A more portable solution would be to pick and choose specific target features.
This obviously affects performance (some processors support AVX-512, some don't, some support AVX2, some don't, etc.), but can this affect memory usage (heap/stack, not code size) in any significant way?
Different alignment rules or type widths are the two main ways you could get a difference, but -march= doesn't change that, not when compiling for the same ABI on the same ISA. (Otherwise -march=skylake-avx512 code couldn't call -march=sandybridge code and vice versa, if they disagreed on struct layouts.)
Compiling for a different ABI can save space especially in pointer-heavy data structures. Specifically an ILP32 ABI such as Linux x32 has 4 byte pointers instead of 8, so struct foo { foo *next; int val; }; is 8 bytes instead of 16 (after padding to make sizeof(foo) a multiple of the alignof(foo) it inherits from pointers needing 8-byte alignment). But that won't work for your use-case of 100GB of data; 32-bit pointers limit you to 4GiB of address space.
-march= could have some small effect on stack space when auto-vectorizing. e.g. a function might align the stack by 64 in order to spill/reload a ZMM vector.
Or with older GCC, align even if the final asm doesn't actually store or load any vectors to the stack frame. But that's at most an extra 56 bytes of wasted stack space per level of function nesting, vs. 16-byte alignment which can be had for free as part of the calling convention.
GCC / clang's optimizers won't AFAIK do any optimizations that change the size of dynamic allocations. Clang can sometimes optimize away a dynamic allocation entirely in a function that for example creates and destroys a std::vector<float> foo(100); and all accesses to it can be optimized away. (e.g. store constants into the vector and then read them back, it can just optimize that away then eliminate the allocation, too. Or a std::vector that isn't even used.)
Possibly a different allocator library that's better at reducing internal fragmentation could save space, if you end up with some memory pages allocated but not fully used. But that's not something -march= affects.
Before I ask my question, just a little background information.
In C-family languages, when you assign to a variable, you can conceptually assume you have just modified a little piece of memory in RAM.
int a = rand(); // conceptually, you created and assigned variable a in RAM
In assembly language, to do the same thing, you essentially need the result of rand() stored in a register, and a pointer to a. You would then do a store instruction to get the register contents into RAM.
When you program in C++, for example, you assign and manipulate value-type objects without usually having to think about their addresses, or how and when they will be stored in registers.
Using SSE intrinsics is strange because they appear to sit somewhere in between coding in C and coding in assembly, in terms of the conceptual memory model.
You can call load/store functions, and they return objects. A math operation like _mm_add_ps will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call _mm_store.
Consider the following example:
inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);

    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));

    // store the results
    _mm_storeu_ps(y, result);
}
There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntactic sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store command at the end, you just kept the result: would the result then be more than syntactic sugar, and actually hold data?
TL:DR: How am I supposed to think about memory when using SSE intrinsics?
An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e. 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.
Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.
Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).
gcc -O0 actually does tend to store __m128i variables to RAM when you set them, instead of keeping them only in registers, IIRC.
You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)
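A sketch of that load/store-free style (GCC and Clang define __m128 with a may_alias attribute, so the cast is allowed; p is assumed 16-byte aligned, and the dereference implies an aligned access):

#include <xmmintrin.h>

void scale4(float *p, __m128 k)
{
    __m128 *v = (__m128 *)p;   // no load/store intrinsics anywhere
    *v = _mm_mul_ps(*v, k);    // the compiler picks the actual loads/stores
}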
Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.
The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.
Anyway, using load instead of loadu tells the compiler that it may fold the load into a memory operand for another instruction, even if it can't otherwise prove that the pointer is aligned.
Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.
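A sketch of the difference (hypothetical helper functions, compiled without AVX): the aligned-load version can compile to a single mulps with a memory operand, while the loadu version needs a separate movups first.

#include <xmmintrin.h>

__m128 scale_aligned(const float *p, __m128 k)   // p assumed 16-byte aligned
{
    return _mm_mul_ps(_mm_load_ps(p), k);        // e.g. mulps (%rdi), %xmm0
}

__m128 scale_unaligned(const float *p, __m128 k)
{
    return _mm_mul_ps(_mm_loadu_ps(p), k);       // movups + mulps
}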
This came up in comments on How to specify alignment with _mm_mul_ps.
The store intrinsics apparently have two purposes:
To tell the compiler whether it should use the aligned or unaligned asm instruction.
To remove the need for a cast from double * to __m128d * (doesn't apply to the integer case, where the load/store intrinsics take __m128i * anyway).
Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.
We know that the value of a pointer to data should be properly aligned. For example, the value of a pointer to double should be a multiple of 8. So I'm wondering whether a pointer to function has similar requirements.
Alignment of both data and code is highly machine dependent.
On many processors, reading, for example, a double at an unaligned address will cause a fault (hardware exception, trap, or whatever you want to call it) - this is either handled in software [slow, often 10-1000x slower than aligned access] or causes the application performing the operation to fail (similar to accessing invalid memory locations in a modern OS). On x86, for example, it will be slower but typically not fail, because the processor will, at least in some cases, have to do two smaller read operations and combine them before it gets the value of the double.
Code may have alignment as well. Most RISC processors have fixed-size code words - 4 bytes being a common size - and they should be aligned to that size. ARM in "thumb" mode uses a 2-byte instruction size, with some instructions carrying extra data in a following word.
On the other hand, x86 has a "single byte" alignment requirement, and the 68K for example would require code to be aligned to 2 bytes only. So in that respect, the alignment requirement will vary. Beyond that, there are efficiency reasons to prefer a certain alignment - for example, starting functions/branches at 8-, 16- or 32-byte boundaries is often beneficial, and I know that some older x86 processors had limits on "how many branch predictions there could be for a given N bytes of code" - meaning that if you have many different branches in a short piece of code, some would have to go without branch prediction, because the "slots" for that location were already full.
So, compilers will (sometimes) pad code to align functions for performance reasons. However, this is not ALWAYS a win - it wastes cache-space with "padding", and it really depends on how the code is used. Compilers typically know this, at least if you use feedback/profile based optimisations (where the code is run with instrumentation to count how the code is used, and the optimisation is based on the results of this).
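GCC exposes these padding trade-offs directly if you want to experiment; the values here are arbitrary examples, and the defaults vary by target:

gcc -O2 -falign-functions=32 -falign-loops=16 foo.c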
As a rule, however, function pointers can point anywhere that is a legal address for "code", so the fundamental requirement is typically 1, 2 or 4 bytes, based on the architecture of the processor itself.
I have performance-critical code, and there is a huge function that allocates something like 40 arrays of different sizes on the stack at the beginning of the function. Most of these arrays have to have a certain alignment, because they are accessed somewhere else down the chain using CPU instructions that require memory alignment (on Intel and ARM CPUs).
Since some versions of gcc simply fail to align stack variables properly (notably for ARM code), or sometimes even claim that the maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like that to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on a 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of the code is really CPU intensive and I simply cannot apply this alignment technique to each of the arrays (the alignment mess confuses the optimizer and different register allocation slows down the function big time; see the explanation at the end of the question).
So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:
struct tmp
{
    short history[HIST_SIZE];
    short history2[2*HIST_SIZE];
    ...
    int energy[320];
    ...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced a really good result, and manual inspection of the generated assembly showed that the code is more or less adequate and acceptable. Then the build system was updated to a newer GCC, and suddenly we started to see artifacts in the generated data (e.g. the output from the validation test suite is no longer bit-exact, even in a pure C build with the asm code disabled). It took a long time to debug the issue, and it appears to be related to aliasing rules and newer versions of GCC.
So, how can I get it done? Please don't waste time explaining that it's not standard, not portable, undefined, etc. (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC to fix the issue, but not refactoring the code)... basically, all I want is to apply some black-magic spell so that newer GCC produces functionally identical code for this kind of construct without disabling optimizations.
Edit:
I used this code on multiple OSes/compilers, but started to have issues when I switched to a newer NDK, which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d).
I mention 32-byte alignment. If it hurts your eyes, substitute any other number that you like, for example 666 if it helps. There is absolutely no point in even mentioning that most architectures don't need that alignment. If I align 8KB of local arrays on the stack, I lose 15 bytes for 16-byte alignment and I lose 31 for 32-byte alignment. I hope it's clear what I mean.
I say that there are something like 40 arrays on the stack in performance-critical code. I probably also need to say that it's old third-party code that has been working well, and I don't want to mess with it. No need to say whether it's good or bad; there's no point in that.
This code/function has well-tested and well-defined behavior. We have exact numbers for the requirements of that code, e.g. it allocates X KB of RAM, uses Y KB of static tables, and consumes up to Z KB of stack space, and these cannot change, since the code won't be changed.
By saying that the "alignment mess confuses the optimizer" I mean that if I try to align each array separately, the optimizer allocates extra registers for the alignment code, and then the performance-critical parts of the code suddenly don't have enough registers and start spilling to the stack instead, which slows the code down. This behavior was observed on ARM CPUs (I'm not worried about Intel at all, by the way).
By artifacts I mean that the output becomes non-bit-exact; some noise is added. Either it's this type-aliasing issue, or there is some bug in the compiler that eventually results in wrong output from the function.
In short, the point of the question: how can I allocate an arbitrary amount of stack space (using a char array or alloca), align a pointer into that space, and reinterpret the chunk of memory as a structure with a well-defined layout that guarantees the alignment of certain members, as long as the structure itself is aligned properly? I've tried casting the memory using all kinds of approaches, and I've moved the big stack allocation to a separate function, yet I still get bad output and stack corruption; I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange that, by doing this cast, I can't get this thing done no matter what I try. By the way, I disabled all optimizations that require any alignment; it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack-corruption crashes). The simple fix that fixes it all: instead of
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
this code:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
Interesting observation:
I mentioned that this approach works well and produces expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
    return (struct tmp *)buffer32;
}
Initially, I thought that casting the allocated memory through to_struct_tmp would trick gcc into producing the results I expected, yet it still produces invalid output. If I try to modify the working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then I get the same bad result! Wow, what else can I say? Perhaps, based on the strict-aliasing rules, gcc assumes that tmp * X isn't related to tmp buf and removes buf as an unused variable right after the return from to_struct_tmp? Or it does something strange that produces an unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so somehow the aliasing rules affect code generation big time.
Conclusion:
After all kinds of testing, I have an idea why I possibly can't get it to work no matter what I try. Based on strict type aliasing, GCC decides that the local char array is unused and therefore doesn't allocate stack space for it. Then other local variables that also use the stack get written to the same location where my tmp struct is stored; in other words, my jumbo struct shares its stack memory with other variables of the function. Only this could explain why it always produces the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
First I'd like to say I'm definitely with you when you ask not to buzz about "standard violation", "implementation-dependent" and etc. Your question is absolutely legitimate IMHO.
Your approach to pack all the arrays within one struct also makes sense, that's what I'd do.
It's unclear from the question formulation which "artifacts" you observe. Is there unneeded code generated? Or data misalignment? If the latter is the case, you may (hopefully) use things like STATIC_ASSERT to ensure at compile time that things are aligned properly, or at least have some run-time ASSERT in debug builds.
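For instance, with C++11 (or a pre-C++11 STATIC_ASSERT macro) you could check the offsets inside the question's jumbo struct at compile time (member names taken from the question):

#include <cstddef>

static_assert(offsetof(tmp, history) % 32 == 0, "history not 32-byte aligned");
static_assert(offsetof(tmp, energy)  % 32 == 0, "energy not 32-byte aligned");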
As Eric Postpischil proposed, you may consider declaring this structure as global (if this is applicable for the case, I mean multi-threading and recursion are not an option).
One more point that I'd like to note is so-called stack probes. When you allocate a lot of memory from the stack in a single function (more than 1 page, to be exact), on some platforms (such as Win32) the compiler adds extra initialization code, known as stack probes. This may also have some performance impact (though it is likely to be minor).
Also, if you don't need all the 40 arrays simultaneously, you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs are grouped into a union, as sketched below.
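A sketch of that layout, assuming (hypothetically) that scratch and energy are never live at the same time:

struct tmp
{
    short history[HIST_SIZE];
    union
    {
        short scratch[2 * HIST_SIZE];  // live only in the first phase
        int   energy[320];             // live only in the second phase
    } overlap;
    // ...
};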
There are a number of issues here.
Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)
Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.
Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.
Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.
Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?
In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.
If you have other concerns about optimization and register use, you should be more specific about them.
Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.
Just disable alias-based optimization and call it a day
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
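For example, assuming a typical GCC invocation:

g++ -O2 -fno-strict-aliasing -c big_function.cpp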
Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)
32-byte alignment sounds as if you are pushing it too far. No CPU instruction should require an alignment as large as that. Basically, an alignment as wide as the largest data type of your architecture should suffice.
C11 has the concept of max_align_t, which is a dummy type of maximum alignment for the architecture. If your compiler doesn't have it yet, you can easily simulate it with something like
union maxalign0 {
  long double a;
  long long b;
  /* ... perhaps a 128-bit integer type here ... */
};

typedef union maxalign1 maxalign1;
union maxalign1 {
  unsigned char bytes[sizeof(union maxalign0)];
  union maxalign0 align;  /* named member so this compiles as plain C */
};
Now you have a data type that has the maximal alignment of your platform and that is default initialized with all bytes set to 0.
maxalign1 history_[someSize];
short * history = (short *)history_[0].bytes;
This avoids the awful address computations that you do currently; you'd only have to do some adaptation of someSize to take into account that you always allocate multiples of sizeof(maxalign1).
Also, be assured that this has no aliasing problems. First of all, unions in C are made for this, and second, character pointers (plain, signed, or unsigned) are always allowed to alias any other pointer.
Does struct member alignment in VC bring a performance benefit? If so, what is the performance implication of using it, and which packing size is best for current CPU architectures (x86_64, SSE2+, ...)?
Perf takes a nose-dive on x86 and x64 cores when a member straddles a cache line boundary. The common compiler default is 8 byte packing which ensures you're okay on long long, double and 64-bit pointer members.
SSE2 instructions require an alignment of 16; the code will bomb if it is off. You cannot get that out of a packing pragma; the heap allocator, for example, only provides an 8-byte alignment guarantee. Find out what your compiler and CRT support: something like __declspec(align(16)) and a custom allocator like _aligned_malloc(), or over-allocate the memory and tweak the pointer yourself.
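A sketch of both approaches on MSVC (the allocation size is arbitrary):

#include <malloc.h>

// Custom allocator: 16-byte-aligned heap block for SSE2 data
float *p = (float *)_aligned_malloc(1024 * sizeof(float), 16);
if (p) {
    // ... SSE2 work on p ...
    _aligned_free(p);
}

// Compile-time alignment on a type
__declspec(align(16)) struct Block { float v[4]; };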
The default alignment used by the compiler should be appropriate for the target platform (32- or 64-bit Intel/AMD) for general data. To take advantage of SIMD, you might have to use a more restrictive alignment on those arrays, but that's usually done with a #pragma or special data type that applies just to the data you'll be using in the SIMD instructions.