Will std::array template instances occupy more code memory? - c++

I have a micro-controller that does not have an MMU, but we are using C and C++.
We are avoiding all dynamic memory usage (i.e. no new SomeClass() or malloc()) and most of the standard library.
Semi-Question 0:
From what I understand, std::array does not use any dynamic memory, so its usage should be OK (it lives on the stack only). Looking at the std::array source code, it looks fine, since it creates a C-style array and then wraps functionality around that array.
The chip we are using has 1MB of flash memory for storing code.
Question 1:
I am worried that the use of templates in std::array will cause the binary to be larger, which will then potentially cause the binary to exceed the 1MB code memory limit.
I think if you create an instance of a std::array< int, 5 >, then all calls to functions on that std::array will occupy a certain amount of code memory, let's say X bytes.
If you create another instance of std::array< SomeObject, 5 > and then call functions on that std::array, will each of those functions now be duplicated in the binary, thus taking up more code memory? X bytes + Y bytes.
If so, do you think the amount of code generated given the limited code memory capacity will be a concern?
Question 2:
In the above example, if you created a second std::array< int, 10 > instance, would calls to its functions also be duplicated in the generated code, even though both instances have the same element type, int?

std::array is considered a zero-cost abstraction, which means it should be readily optimizable by the compiler.
As with any zero-cost abstraction, it may add a small compile-time penalty, and if the optimizations required for it to be truly zero-cost are not performed, it may incur a small size or runtime penalty.
However, note that compilers are free to add padding at the end of a struct. Since std::array is a struct, you should check how your platform handles it, but I highly doubt it will be an issue for you.
Take this raw-array and std::array case. First the raw-array version:
#include <numeric>
#include <iterator>
template<std::size_t n>
int stuff(const int(&arr)[n]) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
int arr[] = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
And the std::array version:
#include <numeric>
#include <iterator>
#include <array>
template<std::size_t n>
int stuff(const std::array<int, n>& arr) {
return std::accumulate(std::begin(arr), std::end(arr), 0);
}
int main() {
std::array arr = {1, 2, 3, 4, 5, 6};
return stuff(arr);
}
Clang handles this case very well: the std::array and raw-array versions compile to the same code:
-O2 / -O3 both array and std::array with clang:
main: # #main
mov eax, 21
ret
However, GCC seems to have trouble optimizing it, for both the std::array and the raw-array case:
-O3 with GCC for array and std::array:
main:
movdqa xmm0, XMMWORD PTR .LC0[rip]
movaps XMMWORD PTR [rsp-40], xmm0
mov edx, DWORD PTR [rsp-32]
mov eax, DWORD PTR [rsp-28]
lea eax, [rdx+14+rax]
ret
.LC0:
.long 1
.long 2
.long 3
.long 4
Then, it seems to optimize better with -O2 in the raw-array case but still fails with std::array:
-O2 GCC std::array:
main:
movabs rax, 8589934593
lea rdx, [rsp-40]
mov ecx, 1
mov QWORD PTR [rsp-40], rax
movabs rax, 17179869187
mov QWORD PTR [rsp-32], rax
movabs rax, 25769803781
lea rsi, [rdx+24]
mov QWORD PTR [rsp-24], rax
xor eax, eax
jmp .L3
.L5:
mov ecx, DWORD PTR [rdx]
.L3:
add rdx, 4
add eax, ecx
cmp rdx, rsi
jne .L5
rep ret
-O2 GCC raw array:
main:
mov eax, 21
ret
It seems that the GCC bug of failing to optimize at -O3 while succeeding at -O2 is fixed in the most recent build.
Here's a compiler explorer with all the -O2 and -O3 cases.
With all these cases stated, you can see a common pattern: no trace of std::array appears in the binary. There are no constructors, no operator[], no iterators, no algorithms. Everything is inlined. Compilers are good at inlining simple functions, and std::array member functions are usually very, very simple.
If you create another instance of std::array< SomeObject, 5 >, then call functions to that std::array, will each of those functions now be duplicated in the binary, thus taking up more flash memory? X bytes of memory + Y bytes of memory.
Well, you changed the data type your array contains. If you manually add overloads of all your functions to handle this additional case, then yes, all those new functions may take up some space. If your functions are small, there is a good chance they will be inlined and take up less space. As the example above shows, inlining and constant folding can greatly reduce your binary size.
In the above example, if you created a second std::array instance, would the calls to functions also duplicate the function calls in flash memory? Even though both instances are of the same type, int?
Again, it depends. If you have many functions templated on the size of the array, both std::array and raw arrays may "create" different functions. But again, if they are inlined, there is no duplication to worry about.
With both a raw array and a std::array, you can pass a pointer to the start of the array along with its size. If you find that more suitable for your case, use it: both raw arrays and std::array support it. A raw array implicitly decays to a pointer, while with std::array you must call arr.data() to get the pointer.

Related

Different instructions when I use direct initialization vs std::initializer_list

I was playing around with various methods of initialization and evaluating the developer experience they provide for various use cases and their performance impacts. In the process, I wrote two snippets of code, one with direct initialization and one with an initializer list.
Direct:
class C {
public:
char* x;
C() : x(new char[10] {'0','1'}) {}
};
C c;
Initializer list:
#include <utility>
#include <algorithm>
class C {
public:
char* x;
C(std::initializer_list<char> il) {
x = new char[10];
std::copy(il.begin(), il.end(), x);
}
};
C c = {'0','1'};
I was expecting std::initializer_list to generate the same assembly and while I can understand why it might be worse (looks more complex), I would have been asking a very different question here if it was actually worse. Imagine my surprise when I saw that it generates better code with fewer instructions!
Link to godbolt with the above snippets - https://godbolt.org/z/i3XZ_K
Direct:
_GLOBAL__sub_I_c:
sub rsp, 8
mov edi, 10
call operator new[](unsigned long)
mov edx, 12592
mov WORD PTR [rax], dx
mov QWORD PTR [rax+2], 0
mov QWORD PTR c[rip], rax
add rsp, 8
ret
c:
.zero 8
Initializer list:
_GLOBAL__sub_I_c:
sub rsp, 8
mov edi, 10
call operator new[](unsigned long)
mov edx, 12592
mov QWORD PTR c[rip], rax
mov WORD PTR [rax], dx
add rsp, 8
ret
c:
.zero 8
The assembly is pretty much the same except for the extra mov QWORD PTR [rax+2], 0 instruction which on first glance seems like termination of the char array with a null character. However, IIRC only string literals (like "01") are automatically terminated with null char, not char array literals.
Would appreciate any insight into why the generated code is different and anything I can do (if possible) to make the direct case better.
Also, am a noob when it comes to x86 assembly, so let me know if I am missing something obvious.
EDIT: It only gets worse if I add more chars - https://godbolt.org/z/e8Gys_
The two example code snippets do not have the same effect.
new char[10] {'0','1'}
This aggregate-initializes the new char array, which means that all element of the array which are not given an initializer in the brace-enclosed initializer-list are initialized to zero.
x = new char[10];
This does not set any of the elements of the new array to any value and
std::copy(il.begin(), il.end(), x);
only sets the elements that are actually specified in the std::initializer_list.
Therefore in the first version the compiler has to guarantee that all elements without specified initializer are set to zero, while it can just leave these values untouched in the second version.
There is no way to partially initialize an array directly in a new expression. But the obvious reproduction of the behavior of the second example would be something like
C() : x(new char[10]) {
constexpr char t[]{'0','1'};
std::copy(std::begin(t), std::end(t), x);
}
although, really, you shouldn't use new directly anyway. Instead use std::vector<char>, std::array<char, 10>, std::string or std::unique_ptr<char[]> depending on the concrete use case. All of these avoid lifetime issues that new would cause.

Why does GCC aggregate initialization of an array fill the whole thing with zeros first, including non-zero elements?

Why does gcc fill the whole array with zeros instead of only the remaining 96 integers? The non-zero initializers are all at the start of the array.
void *sink;
void bar() {
int a[100]{1,2,3,4};
sink = a; // a escapes the function
asm("":::"memory"); // and compiler memory barrier
// forces the compiler to materialize a[] in memory instead of optimizing away
}
MinGW8.1 and gcc9.2 both make asm like this (Godbolt compiler explorer).
# gcc9.2 -O3 -m32 -mno-sse
bar():
push edi # save call-preserved EDI which rep stos uses
xor eax, eax # eax=0
mov ecx, 100 # repeat-count = 100
sub esp, 400 # reserve 400 bytes on the stack
mov edi, esp # dst for rep stos
mov DWORD PTR sink, esp # sink = a
rep stosd # memset(a, 0, 400)
mov DWORD PTR [esp], 1 # then store the non-zero initializers
mov DWORD PTR [esp+4], 2 # over the zeroed part of the array
mov DWORD PTR [esp+8], 3
mov DWORD PTR [esp+12], 4
# memory barrier empty asm statement is here.
add esp, 400 # cleanup the stack
pop edi # and restore caller's EDI
ret
(with SSE enabled it would copy all 4 initializers with movdqa load/store)
Why doesn't GCC do lea edi, [esp+16] and memset (with rep stosd) only the last 96 elements, like Clang does? Is this a missed optimization, or is it somehow more efficient to do it this way? (Clang actually calls memset instead of inlining rep stos)
Editor's note: the question originally had un-optimized compiler output which worked the same way, but inefficient code at -O0 doesn't prove anything. But it turns out that this optimization is missed by GCC even at -O3.
Passing a pointer to a to a non-inline function would be another way to force the compiler to materialize a[], but in 32-bit code that leads to significant clutter of the asm. (Stack args result in pushes, which gets mixed in with stores to the stack to init the array.)
Using volatile int a[100]{1,2,3,4} gets GCC to create and then copy the array, which is insane. Normally volatile is good for looking at how compilers init local variables or lay them out on the stack.
In theory your initialization could look like this (using C designated initializers):
int a[100] = {
[3] = 1,
[5] = 42,
[88] = 1,
};
so it may be more efficient, in terms of cache behavior and optimizability, to first zero out the whole memory block and then set the individual values.
Maybe the behavior changes depending on:
target architecture
target OS
array length
initialization ratio (explicitly initialized values/length)
positions of the initialized values
Of course, in your case the initializers are compacted at the start of the array, and the optimization would be trivial.
So it seems that GCC is taking the most generic approach here. Looks like a missed optimization.

Why does p1007r0 std::assume_aligned remove the need for epilogue?

My understanding is that vectorization of code works something like this:
For data in the array below the first address that is a multiple of 128 (or 256, or whatever the SIMD instructions require), do slow element-by-element processing. Let's call this the prologue.
For data in the array between the first address that is a multiple of 128 and the last address that is a multiple of 128, use SIMD instructions.
For the data between the last address that is a multiple of 128 and the end of the array, use slow element-by-element processing. Let's call this the epilogue.
Now I understand why std::assume_aligned helps with prologue, but I do not get why it enables compiler to remove epilogue also.
Quote from proposal:
If we could make this property visible to the compiler, it could skip the loop prologue and epilogue
You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.
gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.
Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't known. If you know the alignment, you need neither a prologue nor an epilogue. If you don't, you need both: the number of left-over elements after the last aligned vector is not known.
This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. This seems like a weird way to spend that much code-size for a loop that just stores.
// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
p = (int*)__builtin_assume_aligned(p, 64);
for (int i=0 ; i<1024 ; i++ ) {
*p++ = 0x42;
}
}
# gcc7.3 -O3 (arch=tune=generic for x86-64 System V: p in RDI)
lea rax, [rdi+4096] # end pointer
movdqa xmm0, XMMWORD PTR .LC0[rip] # set1_epi32(0x42)
.L2: # do {
add rdi, 16
movaps XMMWORD PTR [rdi-16], xmm0
cmp rax, rdi
jne .L2 # }while(p != endp);
rep ret
This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop exit branch being not-taken while still chewing on the stores.
Thus the unaligned version includes a prologue and an epilogue:
// without any alignment guarantee
void set42(int *p) {
for (int i=0 ; i<1024 ; i++ ) {
*p++ = 0x42;
}
}
~26 instructions of setup, vs. 2 from the aligned version
.L8: # then a bloated loop with 4 uops instead of 3
add eax, 1
add rdx, 16
movaps XMMWORD PTR [rdx-16], xmm0
cmp ecx, eax
ja .L8 # end of main vector loop
# epilogue:
mov eax, esi # then destroy the counter we spent an extra uop on inside the loop. /facepalm
and eax, -4
mov edx, eax
sub r8d, eax
cmp esi, eax
lea rdx, [r9+rdx*4] # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
je .L5
cmp r8d, 1
mov DWORD PTR [rdx], 66 # fully-unrolled final up-to-3 stores
je .L5
cmp r8d, 2
mov DWORD PTR [rdx+4], 66
je .L5
mov DWORD PTR [rdx+8], 66
.L5:
rep ret
Even for a more complex loop which would benefit from a little bit of unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologue/epilogue. It's really bad for AVX2 256-bit vectorization with uint16_t elements or something. (up to 15 elements in the prologue/epilogue, rather than 3). This is not a smart tradeoff, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference for reducing code-bloat.)
BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned. Modern hardware has cheap unaligned 16 and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not every other or every 4th.)
Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned potentially-overlapping vector at the start/end. (See the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.
This is discussed in the document itself in Section 5:
A function that returns a pointer T* , and guarantees that it will
point to over-aligned memory, could return like this:
T* get_overaligned_ptr()
{
// code...
return std::assume_aligned<N>(_data);
}
This technique can be used e.g. in the begin() and end()
implementations of a class wrapping an over-aligned range of data. As
long as such functions are inline, the over-alignment will be
transparent to the compiler at the call-site, enabling it to perform
the appropriate optimisations without any extra work by the caller.
The begin() and end() methods are data accessors for the over-aligned buffer _data. That is, begin() returns a pointer to the first byte of the buffer and end() returns a pointer to one byte past the last byte of the buffer.
Suppose they are defined as follows:
T* begin()
{
// code...
return std::assume_aligned<N>(_data);
}
T* end()
{
// code...
return _data + size; // No alignment hint!
}
In this case, the compiler may not be able to eliminate the epilogue. But if they were defined as follows:
T* begin()
{
// code...
return std::assume_aligned<N>(_data);
}
T* end()
{
// code...
return std::assume_aligned<N>(_data + size);
}
Then the compiler would be able to eliminate the epilogue. For example, if N is 128 bits, then every single 128-bit chunk of the buffer is guaranteed to be 128-bit aligned. Note that this is only possible when the size of the buffer is a multiple of the alignment.

GCC fails to optimize aligned std::array like C array

Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
static constexpr size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
typedef double Vec[my_elements] alignas(32);
#else
typedef std::array<double, my_elements> Vec alignas(32);
#endif
void fun1(const Vec&);
Vec v1{{}};
};
void Foo::fun1(const Vec& __restrict__ v2)
{
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2[i];
}
}
Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:
vmovapd ymm0, YMMWORD PTR [rdi]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi]
vmovapd YMMWORD PTR [rdi], ymm0
vmovapd ymm0, YMMWORD PTR [rdi+32]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovapd YMMWORD PTR [rdi+32], ymm0
vzeroupper
That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:
mov rax, rdi
shr rax, 3
neg rax
and eax, 3
je .L7
The code generated in this case (using std::array instead of a plain C array) seems to check for alignment of the input array--even though it is specified in the typedef as aligned to 32 bytes.
It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:
void Foo::fun2(const Vec& __restrict__ v2)
{
typedef double V2 alignas(Foo::Vec);
const V2* v2a = static_cast<const V2*>(&v2[0]);
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2a[i];
}
}
Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.
You can see it live here: https://godbolt.org/g/IXIOst
Interestingly, if you replace v1[i] += v2a[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), gcc manages to optimize the std::array case as well as the case of the C array.
Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.
This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.
The hack you found, casting to Vec, is another way to let the compiler see, when it handles the memory access, that the type being handled is aligned. For a regular std::array::operator[], the memory access happens inside a member function of std::array, which doesn't have any clue that *this happens to be aligned.
Gcc also has a builtin to let the compiler know about alignment:
const double*v2a=static_cast<const double*>(__builtin_assume_aligned(v2.data(),32));

Register keyword in C++

What is difference between
int x=7;
and
register int x=7;
?
I am using C++.
register is a hint to the compiler, advising it to store that variable in a processor register instead of memory (for example, instead of the stack).
The compiler may or may not follow that hint.
According to Herb Sutter in "Keywords That Aren't (or, Comments by Another Name)":
A register specifier has the same
semantics as an auto specifier...
According to Herb Sutter, register is "exactly as meaningful as whitespace" and has no effect on the semantics of a C++ program.
In C++ as it existed in 2010, any valid program that uses the keywords "auto" or "register" will be semantically identical to one with those keywords removed (unless they appear in stringized macros or other similar contexts). In that sense the keywords are useless for properly-compiling programs. On the other hand, they might be useful in certain macro contexts to ensure that improper usage of a macro causes a compile-time error rather than producing bogus code.
In C++11 and later versions of the language, the auto keyword was re-purposed to act as a pseudo-type for objects which are initialized: the compiler replaces it with the type of the initializing expression. Thus, in C++03, the declaration auto int i=(unsigned char)5; was equivalent to int i=5; when used within a block context, and auto i=(unsigned char)5; was a constraint violation. In C++11, auto int i=(unsigned char)5; became a constraint violation, while auto i=(unsigned char)5; became equivalent to unsigned char i=5;.
With today's compilers, probably nothing. It was originally a hint to place a variable in a register for faster access, but most compilers today ignore that hint and decide for themselves.
register is deprecated in C++11. It is unused and reserved in C++17.
Source: http://en.cppreference.com/w/cpp/keyword/register
Almost certainly nothing.
register is a hint to the compiler that you plan on using x a lot, and that you think it should be placed in a register.
However, compilers are now far better at determining which values should be placed in registers than the average (or even expert) programmer is, so compilers simply ignore the keyword and do what they want.
The register keyword was useful for:
Inline assembly.
Expert C/C++ programming.
Cacheable variables declaration.
An example from a production system where the register keyword was required:
typedef unsigned long long Out;
volatile Out out,tmp;
Out register rax asm("rax");
asm volatile("rdtsc":"=A"(rax));
out=out*tmp+rax;
It has been deprecated since C++11 and is unused and reserved in C++17.
As of gcc 9.3, compiling with -std=c++2a, register produces a compiler warning, but it still has the desired effect and behaves identically to C's register when compiling without -O1 through -Ofast optimisation flags, in the respects covered by this answer. Using clang++-7, however, causes a compiler error. So yes, register optimisations only make a difference in a standard compilation with no -O flags, but they are basic optimisations that the compiler would figure out even at -O1.
The only difference is that in C++ you are allowed to take the address of a register variable. The optimisation therefore only happens if you never take the address of the variable or of its aliases (to create a pointer) and never take a reference to it in the code. (The reference case only matters at -O0, because a reference also has an address: it is a const pointer on the stack which, like a pointer, can be optimised off the stack when compiling with -Ofast; in fact with -Ofast it never appears on the stack at all, because unlike a pointer it cannot be made volatile and its address cannot be taken.) Otherwise the variable behaves as if you hadn't used register, and the value is stored on the stack.
On -O0, another difference is that const register does not behave the same on gcc C and gcc C++. On gcc C, const register behaves like plain register: the register optimisation applies, but block-scope const gets no optimisation of its own. On clang C it is the reverse: register does nothing and only the const block-scope optimisation applies. On gcc C++, the register and const block-scope optimisations combine.
#include <stdio.h> //yes it's C code on C++
int main(void) {
const register int i = 3;
printf("%d", i);
return 0;
}
int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
register int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
mov rbx, QWORD PTR [rbp-8] //callee restoration
leave
ret
const int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3 //still saves to stack
mov esi, 3 //immediate substitution
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
const register int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
mov esi, 3 //loads straight into esi saving rbx push/pop and extra indirection (because C++ block-scope const is always substituted immediately into the instruction)
mov edi, OFFSET FLAT:.LC0 // can't optimise away because printf only takes const char*
mov eax, 0 //zeroed: https://stackoverflow.com/a/6212755/7194773
call printf
mov eax, 0 //default return value of main is 0
pop rbp //nothing else pushed to stack -- more efficient than leave (rsp == rbp already)
ret
register tells the compiler to 1) store a local variable in a callee-saved register (in this case rbx), and 2) optimise out stack writes if the address of the variable is never taken. const tells the compiler to substitute the value immediately (instead of assigning it a register or loading it from memory) and to write the local variable to the stack as default behaviour. const register is the combination of these two optimisations. This is as slimline as it gets.
Also, on gcc C and C++, register on its own seems to create a random 16 byte gap on the stack for the first local on the stack, which doesn't happen with const register.
Compiling with -Ofast, however, register has no optimisation effect: if the variable can be put in a register or made immediate, it always will be, and if it can't, it won't be. const still optimises out the load on C and C++, but only at file scope; volatile still forces values to be stored and loaded from the stack.
.LC0:
.string "%d"
main:
//optimises out push and change of rbp
sub rsp, 8 //https://stackoverflow.com/a/40344912/7194773
mov esi, 3
mov edi, OFFSET FLAT:.LC0
xor eax, eax //xor 2 bytes vs 5 for mov eax, 0
call printf
xor eax, eax
add rsp, 8
ret
Consider a case where the compiler's optimizer has two variables and is forced to spill one onto the stack. Suppose both variables have the same weight to the compiler. Given there is no difference, the compiler will arbitrarily spill one of them. The register keyword, on the other hand, gives the compiler a hint about which variable will be accessed more frequently. It is similar to the x86 prefetch instruction, but for the compiler's optimizer.
Obviously, register hints are similar to user-provided branch-probability hints, and can be inferred from them. If the compiler knows that some branch is taken often, it will keep branch-related variables in registers. So I suggest caring more about branch hints and forgetting about register. Ideally your profiler should communicate with the compiler somehow and spare you from even thinking about such nuances.