g++ c++11 constexpr evaluation performance - c++

g++ (4.7.2) and similar versions seem to evaluate constexpr surprisingly fast at compile time. On my machines, in fact, much faster than the compiled program at runtime.
Is there a reasonable explanation for this behavior?
Are there optimization techniques that are only
applicable at compile time and can execute faster than actual compiled code?
If so, which?
Here's my test program and the observed results.
#include <iostream>

constexpr int mc91(int n)
{
    return (n > 100) ? n - 10 : mc91(mc91(n + 11));
}

constexpr double foo(double n)
{
    return (n > 2) ? (0.9999) * ((unsigned int)(foo(n - 1) + foo(n - 2)) % 100) : 1;
}

constexpr unsigned ack(unsigned m, unsigned n)
{
    return m == 0
        ? n + 1
        : n == 0
            ? ack(m - 1, 1)
            : ack(m - 1, ack(m, n - 1));
}

constexpr unsigned slow91(int n)
{
    return mc91(mc91(foo(n)) % 100);
}

int main(void)
{
    constexpr unsigned int compiletime_ack = ack(3, 14);
    constexpr int compiletime_91 = slow91(49);
    static_assert(compiletime_ack == 131069, "Must be evaluated at compile-time");
    static_assert(compiletime_91 == 91, "Must be evaluated at compile-time");
    std::cout << compiletime_ack << std::endl;
    std::cout << compiletime_91 << std::endl;
    std::cout << ack(3, 14) << std::endl;
    std::cout << slow91(49) << std::endl;
    return 0;
}
compiletime:
time g++ constexpr.cpp -std=c++11 -fconstexpr-depth=10000000 -O3
real 0m0.645s
user 0m0.600s
sys 0m0.032s
runtime:
time ./a.out
131069
91
131069
91
real 0m43.708s
user 0m43.567s
sys 0m0.008s
Here mc91 is the usual McCarthy 91 function (as can be found on Wikipedia) and foo is just a useless function returning real values between about 1 and 100, with Fibonacci-like runtime complexity.
Both the slow calculation of 91 and the Ackermann function get evaluated with the same arguments by the compiler and by the compiled program.
Surprisingly, the program even runs faster if you generate the code and run it through the compiler than if you execute the compiled code itself.

At compile time, redundant (identical) constexpr calls can be memoized, while the runtime recursion gets no such caching.
If you change every recursive function such as...
constexpr unsigned slow91(int n) {
    return mc91(mc91(foo(n)) % 100);
}
... to a form that isn't constexpr, but does remember past calculations at runtime:
std::unordered_map<int, boost::optional<unsigned>> results4;
//                 ^^^ parameter(s)  ^^^^^^^^ result

unsigned slow91(int n) {
    boost::optional<unsigned> &ret = results4[n];
    if (!ret)
    {
        ret = mc91(mc91(foo(n)) % 100);
    }
    return *ret;
}
You will get less surprising results.
compiletime:
time g++ test.cpp -std=c++11 -O3
real 0m1.708s
user 0m1.496s
sys 0m0.176s
runtime:
time ./a.out
131069
91
131069
91
real 0m0.097s
user 0m0.064s
sys 0m0.032s
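For reference, the same caching pattern can be written without Boost using std::optional from C++17. A minimal sketch, applied to mc91 itself (the function name is mine, not from the original):

```cpp
#include <optional>
#include <unordered_map>

// McCarthy 91, memoized per argument with std::optional instead of
// boost::optional. References into std::unordered_map remain valid across
// later inserts, so holding `ret` across the recursive calls is safe.
int mc91_memo(int n) {
    static std::unordered_map<int, std::optional<int>> cache;
    std::optional<int>& ret = cache[n];
    if (!ret)
        ret = (n > 100) ? n - 10 : mc91_memo(mc91_memo(n + 11));
    return *ret;
}
```

Each distinct argument is computed once; every later call with the same argument is a single hash lookup.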

Memoization
This is a very interesting "discovery", but the answer is probably simpler than you think.
Something can be evaluated at compile time when declared constexpr if all values involved are known at compile time (and if the variable the value is supposed to end up in is declared constexpr as well). With that said, imagine the following pseudo-code:
f(x) = g(x)
g(x) = x + h(x,x)
h(x,y) = x + y
Since every value is known at compile time, the compiler can rewrite the above into the equivalent form below:
f(x) = x + x + x
To put it in words, every function call has been removed and replaced with the expression itself. Also applicable is a method called memoization, where the results of previously calculated expressions are stored away, so you only need to do the hard work once.
If you know that g(5) = 15, why calculate it again? Instead just replace g(5) with 15 every time it is needed. This is possible since a function declared constexpr isn't allowed to have side effects.
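The pseudo-code above translates directly into constexpr C++, and a static_assert demonstrates the compile-time folding (a minimal sketch):

```cpp
constexpr int h(int x, int y) { return x + y; }
constexpr int g(int x) { return x + h(x, x); }
constexpr int f(int x) { return g(x); }

// Every call is folded away at compile time: f(5) becomes 5 + 5 + 5.
static_assert(f(5) == 15, "evaluated entirely at compile time");
static_assert(g(5) == 15, "g(5) is a known constant, no need to recalculate");
```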
Runtime
At runtime this does not happen (since we didn't tell the code to behave this way). The little guy running through your code will need to jump from f to g to h, and then jump back to g from h, before he jumps from g to f, all while storing the return value of each function and passing it along to the next one.
Even if this guy is very, very tiny, and doesn't need to jump very far, he still doesn't like jumping back and forth all the time; it takes a lot of effort, and with that, it takes time.
But in the OP's example, is it really calculated at compile time?
Yes, and for those who don't believe that the compiler actually calculates this and puts the results as constants in the finished binary, I will supply the relevant assembly instructions from the OP's code below (output of g++ -S -Wall -pedantic -fconstexpr-depth=1000000 -std=c++11):
main:
.LFB1200:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $131069, -4(%rbp)
movl $91, -8(%rbp)
movl $131069, %esi # one of the values from constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEj
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movl $91, %esi # the other value from our constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEi
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
# ...
# a lot of jumping is taking place down here
# see the full output at http://codepad.org/Q8D7c41y

Related

In C++, is there any way to specialise a function template for specific values of arguments?

I have a broadly used function foo(int a, int b), and I want to provide a special version of foo that performs differently if a is, say, 1.
a) I don't want to go through the whole code base and change all occurrences of foo(1, b) to foo1(b), because the rules on arguments may change and I don't want to keep going through the code base whenever they do.
b) I don't want to burden foo with an "if (a == 1)" test, because of performance concerns.
It seems to me a fundamental skill of the compiler to call the right code based on what it can see in front of it. Or is this a missing feature of C++ that currently requires macros or something to handle?
Simply write
inline void foo(int a, int b)
{
    if (a == 1) {
        // skip complex code and call easy code
        call_easy(b);
    } else {
        // complex code here
        do_complex(a, b);
    }
}
When you call
foo(1, 10);
the optimizer will/should simply insert a call_easy(b).
Any decent optimizer will inline the function and detect whether it has been called with a==1. Also, the constexpr approach mentioned in other posts is nice, but not really necessary in your case. constexpr is very useful if you want to resolve values at compile time, but you simply asked to switch code paths based on a runtime value. The optimizer should be able to detect that.
In order to detect that, the optimizer needs to see your function definition at every place where the function is called. Hence the inline requirement, although compilers such as Visual Studio have a "generate code at link time" feature that relaxes this requirement somewhat.
Finally, you might want to look at the C++ attribute [[likely]]. I haven't worked with it yet, but it is supposed to tell the compiler which execution path is likely, as a hint to the optimizer.
And why don't you experiment a little and look at the generated code in the debugger/disassembler? That will give you a feel for the optimizer. Don't forget that the optimizer is likely only active in release builds :)
Templates work at compile time, and you want to decide at runtime, which is never possible. If and only if you really can call your function with constexpr values can you change to a template, but the call becomes foo<1,2>() instead of foo(1,2). As for the "performance issues": if that single compare assembler instruction is the performance problem, then you have done everything else super perfectly :-)
BTW: If you already call with constexpr values and the function is visible in the compilation unit, you can be sure the compiler already knows how to optimize it away...
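The template alternative mentioned above could be sketched like this (the bodies are illustrative placeholders, not from the original question; if constexpr requires C++17):

```cpp
// The value of `a` becomes a template parameter, so the branch is
// resolved at compile time and only one path is instantiated.
template <int A>
int foo(int b) {
    if constexpr (A == 1) {
        return b;       // stand-in for the "easy" path
    } else {
        return A + b;   // stand-in for the "complex" path
    }
}
```

Note that the call site changes from foo(1, b) to foo<1>(b), which is exactly the code-base churn the question wanted to avoid.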
But there is another way to handle such things, if you really have constexpr values sometimes and the algorithm inside the function can be evaluated as constexpr. In that case, you can decide inside the function whether it was called in a constexpr context. If so, you can run a full compile-time algorithm, which can also contain your if (a == 1) and will be fully evaluated at compile time. If the function is not called in a constexpr context, it runs as before, without any additional overhead.
To make this decision at compile time we need the current C++ standard (C++20):
#include <type_traits>

constexpr int foo(int a, int)
{
    if (std::is_constant_evaluated())
    {   // this part is fully evaluated at compile time!
        if (a == 1)
        {
            return 1;
        }
        else
        {
            return 2;
        }
    }
    else
    {   // and the rest runs as before at runtime
        if (a == 0)
        {
            return 3;
        }
        else
        {
            return 4;
        }
    }
}
int main()
{
    constexpr int res1 = foo(1, 0); // fully evaluated during compile time
    constexpr int res2 = foo(2, 0); // also full compile time
    std::cout << res1 << std::endl;
    std::cout << res2 << std::endl;
    std::cout << foo(5, 0) << std::endl; // here we go in runtime
    std::cout << foo(0, 0) << std::endl; // here we go in runtime
}
That code will return:
1
2
4
3
So we do not need to go with classic templates; there is no need to change the rest of the code, but we get full compile-time optimization where possible.
@Sebastian's suggestion works, at least in this simple case, with all optimization levels except -O0 in g++ 9.3.0 on Ubuntu 20.04 in C++20 mode. Thanks again.
See in the disassembly below that the correct subfunction func1 or func2 is always called directly instead of the top-level func(). A similar disassembly at -O0 shows only the top-level func() being called, leaving the decision to runtime, which is not desired.
I hope this will work in production code, and perhaps with multiple hard-coded arguments.
Breakpoint 1, main () at p1.cpp:24
24 int main() {
(gdb) disass /m
Dump of assembler code for function main():
6 inline void func(int a, int b) {
7
8 if (a == 1)
9 func1(b);
10 else
11 func2(a,b);
12 }
13
14 void func1(int b) {
15 std::cout << "func1 " << " " << " " << b << std::endl;
16 }
17
18 void func2(int a, int b) {
19 std::cout << "func2 " << a << " " << b << std::endl;
20 }
21
22 };
23
24 int main() {
=> 0x0000555555555286 <+0>: endbr64
0x000055555555528a <+4>: push %rbp
0x000055555555528b <+5>: push %rbx
0x000055555555528c <+6>: sub $0x18,%rsp
0x0000555555555290 <+10>: mov $0x28,%ebp
0x0000555555555295 <+15>: mov %fs:0x0(%rbp),%rax
0x000055555555529a <+20>: mov %rax,0x8(%rsp)
0x000055555555529f <+25>: xor %eax,%eax
25
26 X x1;
27
28 int b=1;
29 x1.func(1,b);
0x00005555555552a1 <+27>: lea 0x7(%rsp),%rbx
0x00005555555552a6 <+32>: mov $0x1,%esi
0x00005555555552ab <+37>: mov %rbx,%rdi
0x00005555555552ae <+40>: callq 0x55555555531e <X::func1(int)>
30
31 b=2;
32 x1.func(2,b);
0x00005555555552b3 <+45>: mov $0x2,%edx
0x00005555555552b8 <+50>: mov $0x2,%esi
0x00005555555552bd <+55>: mov %rbx,%rdi
0x00005555555552c0 <+58>: callq 0x5555555553de <X::func2(int, int)>
33
34 b=3;
35 x1.func(1,b);
0x00005555555552c5 <+63>: mov $0x3,%esi
0x00005555555552ca <+68>: mov %rbx,%rdi
0x00005555555552cd <+71>: callq 0x55555555531e <X::func1(int)>
36
37 b=4;
38 x1.func(2,b);
0x00005555555552d2 <+76>: mov $0x4,%edx
0x00005555555552d7 <+81>: mov $0x2,%esi
0x00005555555552dc <+86>: mov %rbx,%rdi
0x00005555555552df <+89>: callq 0x5555555553de <X::func2(int, int)>
39
40 return 0;
0x00005555555552e4 <+94>: mov 0x8(%rsp),%rax
0x00005555555552e9 <+99>: xor %fs:0x0(%rbp),%rax
0x00005555555552ee <+104>: jne 0x5555555552fc <main()+118>
0x00005555555552f0 <+106>: mov $0x0,%eax
0x00005555555552f5 <+111>: add $0x18,%rsp
0x00005555555552f9 <+115>: pop %rbx
0x00005555555552fa <+116>: pop %rbp
0x00005555555552fb <+117>: retq
0x00005555555552fc <+118>: callq 0x555555555100 <__stack_chk_fail#plt>
End of assembler dump.

Why does a jmp instruction after a lock add have so many counts for the cycles event?

I have a function GetObj(int), which takes a param that is a vector index. Here's my C++ code:
std::shared_ptr<Obj> GetObj(int obj_id) {
    // obj_buffer_ is an array of 2 buffers,
    // declared as std::vector<std::shared_ptr<Obj>> obj_buffer_[2];
    // cur_buffer_ is an atomic<size_t> var, its value may be 0 or 1
    auto& vec_obj = obj_buffer_[cur_buffer_.load(std::memory_order_acquire)];
    // let's assume that obj_id is never illegal at runtime, so the program never returns nullptr
    if (obj_id < 0 || obj_id >= vec_obj.size()) {
        return nullptr;
    }
    // vec_obj[obj_id] is a shared_ptr<Obj>
    return vec_obj[obj_id];
}
I call this function 5 million times per second, and use "perf top" to see which instruction costs the most. Here is the result.
Please ignore the function & variable names (GetShard is GetObj); I changed them to make the code easier to understand.
Here is my question. Let's pay attention to the last instructions:
│ ↓ je 6f
0.12 │ lock addl $0x1,0x8(%rdx) ............... (1)
91.85 │ ┌──jmp 73 ............... (2)
│6f:│ addl $0x1,0x8(%rdx) ............... (3)
0.12 │73:└─→pop %rbp ............... (4)
│ ← retq ............... (5)
I found that instruction (1), which increments the shared_ptr's counter by 1 (the atomic increment locks the CPU bus), then jumps to (4), pop %rbp, which pops the stack. These instructions are doing the "return" job.
But why does the jmp instruction take 90% of the CPU cycles? The function is called 5 million times/s; I could understand if "lock addl" were slow, but why does the jmp instruction take so long?

GCC not performing loop invariant code motion

I decided to check the result of the loop-invariant code motion optimization using g++. However, when I compiled the following code with -fmove-loop-invariants and analyzed its assembly, I saw that the k + 17 calculation is still performed in the loop body.
What could prevent the compiler from optimizing it?
Maybe the compiler concludes that it is more efficient to recalculate k + 17?
#include <cstdio>
#include <iostream>

int main()
{
    int k = 0;
    std::cin >> k;
    for (int i = 0; i < 10000; ++i)
    {
        int n = k + 17; // not moved out of the loop
        printf("%d\n", n);
    }
    return 0;
}
Tried g++ -O0 -fmove-loop-invariants, g++ -O3 and g++ -O3 -fmove-loop-invariants using both g++ 4.6.3 and g++ 4.8.3.
EDIT: Ignore my previous answer. You can see that the calculation has been folded into a constant, so the loop-invariant optimization is in fact being performed.
Because of the as-if rule. Simply put, the compiler is not allowed to make any optimizations that affect the observable behavior of the program, in this case the printf calls. You can see what happens if you make n volatile and remove the printf:
for (int i = 0; i < 10000; ++i)
{
    volatile int n = k + 17; // the addition is now hoisted out of the loop
}
// Example assembly output for GCC 4.6.4
// ...
movl $10000, %eax
addl $17, %edx
.L2:
subl $1, %eax
movl %edx, 12(%rsp)
// ...
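The transformation the optimizer performs can also be written by hand. A sketch of the hoisted form, with the loop body reduced to a sum so the effect is observable without printf (the function name is mine):

```cpp
// Hand-hoisted equivalent: k + 17 is loop-invariant, so it is computed
// once before the loop instead of on every iteration.
int sum_loop(int k) {
    const int n = k + 17;  // hoisted out of the loop
    int total = 0;
    for (int i = 0; i < 10000; ++i)
        total += n;
    return total;
}
```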

Do C++ compilers perform compile-time optimizations on lambda closures?

Suppose we have the following (nonsensical) code:
const int a = 0;
int c = 0;
for (int b = 0; b < 10000000; b++)
{
    if (a) c++;
    c += 7;
}
Variable 'a' equals zero, so the compiler can deduce at compile time that the instruction 'if(a) c++;' will never be executed, and will optimize it away.
My question: Does the same happen with lambda closures?
Check out another piece of code:
const int a = 0;
function<int()> lambda = [a]()
{
    int c = 0;
    for (int b = 0; b < 10000000; b++)
    {
        if (a) c++;
        c += 7;
    }
    return c;
};
Will the compiler know that 'a' is 0 and will it optimize the lambda?
Even more sophisticated example:
function<int()> generate_lambda(const int a)
{
    return [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
}

function<int()> a_is_zero = generate_lambda(0);
function<int()> a_is_one = generate_lambda(1);
Will the compiler be smart enough to optimize the first lambda when it knows that 'a' is 0 at generation time?
Does gcc or llvm have this kind of optimizations?
I'm asking because I wonder if I should make such optimizations manually when I know that certain assumptions are satisfied on lambda generation time or the compiler will do that for me.
Looking at the assembly generated by gcc5.2 -O2 shows that the optimization does not happen when using std::function:
#include <functional>

int main()
{
    const int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to some boilerplate and
movl (%rdi), %ecx
movl $10000000, %edx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
cmpl $1, %ecx
sbbl $-1, %eax
addl $7, %eax
subl $1, %edx
jne .L3
rep; ret
which is the loop you wanted to see optimized away. (Live) But if you actually use a lambda (and not an std::function), the optimization does happen:
int main()
{
    const int a = 0;
    auto lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to
movl $70000000, %eax
ret
i.e. the loop was removed completely. (Live)
Afaik, you can expect a lambda to have zero overhead, but std::function is different and comes with a cost (at least in the current state of the optimizers, although people are apparently working on this), even if the code "inside the std::function" would have been optimized. (Take that with a grain of salt and try it if in doubt, since this will probably vary between compilers and versions; std::function's overhead can certainly be optimized away.)
As @MarcGlisse correctly pointed out, clang 3.6 performs the desired optimization (equivalent to the second case above) even with std::function. (Live)
Bonus edit, thanks to @MarcGlisse again: if the function that contains the std::function is not called main, the optimization gcc 5.2 performs is somewhere between gcc+main and clang, i.e. the function gets reduced to return 70000000; plus some extra code. (Live)
Bonus edit 2, this time mine: if you use -O3, gcc will (for some reason, as explained in Marco's answer) optimize the std::function to
cmpl $1, (%rdi)
sbbl %eax, %eax
andl $-10000000, %eax
addl $80000000, %eax
ret
and keep the rest as in the not_main case. So I guess at the end of the day, one will just have to measure when using std::function.
Both gcc at -O3 and MSVC2015 Release won't optimize it away with this simple code, and the lambda will actually be called:
#include <functional>
#include <iostream>

int main()
{
    int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    std::cout << lambda();
    return 0;
}
At -O3 this is what gcc generates for the lambda (code from godbolt)
lambda:
cmp DWORD PTR [rdi], 1
sbb eax, eax
and eax, -10
add eax, 80
ret
This is a contrived and optimized way to express the following:
If a were 0, the first comparison would set the carry flag CF. eax would then be set to all ones (i.e. -1), and'ed with -10 (which yields -10 in eax), and then 80 would be added -> result is 70.
If a were something other than 0, the first comparison would not set the carry flag CF, eax would be set to zero, the and would have no effect, and 80 would be added -> result is 80.
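The same arithmetic can be re-derived in C++ (a sketch; the function name is mine, not from the original):

```cpp
// C++ re-derivation of the cmp/sbb/and/add sequence gcc emitted:
//   cmp [rdi], 1   ; sets CF exactly when a == 0 (unsigned borrow)
//   sbb eax, eax   ; eax = (a == 0) ? -1 : 0
//   and eax, -10
//   add eax, 80
int lambda_result(int a) {
    int mask = (a == 0) ? -1 : 0;  // all bits set iff the carry was set
    return (mask & -10) + 80;      // 70 when a == 0, 80 otherwise
}
```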
It has to be noted (thanks Marc Glisse) that if the function is marked as cold (i.e. unlikely to be called), gcc does the right thing and optimizes the call away.
MSVC generates more verbose code, but the comparison isn't skipped.
Clang is the only one that gets it right: the lambda's code isn't optimized any further than gcc's, but the lambda is never called:
mov edi, std::cout
mov esi, 70
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Moral: Clang seems to get it right, but the optimization challenge is still open.

Is a for() loop maintained on the stack, or is for just a statement?

Very basic code:
Code#1
int i;
int counterONE = 0;
for (i = 0; i < 5; i++) {
    counterONE += 1;
}
cout << counterONE;
Code#2
int i, j;
int counterONE = 0, counterTWO = 0;
for (i = 0; i < 5; i++) {
    for (j = 0; j < 5; j++) {
        counterONE++;
    }
    counterTWO++;
}
cout << counterONE;
cout << endl << counterTWO;
For both of these snippets my questions are:
How do those loops work? Are stack frames maintained?
How is the internal memory maintained? Is there a queue?
Why does for look like a function(){} body, and how is the body resolved?
And please don't answer in short; I need a complete elaboration.
for is a simple loop that is translated to a "goto" (or similar) in the machine code, in order to make some commands repeat themselves:
for (int i = 0; i < 5; i++) {
    some code
}
some more code
will be translated to something like (very simplified)
R_x = 0 // R_x is some register
loop: check if R_x >= 5
if so, go to "after"
some code
increase R_x
go to loop
after: some more code
This code does not involve any recursion, and the importance of the stack here is negligible (only one is used, and only to store the automatic variables).
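The same lowering can be written as real C++ with an explicit goto, purely to illustrate what the compiler emits (the counter stands in for "some code"; the function name is mine):

```cpp
// The for loop from Code#1, hand-lowered to the goto form described above.
int run_loop() {
    int counterONE = 0;
    int i = 0;             // R_x = 0
loop:
    if (i >= 5) goto after;  // check if R_x >= 5; if so, go to "after"
    counterONE += 1;         // loop body ("some code")
    i++;                     // increase R_x
    goto loop;
after:
    return counterONE;       // "some more code"
}
```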
The real answer is that for is not a function. It is a keyword,
introducing a special type of statement.
Don't be fooled by the parentheses. Parentheses in C++ are overloaded:
they may be an operator (the function call operator), or they may be
punctuation, which is part of the syntax. When they follow the keyword
for (or while or if or switch), they are punctuation, and not
the function call operator; many people, like myself, like to
differentiate the two uses by formatting them differently, putting a
space between the keyword and the opening parenthesis when it is
punctuation at the statement level, but using no space between the name
of a function and the ( operator. (Everyone I know who does this also
treats all of the formats for casting as if they were a function,
although technically...)
EDIT:
For what it's worth: you can overload the () operator. The overload will be considered in cases where the parentheses are operators (and in the context of a function style cast), but not when they are punctuation.
For both of the codes my question is: how do those loops work? Are stack frames maintained?
Let's look at what the compiler generates for your loop. I took your first snippet and built the following program with it1:
#include <stdio.h>

int main( void )
{
    int i;
    int counterONE = 0;
    for ( i = 0; i < 5; i++ )
    {
        counterONE += 1;
    }
    return 0;
}
Here's the equivalent assembly code generated by gcc (using gcc -S)2, annotated by me:
.file "loops.c"
.text
.globl main
.type main, #function
main: ;; function entry point
.LFB2:
pushq %rbp ;; save current frame pointer
.LCFI0:
movq %rsp, %rbp ;; make stack pointer new frame pointer
.LCFI1:
movl $0, -4(%rbp) ;; initialize counterONE to 0
movl $0, -8(%rbp) ;; initialize i to 0
jmp .L2 ;; jump to label L2 below
.L3:
addl $1, -4(%rbp) ;; add 1 to counterONE
addl $1, -8(%rbp) ;; add 1 to i
.L2:
cmpl $4, -8(%rbp) ;; compare i to the value 4
jle .L3 ;; if i is <= 4, jump to L3
movl $0, %eax ;; return 0
leave
ret
The only stack frame involved is the one created for the main function; no additional stack frames are created within the for loop itself. Even if you declared a variable local to the for loop, such as
for ( int i = 0; i < 5; i++ )
{
...
}
or
for ( i = 0; i < 5; i++ )
{
int j;
...
}
a new stack frame will (most likely) not be created for the loop; any variables local to the loop will be created in the enclosing function's frame, although the variable will not be visible to code outside of the loop body.
How is the internal memory maintained? Is there a queue?
No additional data structures are necessary. The only memory involved is the memory for i (which controls the execution of the loop) and counterONE, both of which are maintained on the stack3. They are referred to by their offset from the address stored in the frame pointer (for example, if %rbp contained the address 0x8000, then the memory for i would be stored at address 0x8000 - 8 == 0x7ff8 and the memory for counterONE would be stored at address 0x8000 - 4 == 0x7ffc).
Why does for look like a function(){} body, and how is the body resolved?
The language grammar tells the compiler how to interpret the code.
Here's the grammar for an iteration statement (taken from the online C 2011 draft):
(6.8.5) iteration-statement:
while ( expression ) statement
do statement while ( expression ) ;
for ( expression_opt ; expression_opt ; expression_opt ) statement
for ( declaration expression_opt ; expression_opt ) statement
Likewise, here's the grammar for a function call:
(6.5.2) postfix-expression:
...
postfix-expression ( argument-expression-list_opt )
...
and a function definition:
(6.9.1) function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement
During parsing, the compiler breaks the source code up into tokens - keywords, identifiers, constants, string literals, and punctuators. The compiler then tries to match sequences of tokens against the grammar.
So, assuming the source file contains
for ( i = 0; i < 5; i++ )
the compiler will see the for keyword; based on the grammar, it knows to interpret the following ( i = 0; i < 5; i++ ) as a loop control body, rather than a function call or a function definition.
That's the 50,000 foot view, anyway; parsing is a pretty involved subject, and it's only part of what a compiler does. You might want to start here and follow the links.
Just know that this isn't something you're going to pick up in a weekend.
1. Note - formatting really matters. If you want us to help you, you need to make your code as readable as possible. Sloppy formatting, inconsistent indenting, etc., makes it harder to find errors, and makes it less likely that someone will take time out to help.
2. This isn't the complete listing, but the rest isn't relevant to your question.
3. This explanation applies to commonly used architectures like x86, but be aware there are some old and/or oddball architectures that may do things differently.