Is it the case that comparisons imply a branch? - c++

I was reading the Wikibooks page on optimization:
http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Code_optimization/Pipeline
and I came across the line:
For pipelined processors, comparisons are slower than differences, because they imply a branch.
Why is it the case that comparisons imply a branch?
For example if:
int i = 2;
int x = i<5;
Is there a branch in this? It makes sense to me that an if statement with a conditional would branch, but I don't understand why a comparison alone would cause one.

Preamble: Modern compilers are capable of eliminating branches in various ways. Thus, none of the examples necessarily results in a branch in the final (assembler or machine) code.
So why does the logic basically imply branches?
The code
bool check_interval_branch(int const i, int const min_i, int const max_i)
{
    return min_i <= i && i <= max_i;
}
can be logically rewritten to be:
bool check_interval_branch(int const i, int const min_i, int const max_i)
{
    if (min_i <= i)
    {
        if (i <= max_i) return true;
    }
    return false;
}
Here you obviously have two branches (where the second one is only evaluated if the first one is true - short circuit), each of which can be mispredicted by the branch predictor, which in turn leads to a pipeline flush.
Visual Studio 2013 (with optimization turned on) generates the following assembly, containing two branches, for check_interval_branch:
push ebp
mov ebp, esp
mov eax, DWORD PTR _i$[ebp]
cmp DWORD PTR _min_i$[ebp], eax // comparison
jg SHORT $LN3#check_inte // conditional jump
cmp eax, DWORD PTR _max_i$[ebp] // comparison
jg SHORT $LN3#check_inte // conditional jump
mov al, 1
pop ebp
ret 0
$LN3#check_inte:
xor al, al
pop ebp
ret 0
The code
bool check_interval_diff(int const i, int const min_i, int const max_i)
{
    return unsigned(i - min_i) <= unsigned(max_i - min_i);
}
is logically identical to
bool check_interval_diff(int const i, int const min_i, int const max_i)
{
    if (unsigned(i - min_i) <= unsigned(max_i - min_i)) { return true; }
    return false;
}
which contains only a single branch but executes two differences.
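To see why the single unsigned comparison covers both bounds, here is a small worked example (the values are mine, not from the original answer): anything below min_i wraps around to a huge unsigned value and fails the same comparison that catches values above max_i.
#include <cassert>

bool check_interval_diff(int const i, int const min_i, int const max_i)
{
    return unsigned(i - min_i) <= unsigned(max_i - min_i);
}

int main()
{
    assert( check_interval_diff(15, 10, 20));  // in range: 5u <= 10u
    assert(!check_interval_diff(25, 10, 20));  // above max: 15u > 10u
    assert(!check_interval_diff( 3, 10, 20));  // below min: unsigned(-7) is huge, > 10u
}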
The generated code for check_interval_diff of Visual Studio 2013 doesn't even contain a conditional jump.
push ebp
mov ebp, esp
mov edx, DWORD PTR _i$[ebp]
mov eax, DWORD PTR _max_i$[ebp]
sub eax, DWORD PTR _min_i$[ebp]
sub edx, DWORD PTR _min_i$[ebp]
cmp eax, edx // comparison
sbb eax, eax
inc eax
pop ebp
ret 0
(The trick here is that sbb eax, eax yields 0 or -1 depending on the carry flag, which the preceding cmp sets according to the unsigned comparison; the inc then turns that into 1 or 0.)
In fact you see three differences here (2x sub, 1x sbb).
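As a rough C sketch of what those three instructions compute (my own illustration of the idiom, with eax holding max_i - min_i and edx holding i - min_i as in the listing above):
unsigned sbb_inc_trick(unsigned eax, unsigned edx)
{
    unsigned carry = (eax < edx) ? 1u : 0u;  // carry flag produced by cmp eax, edx
    eax = 0u - carry;                        // sbb eax, eax -> 0 or 0xFFFFFFFF
    return eax + 1;                          // inc eax -> 1 (in range) or 0 (out of range)
}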
It probably depends on your data / the use case which one is faster.
See Mystical's answer here about branch prediction.
Your code int x = i<5; is logically identical to
int x = false;
if (i < 5)
{
    x = true;
}
which again contains a branch (x = true is only executed if i < 5).

This involves only a single branch:
unsigned(i - min_i) <= unsigned(max_i - min_i)
While this involves two:
min_i <= i && i <= max_i
When a CPU encounters a branch, it consults its predictor and follows the most likely path. If the prediction is correct, the branch is essentially free in terms of performance. If the prediction is wrong, the CPU needs to flush the pipeline and start all over.
This kind of optimization is a double-edged sword --- if your branches are highly predictable, the first might actually run slower than the second. It entirely depends on how much you know about your data.

While the answers given here are good, not all comparisons are translated into branch instructions (they do introduce data dependencies which may also cost you some performance).
For example, the following C code
int main()
{
    volatile int i;
    int x = i<5;
    return x;
}
is compiled by gcc (x86-64, optimizations enabled) into:
movl -4(%rbp), %eax
cmpl $5, %eax
setl %al
movzbl %al, %eax
The setl instruction sets the value of AL according to the result of the comparison instruction preceding it.
Of course, this is a very simple example -- and the cmp/setl combination probably introduces dependencies that prevent the processor from executing them in parallel, or may even cost you a few cycles.
Still, on a modern processor, not all comparisons are turned into branch instructions.

Whoever wrote that page is not competent as a programmer. First,
comparisons don't necessarily imply a branch; it depends on what you
do with them. And whether that implies a branch or not depends on the
processor and the compiler. An if generally requires a branch, but
even then, a good optimizer can sometimes avoid it. A while or a
for will generally require a branch, unless the compiler can unroll
the loop, but that branch is highly predictable, so even when branch
prediction is an issue, it may not matter.
More generally, if you worry about anything at this level when writing
your code, you're wasting your time, and making maintenance far more
difficult. The only time you should be concerned is once you have a
performance problem, and the profiler shows that this is the spot where
you're losing performance. At that point, you can experiment with
several different ways of writing the code, to determine which one will
result in faster code for your combination of compiler and hardware.
(Change the compiler or the hardware, and it might not be the same one.)

Related

Optimizing std::clamp with favor for in-range input: is there a point for keeping cmov instead of a branch?

C++17 std::clamp is a template function that returns the input value if it is not less than the given minimum and not greater than the given maximum; otherwise it returns the minimum or the maximum, respectively.
The goal is to optimize it, assuming the following:
The type parameter is 32 bit or 64 bit integer
The input value is way more likely to be already in range than out of range, so likely to be returned
The input value is likely to be computed shortly before, and the minimum and maximum are likely to be known in advance
Let's ignore references, which may complicate optimization, but in practice are not useful for integer values
For both the standard implementation and a naive implementation, the assembly generated by gcc and clang does not seem to favor the in-range assumption above. Both of these:
#include <algorithm>

int clamp1(int v, int minv, int maxv)
{
    return std::clamp(v, minv, maxv);
}

int clamp2(int v, int minv, int maxv)
{
    if (maxv < v)
    {
        return maxv;
    }
    if (v < minv)
    {
        return minv;
    }
    return v;
}
compile into two cmov instructions (https://godbolt.org/z/oedd9Yfro):
mov eax, esi
cmp edi, esi
cmovge eax, edi
cmp edx, edi
cmovl eax, edi
Trying to tell the compilers to favor the in-range case with __builtin_expect (gcc seems to ignore C++20 [[likely]]):
int clamp3(int v, int minv, int maxv)
{
    if (__builtin_expect(maxv < v, 0))
    {
        return maxv;
    }
    if (__builtin_expect(v < minv, 0))
    {
        return minv;
    }
    return v;
}
The results for gcc and clang are now different (https://godbolt.org/z/s4vedo1br).
gcc still fully avoids branches using two cmov. clang has one branch, instead of the expected two (annotation mine):
clamp3(int, int, int):
mov eax, edx ; result = maxv
cmp edi, esi ; v, minv
cmovge esi, edi ; if (v >= minv) minv = v
cmp edx, edi ; maxv, v
jl .LBB0_2 ; if (maxv < v) goto LBB0_2
mov eax, esi ; result = minv (after it was updated from v if no clamping)
.LBB0_2:
ret
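For completeness, the C++20 attribute form mentioned above would look roughly like this (a sketch; as noted, gcc appeared to ignore these hints here, so the output may match clamp2 rather than clamp3):
int clamp4(int v, int minv, int maxv)
{
    if (maxv < v) [[unlikely]]
    {
        return maxv;
    }
    if (v < minv) [[unlikely]]
    {
        return minv;
    }
    return v;
}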
Questions:
Are there significant disadvantages to using conditional jumps that are expected to go the same way each time, such that gcc avoids them?
Is the clang version with one conditional jump better than it would have been with two jumps?
Not using cmov for predictable branches is suggested in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (June 2021 version, page 3-5):
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV
instructions to eliminate unpredictable conditional branches where possible. Do not do this for
predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches
(because using these instructions will incur execution overhead due to the requirement for executing
both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV
trades off control flow dependence for data dependence and restricts the capability of the out-of-order
engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch
prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if
the increase in computation time is less than the expected cost of a mispredicted branch.

What is the compiler doing here that allows comparison of many values to be done with few actual comparisons?

My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
    Entry1,
    Entry2,
    ... // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
    if (
        e == MyEnum::Entry1 ||
        e == MyEnum::Entry3 ||
        e == MyEnum::Entry8 ||
        e == MyEnum::Entry14 ||
        e == MyEnum::Entry15 ||
        e == MyEnum::Entry18 ||
        e == MyEnum::Entry21 ||
        e == MyEnum::Entry22 ||
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5#MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5#MyFunction
mov al, 1
ret 0
$LN5#MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the generated assembly does not actually grow; only this magic number (20078725) changes. That number depends on which enum values the function compares against. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
    if (
        e == MyEnum::Entry1 |
        e == MyEnum::Entry3 |
        e == MyEnum::Entry8 |
        e == MyEnum::Entry14 |
        e == MyEnum::Entry15 |
        e == MyEnum::Entry18 |
        e == MyEnum::Entry21 |
        e == MyEnum::Entry22 |
        e == MyEnum::Entry25)
    {
        return true;
    }
    return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 does a rough range check - if the value is above 24 (or negative, which the unsigned comparison treats as a huge value), it bails out; this is important because the instructions that follow, which operate on the magic number, only handle bit positions in the [0, 31] range.
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...
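To see where the magic number comes from, one can rebuild it from the enum values (my own illustration, assuming the enumerators take the default values Entry1 = 0, Entry2 = 1, and so on):
constexpr unsigned mask =
    (1u << 0)  | (1u << 2)  | (1u << 7)  |   // Entry1, Entry3, Entry8
    (1u << 13) | (1u << 14) | (1u << 17) |   // Entry14, Entry15, Entry18
    (1u << 20) | (1u << 21) | (1u << 24);    // Entry21, Entry22, Entry25
static_assert(mask == 20078725, "matches the constant in the generated code");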
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts it by the appropriate amount (to get the relevant bit into the lowest position) and keeps just that bit by ANDing it with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
    return val & (1<<bit);
}
bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return bit_test(20078725, e);
}
(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; the val & (1<<bit) expression has no direct correspondence to the original assembly.)
As for the if-based code, it's quite bad - it uses a lot of CMOV and ORs the results together, which is both longer code, and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
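A sketch of that switch variant (mine, using the same enum values as MyFunction):
bool MyFunction3(MyEnum e)
{
    switch (e)
    {
    case MyEnum::Entry1:  case MyEnum::Entry3:  case MyEnum::Entry8:
    case MyEnum::Entry14: case MyEnum::Entry15: case MyEnum::Entry18:
    case MyEnum::Entry21: case MyEnum::Entry22: case MyEnum::Entry25:
        return true;
    default:
        return false;
    }
}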
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the clang output above results in pretty much the same code (equal to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise-or version is the same as the logical-or version (= good) in both clang and gcc;
in general, gcc does essentially the same thing as clang except for the switch case;
switch results are varied:
clang does best, by generating the exact same code;
both gcc and MSVC generate jump-table based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying for it with more setup code; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch; see x86 asm casetable implementation.
Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check, because the bitmap is fixed-width and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where there are some values that should return true. That's what the cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
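A sketch of that in-memory variant (my own illustration, not something the compilers above emit): one bit per possible input value, stored in a dword array in read-only data.
#include <cstdint>

static const uint32_t table[8] = {};   // 256 bits; set bit v of table[v / 32]
                                       // for every value that should return true

bool in_set(unsigned v)
{
    if (v >= 256) return false;              // range check, as before
    return (table[v / 32] >> (v % 32)) & 1;  // index a dword, test one bit
}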
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
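A rough illustration of that SIMD idea with SSE2 intrinsics (my own sketch; clang derives such code automatically, and the exact instructions it emits will differ):
#include <emmintrin.h>

// Compare e against four candidate values at once; true if any lane matches.
bool any_of_four(int e)
{
    __m128i v  = _mm_set1_epi32(e);             // broadcast e into all four lanes
    __m128i k  = _mm_setr_epi32(0, 2, 7, 13);   // four of the "true" values
    __m128i eq = _mm_cmpeq_epi32(v, k);         // per-lane equality compare
    return _mm_movemask_epi8(eq) != 0;          // did any lane match?
}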
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, and aggressive optimizations are common.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one, so no cache-miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the count to ecx). https://agner.org/optimize. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want - then we can maybe get branchless with an AND or CMOV to make it always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
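A branchless C++ sketch along those lines (mine; whether a given compiler actually keeps it branch-free should be checked on godbolt):
bool MyFunctionBranchless(unsigned e)
{
    // Mask the shift count as described above, and AND with the range check
    // so any out-of-range e yields false without a conditional jump.
    return ((20078725u >> (e & 31)) & 1) & (e <= 24);
}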
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense, to read the CF result of bt.

How to measure the number of increments per second

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
using namespace std;
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right, because it defines too many constraints that limit the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.
e.g., this code:
void foo(int N) {
    for (volatile int i = 0; i < N; ++i)
    {
    }
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # Jump back to L3 if so.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile(""), but MSVC inline assembly has always been different and is disabled completely for 64-bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
sub ecx, 1
jnz _testfun
ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
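For GCC and Clang, the asm volatile("") trick mentioned above might look like this (a sketch of mine, not from the original answer): an empty asm statement that "uses" i keeps the optimizer from collapsing or removing the loop, while emitting no extra instructions inside it.
void count_up(long long N)
{
    for (long long i = 0; i < N; ++i) {
        asm volatile("" : : "r"(i));   // pretend to use i; compiles to nothing
    }
}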
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about what the variable might be in the future.
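Putting that suggestion together with the original timing code gives roughly the following (my sketch; note that the measured rate includes the loads and stores that volatile forces, as the earlier answer's disassembly shows):
#include <chrono>
#include <cstdio>

int main()
{
    const long long N = 1000000000;  // 10^9, as in the question
    auto start = std::chrono::steady_clock::now();
    for (volatile long long i = 0; i < N; ++i) {
        // empty body: volatile keeps the increment from being optimized away
    }
    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("%lld increments in %.3f s (%.3g per second)\n",
                N, seconds, N / seconds);
}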

Optimisation of a write out of a loop

Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]
class X
{
public:
    void process(float * __restrict d, int size)
    {
        for (int i = 0; i < size; ++i)
        {
            d[i] = v * c + d[i];
            v = d[i];
        }
    }

    void processFaster(float * __restrict d, int size)
    {
        float lv = v;
        for (int i = 0; i < size; ++i)
        {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;
    }

    float c{0.0f};
    float v{0.0f};
};
With gcc -O3 the first one has an inner loop that looks like:
.L3:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rax, rdi
movss DWORD PTR x[rip+4], xmm0 ;<<< the extra store
jne .L3
.L1:
rep ret
The second here:
.L8:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rdi, rax
jne .L8
.L7:
movss DWORD PTR x[rip+4], xmm0
ret
See https://godbolt.org/g/a9nCP2 for the complete code.
Why does the compiler not perform the lv optimisation here?
I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.
Am I right to make that assumption?
I think the observable behaviour should be the same in both cases.
This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (a missed optimization) that the restrict information appears to be lost after inlining.
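A hypothetical sketch of the caller change mentioned in the note at the top of the question (the name and signature of f_original are assumed here, not copied from the godbolt code):
void f_original(X &x, float * __restrict d, int size)
{
    // Marking the pointer __restrict in the caller as well is what reportedly
    // let GCC move the store to v out of the loop.
    x.process(d, size);
}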
The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.
The restrict keyword says there is no aliasing with anything else, in effect same as if the value had been local (and no local had any references to it).
In the second case there is no external visible effect of v so it doesn't need to store it.
In the first case there is a potential that something external might see it; the compiler doesn't know at this point that there are no threads that could observe it, but it does know that it doesn't have to re-read it, since it's neither atomic nor volatile. And the change to d[], another externally visible object, makes the store necessary.
If the compiler writers reason "neither d nor v is volatile or atomic, so we can do it all under the 'as-if' rule", then the compiler has to be sure no one can touch v at all. I'm pretty sure this will come in one of the newer versions, as there is no synchronisation before the return, and that will be the case in 99+% of all cases anyway. Programmers will then have to put either volatile or atomic on variables that are changed from outside, which I think I could live with.

C++ For loop optimizations for a virtual machine

Context
My question is twofold (really two questions) but quite basic*. But first, I will show some relevant code for some context. For the TL;DR 'meat and potatoes', skip to the bottom for the actual questions.
*(I'm assuming answerers are aware of what is happening/how a virtual machine operates fundamentally before attempting to answer).
As mentioned, I am writing a (toy) VM, which executes a custom byte code instruction set.
(ellipses here only represent omission of some cases)
Here is a snippet of my code:
for (ip = 0; (ip < _PROGRAM_SIZE || !cstackempty); ip++) {
    if (breakPending) { break; }
    switch (_instr) {
    case INST::PUSH: {
        AssertAbort(wontoverflow(1), "Stack overflow (1 byte)");
        cmd_ "PUSH";
        push(_incbyte);
        printStack();
        break;
    }
    ...
    case INST::ADD: {
        AssertAbort(stackhas(2), "Can't pop stack to add 2 bytes. Stack does not contain 2 bytes");
        cmd_ "ADD";
        byte popped8_a = pop();
        byte popped8_b = pop();
        byte result = popped8_a + popped8_b;
        push(result);
        cmd_ " "; cmd_(byte)result;
        printStack();
        break;
    }
    case INST::ADD16: {
        AssertAbort(stackhas(4), "Can't pop stack to add 4 bytes. Stack does not contain 4 bytes");
        cmd_ "ADD16";
        u16 popped16_a = pop16();
        u16 popped16_b = pop16();
        u16 result = popped16_a + popped16_b;
        push16(result);
        cmd << " "; cmd << (u16)result;
        printStack();
        break;
    }
    ...
    }
}
Only because it's relevant, I will mention that _cstack is the call stack, hence the !cstackempty macro, which checks whether the call stack is empty before quitting (exiting the for loop) just because the last instruction has been executed (because that last instruction could well be part of a function, or even a return). Also, ip (instruction pointer) is simply an unsigned long long (u64), as is _PROGRAM_SIZE (size of program in bytes). _instr is a byte and is a reference to the current instruction (1 byte).
Meat and potatoes
Question 1: Since I'm initialising two new integers of variable size per block/case (segmented into blocks to avoid redeclaration errors and such), would declaring them above the for loop be helpful in terms of speed, assignment latency, program size etc?
Question 2: Would continue be faster than break in this case, and is there any faster way to execute such a conditional loop, such as some kind of goto-pointer-to-label like in this post, that is implementation agnostic, or somehow avoid the cost of either continue or break?
To summarize, my priorities are speed, then memory costs (speed, efficiency), then file size (of the VM).
Before answering the specific questions, a note: There isn't any CPU that executes C++ directly. So any question of this type of micro-optimization at the language level depends heavily on the compiler, software runtime environment and target hardware. It is entirely possible that one technique works better on the compiler you are using today, but worse on the one you use tomorrow. Similarly for hardware choices such as CPU architecture.
The only way to get a definitive answer of which is better is to benchmark it in a realistic situation, and often the only way to understand the benchmark results is to dive into the generated assembly. If this kind of optimization is important to you, consider learning a bit about the assembly language for your development architecture.
Given that, I'll pick a specific compiler (gcc) and a common architecture (x86) and answer in that context. The details will differ slightly for other choices, but I expect the broad strokes to be similar for any decent compiler and hardware combination.
Question 1
The place of declaration doesn't matter. The declaration itself, doesn't even really turn into code - it's only the definition and use that generate code.
For example, consider the two variants of a simple loop below (the external sink() method is just there to avoid optimizing away the assignment to a):
Declaration Inside Loop
int func(int* num) {
    for (unsigned int i=0; i<100; i++) {
        int a = *num + *num;
        sink(a);
        sink(a);
    }
}
Declaration Outside Loop
int func(int* num) {
    int a;
    for (unsigned int i=0; i<100; i++) {
        a = *num + *num;
        sink(a);
        sink(a);
    }
}
We can use the godbolt compiler explorer to easily check the assembly generated for the first and second variants. They are identical - here's the loop:
.L2:
mov ebp, DWORD PTR [r12]
add ebx, 1
add ebp, ebp
mov edi, ebp
call sink(int)
mov edi, ebp
call sink(int)
cmp ebx, 100
jne .L2
Basically the declaration doesn't produce any code - only the assignment does.
Question 2
Here it is key to note that at the hardware level, there aren't instructions like "break" or "continue". You really only have jumps, either conditional or not, which are basically gotos. Both break and continue will be translated to jumps. In your case, a break inside a switch (where the switch is the last statement in the loop body) and a continue inside the switch have exactly the same effect, so I expect them to be compiled identically - but let's check.
Let's use this test case:
int func(unsigned int num, int iters) {
    for (; iters > 0; iters--) {
        switch (num) {
        case 0:
            sinka();
            break;
        case 1:
            sinkb();
            break;
        case 2:
            sinkc();
            break;
        case 3:
            sinkd();
            break;
        case 4:
            sinkd();
            break;
        }
    }
}
It uses break to exit each case. Here's the godbolt output on gcc 4.4.7 for x86, ignoring the function prologue:
.L13:
cmp ebp, 4
ja .L3
jmp [QWORD PTR [r13+r12*8]] # indirect jump
.L9:
.quad .L4
.quad .L5
.quad .L6
.quad .L7
.quad .L8
.L4:
call sinka()
jmp .L3
.L5:
call sinkb()
jmp .L3
.L6:
call sinkc()
jmp .L3
.L7:
call sinkd()
jmp .L3
.L8:
call sinkd()
.L3:
sub ebx, 1
test ebx, ebx
jg .L13
Here, the compiler has chosen a jump table approach. The value of num is used to look up a jump address (the table is the series of .quad directives), and then an indirect jump is made to one of the labels .L4 through .L8. The breaks turn into jmp .L3, which executes the loop logic.
Note that a jump table isn't the only way to compile a switch - if I used 4 or fewer case statements, the compiler instead chose a series of branches.
Let's try the same example, but with each break replaced with a continue:
int func(unsigned int num, int iters) {
    for (; iters > 0; iters--) {
        switch (num) {
        case 0:
            sinka();
            continue;
            ... [16 lines omitted] ...
        }
    }
}
As you might have guessed by now, the results are identical - at least for this particular compiler and target. The continue statements and break statements imply the exact same control flow, so I'd expect this to be true for most decent compilers with optimization turned on.
For Question 2, a processor should be able to handle break reasonably well, since in the generated assembly it is effectively a branch that is always taken, so it shouldn't cause much trouble. This means there should be no pipeline flush for the reason stated, as the branch prediction unit should get this one right. I believe Question 1 was answered above.