Branch on ?: operator? - c++

For a typical modern compiler on modern hardware, will the ? : operator result in a branch that affects the instruction pipeline?
In other words which is faster, calling both cases to avoid a possible branch:
bool testVar = someValue(); // Used later.
purge(white);
purge(black);
or picking the one actually needed to be purged and only doing it with an operator ?::
bool testVar = someValue();
purge(testVar ? white : black);
I realize you have no idea how long purge() will take, but I'm just asking a general question here about whether I would ever want to call purge() twice to avoid a possible branch in the code.
I realize this is a very tiny optimization and may make no real difference, but would still like to know. I expect the ?: does not result in branching, but want to make sure my understanding is correct.

Depends on the platform. Specifically, it depends on the size of jump prediction table of the CPU and whether the CPU allows conditional operations (like on ARM).
CPUs with conditional operations will strongly favor the second case. CPUs with bigger jump prediction tables will favor the first case.
The real answer (like with any other performance questions): measure and compare. Sometimes the rest of the code throws a curve ball and it's usually impossible to predict effects of some changes.

The CMOV (Conditional MOVe) instruction has been part of the x86 instruction set since the Pentium Pro. It is rarely automatically generated by GCC because of compiler options commonly used and restrictions placed by the C language. A SETCC/CMOV sequence can be inserted by inline assembly in your C program. This should only be done is cases where the conditional variable is a randomly oscillating value in the inner loop (millions of executions) of a program. In non-oscillating cases and in cases of simple patterns of oscillation, modern processors can predict branches with a very high degree of accuracy. In 2007, Linus Torvalds suggested here to avoid use of CMOV in most situations.
Intel describes the conditional move in the Intel(R) Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual:
The CMOVcc instructions check the state of one or more of the status
flags in the EFLAGS register (CF, OF, PF, SF, and ZF) and perform a
move operation if the flags are in a specified state (or condition). A
condition code (cc) is associated with each instruction to indicate
the condition being tested for. If the condition is not satisfied, a
move is not performed and execution continues with the instruction
following the CMOVcc instruction.
These instructions can move a 16- or 32-bit value from memory to a
general-purpose register or from one general-purpose register to
another. Conditional moves of 8-bit register operands are not
supported.
The conditions for each CMOVcc mnemonic is given in the description
column of the above table. The terms “less” and “greater” are used for
comparisons of signed integers and the terms “above” and “below” are
used for unsigned integers.
Because a particular state of the status flags can sometimes be
interpreted in two ways, two mnemonics are defined for some opcodes.
For example, the CMOVA (conditional move if above) instruction and the
CMOVNBE (conditional move if not below or equal) instruction are
alternate mnemonics for the opcode 0F 47H.

I can't imagine the first method would ever be faster.
With the first method you may avoid a branch, but you replace it with a function call, which would usually involve a branch plus a lot more (unless it was inlined). Even if inlined, unless the functionality inside the purge() function was absolutely trivial it would almost certainly be slower.

Calling a function is at least as expensive as doing a logic test + jump (and yes, the ? : ternary operator would require a jump).

in the first case purge is called twice. In the second case purge is called once
Its hard to answer the question about branching because its so dependent on compilers and instruction set. For example on an ARM (which has conditional instruction execution) it might not branch. ON an x86 it almost certainly will

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

I made the experience (this is not the question but a statement), that avoiding non-constant local variables in favor of const variables or avoiding local variables at all, enables the c++ compiler to generate faster code.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? e.g. Compiler giving up on certain optimization levels, as soon as the code gets too complex in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that that value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs, you do have to be careful playing games like this.
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no other assignments to other pointers to the same type couldn't have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
Of course, if optimization is disabled, more statements are typically slower than using fewer large expressions, and store/reload latency bottlenecks are usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
(Legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD!=0 - FP math happens at higher precision, and rounding to float or double costs extra). Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that, -fno-float-store. But -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

How good is the Visual Studio compiler at branch-prediction for simple if-statements?

Here is some c++ pseudo-code as an example:
bool importantFlag = false;
for (SomeObject obj : arr) {
if (obj.someBool) {
importantFlag = true;
}
obj.doSomethingUnrelated();
}
Obviously, once the if-statement evaluates as true and runs the code inside, there is no reason to even perform the check again since the result will be the same either way. Is the compiler smart enough to recognize this or will it continue checking the if-statement with each loop iteration and possibly redundantly assign importantFlag to true again? This could potentially have a noticeable impact on performance if the number of loop iterations is large, and breaking out of the loop is not an option here.
I generally ignore these kinds of situations and just put my faith into the compiler, but it would be nice to know exactly how it handles these kinds of situations.
Branch-prediction is a run-time thing, done by the CPU not the compiler.
The relevant optimization here would be if-conversion to a very cheap branchless flag |= obj.someBool;.
Ahead-of-time C++ compilers make machine code for the CPU to run; they aren't interpreters. See also Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” and How to remove "noise" from GCC/clang assembly output? for more about looking at optimized compiler-generated asm.
I guess what you're suggesting could be making a 2nd version of the loop that doesn't look at the bool, and converting the if() into an if() goto to set the flag once and then run the other version of the loop from this point onward. That would likely not be worth it, since a single OR instruction is so cheap if other members of the same object are already getting accessed.
But it's a plausible optimization; however I don't think compilers would typically do it for you. You can of course do it manually, although you'd have to iterate manually instead of using a range-for, because you want to use the same iterator to start part-way through the range.
Branch likelihood estimation at compile time is a thing compilers do to figure out whether branchy or branchless code is appropriate, e.g. gcc optimization flag -O3 makes code slower than -O2 uses CMOV for a case that looks unpredictable, but when run on sorted data is actually very predictable. My answer there shows the asm real-world compilers make; note that they don't multi-version the loop, although that wouldn't be possible in that case if the compiler didn't know about the data being sorted.
Also to guess which side of a branch is more likely, so they can lay out the fast path with fewer taken branches. That's what the C++20 [[likely]] / [[unlikely]] hints are for, BTW, not actually influencing run-time branch prediction. Except on some CPUs, indirectly via static prediction the first time a CPU sees a branch. Or a few ISAs, like PowerPC and MIPS, have "branch-likely" instructions with actual run-time hints for the CPU which compilers might or might not actually use even if available. See
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? - They influence branch layout, making the "likely" path a straight line (branches usually not-taken) for I-cache locality and contiguous fetch.
Is there a compiler hint for GCC to force branch prediction to always go a certain way?
If you expect to have a large data set you could just have two for loops, the first of them breaking when the importantFlag is set to true. It's hard to know specifically what optimizations the compiler will make since it's not well documented.
Peter Cordes has already given a great answer.
I'd also like to mention shortcircuiting
In this example
if( importantFlag || some_expensive_check() ) {
importantFlag = true;
}
Once important Flag is set to true, the expensive check will never be performed, since the || stops at the first true.

Is it good practice to construct long circuit statements?

Question Context: [C++] I want to know what is theoretically the fastest, and what the compiler will do. I don't want to hear about premature optimization is the root of all evil, etc.
I was writing some code like this:
bool b0 = ...;
bool b1 = ...;
if (b0 && b1)
{
...
}
But then I was thinking: the code, as-is, will compile into two TEST instructions, if compiled without optimizations. This means two branches. So I was thinking that it might be better to write:
if (b0 & b1)
Which will produce only one TEST instruction, if no optimization is done by the compiler. But then I feel that this is against my code-style. I usually write && and ||.
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast). Will the compiler automatically compile it like &, even if I have used a && in the code? And what is theoretically faster? Does the behavior change if I do this:
if (b0 && b1)
{ ... }
else if (b0)
{ ... }
else if (b1)
{ ... }
else
{ ... }
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast).
Most likely nothing more to increase the optimization.
As stated in my comments, you really can't optimize the evaluation any further than:
AND B0 WITH B1 (sets condition flags)
JUMP ZERO TO ...
Although, if you have a lot of simple boolean logic or data operations, some processors may conditionally execute them.
Will the compiler automatically compile it like &, even if I have used a && in the code?
And what is theoretically faster?
In most platforms, there is no difference in evaluation of A & B versus A && B.
In the final evaluation, either a compare or an AND instruction is executed, then a jump based on the status. Two instructions.
Most processors don't have Boolean registers. It's all numbers and bits.
Optimize By Boolean Logic
Your best option is to review the design and set up your algorithms to use Boolean algebra. You can than simplify the Boolean expressions.
Another option is to implement the code so that the compiler can generate conditional assembly instructions, if the platform supports them.
Optimize: Reduce jumps
Processors favor arithmetic and data transfers over jumps.
Many processors are always feeding an instruction pipeline. When it comes to a conditional branch instruction, the processor has to wait (suspend the instruction prefetching) until the condition status is determined. Then it can determine where the next instruction will be fetched.
If you can't remove the jumps, such as in a loop, make the ratio of data processing to jumping bigger in the data side. Search for "Loop Unrolling". Many compilers will perform this when optimization levels are increased.
Optimize: Data Cache
You may notice increased performance by organizing your data for best data cache usage.
For example, instead of 3 large arrays, use one array of a structure containing 3 elements. This allows the elements in use to be close to each other (and reduce the likelihood of accessing data outside of the cache).
Summary
The difference in evaluation of A && B versus A & B as conditional expressions is known as a micro-optimization. You will achieve improved performance by using Boolean algebra to reduce the quantity of conditional expressions. Jumps, or changes in execution path, slow down instruction execution. Fetching data outside of the data cache also slows down execution. You will most likely get better performance by redesigning your code and helping the compiler to reduce the branches and more effective use of the data cache.
If you care about what's fastest, why do you care what the compiler will do without optimisation?
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
This question seems to assume that the compiler transforms C++ code into more C++ code. It doesn't. It transforms your code into machine instructions (including the assembler as part of the compiler for argument's sake). You should not assume there is a one-to-one mapping from a C++ operator like && or & to a particular instruction.
With optimisation the compiler will do whatever it thinks will be faster. If a single instruction would be faster the compiler will generate a single instruction for if (b0 && b1), you don't need to bugger up your code with micro-optimisations to help it make such a simple transformation.
The compiler knows the instruction set it's using, it knows the context the condition is in and whether it can be removed entirely as dead code, or moved elsewhere to help the pipeline, or simplified by constant propagation, etc. etc.
And if you really care about what's fastest, why would you compute b1 until you know it's actually needed? If obtaining the value of b1 has no side effects the compiler could even transform your code to:
bool b0 = ...;
if (b0)
{
bool b1 = ...;
if (b1)
{
Does that mean two if conditions are faster than a &?! Of course not.
In other words, the whole premise of the question is flawed. Do not compromise the readability and simplicity of your code in the misguided pursuit of the "theoretically fastest" micro-optimisation. Spend your time improving the algorithms and data structures used not trying to second guess which instructions the compiler will generate.

Is comparing to zero faster than comparing to any other number?

Is
if(!test)
faster than
if(test==-1)
I can produce assembly but there is too much assembly produced and I can never locate the particulars I'm after. I was hoping someone just knows the answer. I would guess they are the same unless most CPU architectures have some sort of "compare to zero" short cut.
thanks for any help.
Typically, yes. In typical processors testing against zero, or testing sign (negative/positive) are simple condition code checks. This means that instructions can be re-ordered to omit a test instruction. In pseudo assembly, consider this:
Loop:
LOADCC r1, test // load test into register 1, and set condition codes
BCZS Loop // If zero was set, go to Loop
Now consider testing against 1:
Loop:
LOAD r1, test // load test into register 1
SUBT r1, 1 // Subtract Test instruction, with destination suppressed
BCNE Loop // If not equal to 1, go to Loop
Now for the usual pre-optimization disclaimer: Is your program too slow? Don't optimize, profile it.
It depends.
Of course it's going to depend, not all architectures are equal, not all µarchs are equal, even compilers aren't equal but I'll assume they compile this in a reasonable way.
Let's say the platform is 32bit x86, the assembly might look something like
test eax, eax
jnz skip
Vs:
cmp eax, -1
jnz skip
So what's the difference? Not much. The first snippet takes a byte less. The second snippet might be implemented with an inc to make it shorter, but that would make it destructive so it doesn't always apply, and anyway, it's probably slower (but again it depends).
Take any modern Intel CPU. They do "macro fusion", which means they take a comparison and a branch (subject to some limitations), and fuse them. The comparison becomes essentially free in most cases. The same goes for test. Not inc though, but the inc trick only really applied in the first place because we just happened to compare to -1.
Apart from any "weird effects" (due to changed alignment and whatnot), there should be absolutely no difference on that platform. Not even a small difference.
Even if you got lucky and got the test for free as a result of a previous arithmetic instruction, it still wouldn't be any better.
It'll be different on other platforms, of course.
On x86 there won't be any noticeably difference, unless you are doing some math at the same time (e.g. while(--x) the result of --x will automatically set the condition code, where while(x) ... will necessitate some sort of test on the value in x before we know if it's zero or not.
Many other processors do have a "automatic updates of the condition codes on LOAD or MOVE instructions", which means that checking for "postive", "negative" and "zero" is "free" with every movement of data. Of course, you pay for that by not being able to backward propagate the compare instruction from the branch instruction, so if you have a comparison, the very next instruction MUST be a conditional branch - where an extra instruction between these would possibly help with alleviating any delay in the "result" from such an instruction.
In general, these sort of micro-optimisations are best left to compilers, rather than the user - the compiler will quite often convert for(i = 0; i < 1000; i++) into for(i = 1000-1; i >= 0; i--) if it thinks that makes sense [and the order of the loop isn't important in the compiler's view]. Trying to be clever with these sort of things tend to make the code unreadable, and performance can suffer badly on other systems (because when you start tweaking "natural" code to "unnatural", the compiler tends to think that you really meant what you wrote, and not optimise it the same way as the "natural" version).

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to concider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instruction. This means I have to choose an operation to wouldn't be performed only once due to compiler optimization (we can't use -o0 flag as school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.