I've heard this often enough to start questioning it: many people say that in an if-else statement, one should put the condition that is most likely to be true first. So, if the condition is likely to be false most of the time, put !condition in the if statement, otherwise use condition. A contrived illustration of what I mean:
if (likely) {
// do this
} else {
// do that
}
if (!unlikely) {
// do this
} else {
// do that
}
Some people say that this is more efficient - perhaps due to branch prediction or some other optimization, I've never actually enquired when the topic has been broached - but as far as I can tell there will always be one test, and both paths will result in a jump.
So my question is - is there a convincing reason (where a "convincing reason" may be a tiny efficiency gain) why the condition that is most likely to be true should come first in an if-else statement?
There are two reasons why the order may matter:
the branches of an if / else if / else if / else statement have different probabilities due to a known distribution of the data
Example: sorting apples, where most are good yellow ones (90%), but some are orange (5%) and some are of other colors (5%).
if (apple.isYellow) {...}
else if (apple.isOrange) {....}
else {...}
vs.
if (!apple.isYellow && !apple.isOrange) {...}
else if (apple.isOrange ) {....}
else {...}
In the first sample, 90% of apples are checked with just one if check and 10% will hit two, but in the second sample only 5% hit one check and 95% hit two.
So if you know that there is a significant difference between the chances of each branch being taken, it may be useful to move the most likely one up to be the first condition.
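To put numbers on it: with the 90/5/5 split above, the first version evaluates about 0.90×1 + 0.10×2 = 1.1 conditions per apple on average, while the second evaluates about 0.05×1 + 0.95×2 = 1.95, nearly twice as many.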
Note that your example with a single if makes no difference at that level.
low-level CPU optimizations that may favor one of the branches (though this is more about the incoming data consistently hitting the same branch of the condition).
Earlier/simpler CPUs may need to flush the instruction decoding pipeline when a conditional jump is taken, compared to the case where the code executes sequentially. This may be the reason for such a suggestion.
Example (fake assembly):
IF R1 > R2 JUMP ElseBranch
ADD R2, R3 -> R4 // CPU decodes this command WHILE executing previous IF
....
JUMP EndCondition
ElseBranch:
ADD 4, R3 -> R4 // CPU will have to decode this command AFTER
// and drop results of parsing ADD R2, R3 -> R4
....
EndCondition:
....
Modern CPUs should not have this problem, as they can decode instructions for both branches. They even have branch prediction logic for conditions: if a condition mostly resolves one way, the CPU will assume it will resolve that way again and start executing code in that branch before the check is finished. To my knowledge it does not matter on current CPUs whether it is the first or the alternative branch of the condition. Check out Why is it faster to process a sorted array than an unsorted array? for good info on that.
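As a rough illustration of that last point (a sketch in the spirit of the linked question, not code from the original answer), the data-dependent branch below is cheap when the data is sorted, because the predictor sees long runs of the same outcome, and much more expensive when the data is random:
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>
int main() {
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() % 256);
    // std::sort(data.begin(), data.end()); // uncomment: same work, far fewer mispredictions
    long long sum = 0;
    for (int x : data)
        if (x >= 128)      // taken ~50% of the time on random data, predictably on sorted data
            sum += x;
    std::printf("%lld\n", sum);
}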
Related
I have a bottleneck (about 20% of CPU time) in my code which is in the following if statement:
if (a == 0) { // here
...
}
where a is a uint8_t, so a number from 0 to 255.
Are there any low level optimizations to make it faster?
I thought about using bitwise NOR (~(a| 0)), but that would work only if a was a 1-bit, right?
Just in case: I don't care about code readability in this particular case.
Unless your compiler is garbage, there is nothing you can do to speed up integer comparison.
However, it is possible that the bottleneck you observe is not really the comparison itself, but rather the result of unlucky branch prediction.
There are two ways of getting around this:
If "to branch or not to branch" follows a pattern, move this last second decision further up in your program logic where you can use the pattern, just don't branch in your hot function. This might require serious thinking. A hacky way to find out whether you have patterns: Print 1 if you branch and 0 else for enough calls, Zip is up and see whether the resulting archive gets much smaller (in bits) than the number of values you printed. (Of course there are also smart formulas for that if you like it more theoretical.)
If you choose one branch over the other most of the time, you can tell the compiler which branch is the likely one. With gcc, check out __builtin_expect (a minimal usage sketch follows below); for other compilers, read the manual.
Important for both solutions: you will need to measure whether that actually helped. Especially the second one will not magically be better; it might even make things much worse.
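For the second option, here is a minimal sketch of what that can look like with GCC/Clang; the macro name and the two functions are made up for illustration:
#include <cstdio>
// Hint that the wrapped condition is almost always false.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
void handle_rare_error(int v) { std::printf("rare: %d\n", v); }   // cold path (hypothetical)
void process_common(int v)    { std::printf("common: %d\n", v); } // hot path (hypothetical)
void handle(int value) {
    if (UNLIKELY(value < 0)) {
        handle_rare_error(value);   // compiler treats this as the unlikely branch
    } else {
        process_common(value);
    }
}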
I am running a while loop in 4 threads; in the loop I am evaluating a function and incrementing a counter.
while(1) {
int fitness = EnergyFunction::evaluate(sequence);
mutex.lock();
counter++;
mutex.unlock();
}
When I run this loop in 4 threads, as I said, I get ~ 20 000 000 evaluations per second.
while(1) {
if (dist(mt) == 0) {
sequence[distDim(mt)] = -1;
} else {
sequence[distDim(mt)] = 1;
}
int fitness = EnergyFunction::evaluate(sequence);
mainMTX.lock();
overallGeneration++;
mainMTX.unlock();
}
If I add some random mutation to the sequence, I get ~ 13 000 000 evaluations per second.
while(1) {
if (dist(mt) == 0) {
sequence[distDim(mt)] = -1;
} else {
sequence[distDim(mt)] = 1;
}
int fitness = EnergyFunction::evaluate(sequence);
mainMTX.lock();
if(fitness < overallFitness)
overallFitness = fitness;
overallGeneration++;
mainMTX.unlock();
}
But when I add a simple if statement that checks whether the new fitness is smaller than the old fitness, and if so replaces the old fitness with the new one, the performance loss is massive! Now I get ~ 20 000 evaluations per second. If I remove the random mutation part, I also get ~ 20 000 evaluations per second.
Variable overallFitness is declared as
extern int overallFitness;
I am having trouble figuring out what causes such a big performance loss. Is comparing two ints such a time-consuming operation?
Also, I don't believe it is related to mutex locking.
UPDATE
This performance loss was not because of branch prediction; it was because in the faster versions the compiler simply ignored the call int fitness = EnergyFunction::evaluate(sequence);.
Now I have added volatile and the compiler doesn't ignore the call anymore.
Also, thank you for pointing out branch misprediction and atomic<int>; I didn't know about them!
Because of atomic I also removed the mutex part, so the final code looks like this:
while(1) {
sequence[distDim(mt)] = lookup_Table[dist(mt)];
fitness = EnergyFunction::evaluate(sequence);
if(fitness < overallFitness)
overallFitness = fitness;
++overallGeneration;
}
Now I am getting ~ 25 000 evaluations per second.
You need to run a profiler to get to the bottom of this. On Linux, use perf.
My guess is that EnergyFunction::evaluate() is being entirely optimized away, because in the first examples you don't use the result, so the compiler can discard the whole thing. You can try writing the return value to a volatile variable, which should force the compiler or linker not to optimize the call away. A 1000x difference is definitely not attributable to a simple comparison.
There is actually an atomic instruction to increase an int by 1. So a smart compiler may be able to entirely remove the mutex, although I'd be surprised if it did. You can test this by looking at the assembly, or by removing the mutex, changing the type of overallGeneration to atomic<int> and checking how fast it still is. This optimization is no longer possible with your last, slow example.
Also, if the compiler can see that evaluate does nothing to the global state and the result isn't used, then it can skip the entire call to evaluate. You can find out whether that's the case by looking at the assembly, or by removing the call to EnergyFunction::evaluate(sequence) and looking at the timing - if it doesn't speed up, the function wasn't called in the first place. This optimization is no longer possible with your last, slow example. You should be able to stop the compiler from skipping EnergyFunction::evaluate(sequence) by defining the function in a different object file (another cpp or a library) and disabling link time optimization.
There are other effects here that also create a performance difference, but I can't see any other effects that can explain a difference of factor 1000. A factor 1000 usually means the compiler cheated in the previous test and the change now prevents it from cheating.
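Putting those two ideas together, a minimal sketch might look like this (EnergyFunction is only declared here as a stand-in for the poster's class; the volatile sink and the loop scaffolding are illustrative assumptions):
#include <atomic>
#include <vector>
// Stand-in declaration for the poster's function; it is defined elsewhere in their code.
struct EnergyFunction { static int evaluate(const std::vector<int>& sequence); };
volatile int fitnessSink;                  // writing the result here keeps the call alive
std::atomic<int> overallGeneration{0};     // lock-free counter, replaces mutex + int
void worker(std::vector<int>& sequence) {
    while (true) {
        int fitness = EnergyFunction::evaluate(sequence);
        fitnessSink = fitness;                                      // result is now "used"
        overallGeneration.fetch_add(1, std::memory_order_relaxed);  // atomic increment
    }
}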
I am not sure that my answer explains such a dramatic performance drop, but the following definitely may have an impact on it.
In the first case you added branches to the non-critical area:
if (dist(mt) == 0) {
sequence[distDim(mt)] = -1;
} else {
sequence[distDim(mt)] = 1;
}
In this case the CPU (at least IA) will perform branch prediction, and in case of a branch misprediction there is a performance penalty - this is a known fact.
Now regarding the second addition, you added a branch to the critical area:
mainMTX.lock();
if(fitness < overallFitness)
overallFitness = fitness;
overallGeneration++;
mainMTX.unlock();
This in turn, in addition to the misprediction penalty, increased the amount of code which is executed in that area, and thus the probability that other threads will have to wait for mainMTX.unlock();.
NOTE
Please make sure that all the global/shared resources are defined as volatile. Otherwise the compiler may optimize them out (which might explain such a high number of evaluations at the very beginning).
In the case of overallFitness, it probably won't be optimized out because it is declared as extern, but overallGeneration may be. If this is the case, it may explain the performance drop after adding the "real" memory access in the critical area.
NOTE2
I am still not sure that the explanation I provided can account for such a significant performance drop. So I believe there might be some implementation details in the code which you didn't post (like volatile, for example).
EDIT
As Peter (@Peter) and Mark Lakata (@MarkLakata) stated in their separate answers, and I tend to agree with them, most likely the reason for the performance drop is that in the first case fitness was never used, so the compiler just optimized that variable out together with the function call, while in the second case fitness was used, so the compiler didn't. Good catch, Peter and Mark! I just missed that point.
I realize this is not strictly an answer to the question, but rather an alternative approach to the problem as presented.
Is overallGeneration used while the code is running? That is, is it perhaps used to determine when to stop the computation? If it is not, you could forgo synchronizing a global counter, keep a counter per thread, and once the computation is done sum up all the per-thread counters into a grand total. Similarly for overallFitness, you could keep track of the best fitness per thread and pick the best of the four results (the smallest, given your fitness < overallFitness check) once the computation is over.
Having no thread synchronization at all would get you 100% CPU usage.
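A sketch of that idea with C++11 threads; the struct, the function names and the fixed iteration count are illustrative, and EnergyFunction::evaluate(sequence) from the original code would plug in where indicated:
#include <algorithm>
#include <functional>
#include <limits>
#include <thread>
#include <vector>
struct ThreadResult {
    long long generations = 0;
    int bestFitness = std::numeric_limits<int>::max();
};
void run(ThreadResult& r, long long iterations) {
    for (long long i = 0; i < iterations; ++i) {
        int fitness = 0;                    // placeholder for EnergyFunction::evaluate(sequence)
        r.bestFitness = std::min(r.bestFitness, fitness);
        ++r.generations;                    // private counter: no locking needed
    }
}
int main() {
    const int nThreads = 4;
    std::vector<ThreadResult> results(nThreads);
    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i)
        threads.emplace_back(run, std::ref(results[i]), 1000000LL);
    for (auto& t : threads) t.join();
    long long overallGeneration = 0;
    int overallFitness = std::numeric_limits<int>::max();
    for (const auto& r : results) {         // combine the per-thread results once at the end
        overallGeneration += r.generations;
        overallFitness = std::min(overallFitness, r.bestFitness);
    }
}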
Is
if(!test)
faster than
if(test==-1)
I can produce the assembly, but there is too much of it and I can never locate the particulars I'm after. I was hoping someone just knows the answer. I would guess they are the same unless most CPU architectures have some sort of "compare to zero" shortcut.
Thanks for any help.
Typically, yes. On typical processors, testing against zero or testing the sign (negative/positive) is a simple condition-code check. This means that instructions can be reordered to omit a test instruction. In pseudo assembly, consider this:
Loop:
LOADCC r1, test // load test into register 1, and set condition codes
BCZS Loop // If zero was set, go to Loop
Now consider testing against 1:
Loop:
LOAD r1, test // load test into register 1
SUBT r1, 1 // Subtract Test instruction, with destination suppressed
BCNE Loop // If not equal to 1, go to Loop
Now for the usual pre-optimization disclaimer: Is your program too slow? Don't optimize, profile it.
It depends.
Of course it's going to depend: not all architectures are equal, not all µarchs are equal, even compilers aren't equal, but I'll assume they compile this in a reasonable way.
Let's say the platform is 32-bit x86; the assembly might look something like
test eax, eax
jnz skip
Vs:
cmp eax, -1
jnz skip
So what's the difference? Not much. The first snippet takes a byte less. The second snippet might be implemented with an inc to make it shorter, but that would make it destructive so it doesn't always apply, and anyway, it's probably slower (but again it depends).
Take any modern Intel CPU. They do "macro fusion", which means they take a comparison and a branch (subject to some limitations), and fuse them. The comparison becomes essentially free in most cases. The same goes for test. Not inc though, but the inc trick only really applied in the first place because we just happened to compare to -1.
Apart from any "weird effects" (due to changed alignment and whatnot), there should be absolutely no difference on that platform. Not even a small difference.
Even if you got lucky and got the test for free as a result of a previous arithmetic instruction, it still wouldn't be any better.
It'll be different on other platforms, of course.
On x86 there won't be any noticeable difference, unless you are doing some math at the same time (e.g. in while(--x) the result of --x will automatically set the condition codes, whereas while(x) ... will necessitate some sort of test on the value in x before we know whether it's zero or not).
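A tiny sketch of that contrast (illustrative only; next_count and do_work are made-up names):
int  next_count(int x);    // hypothetical update function
void do_work();            // hypothetical loop body
void loop_free_test(int count) {
    while (--count) {      // the decrement produces the tested value itself,
        do_work();         // so its condition codes can drive the branch directly
    }
}
void loop_explicit_test(int count) {
    while (count) {        // count comes back from a function call below, so the
        do_work();         // compiler has to emit an explicit test against zero
        count = next_count(count);
    }
}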
Many other processors automatically update the condition codes on LOAD or MOVE instructions, which means that checking for "positive", "negative" and "zero" is "free" with every movement of data. Of course, you pay for that by not being able to move the compare instruction away from the branch instruction, so if you have a comparison, the very next instruction MUST be a conditional branch - whereas an extra instruction between the two could help hide any delay in the "result" of the compare.
In general, these sorts of micro-optimisations are best left to compilers rather than the user - the compiler will quite often convert for(i = 0; i < 1000; i++) into for(i = 1000-1; i >= 0; i--) if it thinks that makes sense [and the order of the loop isn't important in the compiler's view]. Trying to be clever with these sorts of things tends to make the code unreadable, and performance can suffer badly on other systems (because when you start tweaking "natural" code into "unnatural" code, the compiler tends to assume that you really meant what you wrote, and does not optimise it the same way as the "natural" version).
To check whether an int is within the range [1, ∞), I can use the following ways (I use #1 and #2 a lot):
if (a>=1)
if (a>0)
if (a>1 || a==1)
if (a==1 || a>1)
Is there any difference that I should pay attention to among the four versions?
Functionally there is no difference between the 4 ways you listed; this is mainly an issue of style. I would venture that #1 and #2 are the most common forms, though; if I saw #3 or #4 in a code review I would suggest a change.
Perf-wise, I suppose it is possible that some compiler out there optimizes one better than the other, but I really doubt it. At best it would be a micro-optimization, and nothing I would ever base my coding style on without direct profiler input.
I don't really see why you would use #3 or #4. Apart from being longer to type, they will generate more code. Since in an || condition the second check is skipped if the first is true, there shouldn't be a performance hit, except for version #4 if the value is often not 1 (of course, hardware with branch prediction will mostly negate that).
1. if (a>=1)
2. if (a>0)
3. if (a>1 || a==1)
4. if (a==1 || a>1)
On x86, options 1 and 2 produce a cmp instruction. This sets the processor's flags. The cmp is then followed by a conditional branch/jump based on those flags: a "greater or equal" jump (jge) for the first, and a "greater than" jump (jg) for the second.
Options 3 and 4 - in theory - require two cmps and two branches, but chances are the compiler will simply optimize them into the same code as #1.
You should generally choose whichever (a) follows the conventions in the code you are working on and (b) most clearly expresses the algorithm you are implementing.
There are times when you explicitly mean "if a is equal to one, or it has a value greater than 1", and in those times you should write if (a == 1 || a > 1). But if you are just checking that a has a positive, non-zero integer value, you should write if (a > 0), since that is what it says.
If you find that such a case is a part of a performance bottleneck, you should inspect the assembly instructions and adjust accordingly - e.g. if you find you have two cmps and branches, then write the code to use one compare and one branch.
Nope! They are all the same for an int. However, I would prefer to use if (a>0).
Is there any performance effect on the "Lines of code" at (C) when they run inside nested ifs?
if (condition_1)
{
/* Lines of code */ - (A)
if (condition_2)
{
/* Lines of code */ - (B)
if (condition_n)
{
/* Lines of code */ - (C)
}
}
}
Does that mean you can nest any number of if statements without affecting the execution time of the code enclosed in the innermost if statement?
Remember C and C++ are translated to their assembly equivalents. In most cases, this is likely to be via some form of compare (e.g. cmp) and some form of jmp instruction.
As such, whatever code is generated from (C) will still be the same. The if nesting has no bearing on the output. If the resultant code for (C) is add eax, 1, it will still be the same thing no matter how many ifs precede it.
The only performance penalty will be in the number of if statements you use and whether or not the resultant assembly (jxx) is expensive on your system. However, I doubt that repeated nested use of if is likely to be a performance bottleneck in your application. Usually, it is the time required to process data and/or the time required to get data.
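To illustrate (a sketch reusing the question's condition names and ignoring the intermediate blocks (A) and (B)): reaching block (C) through the nest costs the same as reaching it through one combined condition; what you pay for either way is evaluating the conditions, not the nesting itself.
if (condition_1) {
    if (condition_2) {
        if (condition_n) {
            /* Lines of code */ - (C)
        }
    }
}
// is equivalent, for block (C), to:
if (condition_1 && condition_2 && condition_n) {
    /* Lines of code */ - (C)
}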
You won't affect the execution time of the indicated code itself, but if evaluating your conditions is complex, or affected by other factors, then it could potentially lengthen the total time of execution.
The code will run as fast as if it were outside the ifs.
Just remember that evaluating an expression (in an if statement) is not "free" and will take a bit of time (more if the condition is more complex), so if your code is deeply nested it will take more time to reach it.