So, I'll quote my textbook (Computer Organization and Design) and then I'll ask my question:
Compiling if-then-else into Conditional branches
In the following code segment, f, g, h, i, and j are variables. If the five variables f through j correspond to the five registers $s0 through $s4, what is the compiled MIPS code for this C if statement?
if (i == j) f = g + h; else f = g - h;
Figure 2.9 is a flowchart of what the MIPS code should do. The first expression compares for equality, so it would seem that we would want the branch if registers are equal instruction (beq). In general, the code will be more efficient if we test for the opposite condition to branch over the code that performs the subsequent then part of the if (the label Else is defined below) and so we use the branch if registers are not equal instruction (bne):
bne $s3, $s4, Else # go to Else if i ≠ j
I've searched for a while but I couldn't find why bne would be more efficient than beq.
(I did however find that bne is sometimes recommended because it makes the code easier to understand, as the statements to be executed when the condition holds are right below the bne statement.)
So, if it would not be more efficient in general, it still could be more efficient in this particular exercise. I've thought about that, and I assumed that a branch instruction costs more time if taken, and that therefore we'd want to minimize the number of jumps (taken branches) needed. This means that when we expect the condition to hold, we should use bne, and when we expect the condition to fail, we should use beq.
Now if we test whether $s3 equals $s4, when we have no information whatsoever about the content of those registers it's not reasonable to assume that they're likely to be equal. On the contrary, it's more likely that they're not equal, which should favour using beq instead of bne.
So, to sum up: the textbook says bne is more efficient than beq; whether that's in general or just in this example is not clear, but in either case I don't understand why.
The efficiency is not from a direct comparison of the machine code for a bne versus a beq. The text describes optimizing the overall performance by coding to shorten the most common code path.
If you assume the values are more likely to be unequal, then only one instruction needs to be processed when using bne; if you use beq, you must also perform an additional jump in that case.
The shortest path is to fall through the branch: the test fails and no jump is taken.
from http://www.cs.gmu.edu/~setia/cs365-S02/class3.pdf:
Uncommon Case for branches:

beq $18, $19, L1      # branch taken in the common case
    <else handling>
    jmp <end>
L1: ...

replaced by

bne $18, $19, L2      # branch taken only in the uncommon case
    <success handling> # the common case falls straight through
L2: ...

Make the common case fast:
one instruction for most branches
Re-reading your question, I think the crux is this assumption:
"Now if we test whether $s3 equals $s4, when we have no information
whatsoever about the content of those registers, it's not reasonable
to assume that they're likely to be equal; on the contrary, it's more
likely that they're not equal, which should result in using beq
instead of bne."
This seems to be the confusion: we need to find some evidence or reason to determine which possibility is more likely, registers equal or unequal.
In this case we are examining an if-then-else. I make the assertion that we expect the if-test to pass; this is the psychology described by twalberg. The registers are unlikely to contain random values, as they contain data that the programmer is expecting: the result of previous operations.
I believe this has something to do with simplifying the compiler. When you have an equality test, you want to skip over the then-code when the condition is not met. I would assume this decision was made so that the compiler can use the exact same procedure whether or not there is an else clause. Here's what I think. By all means, if my reasoning is wrong, please tell me! :)
Starting with the given C code from the OP:
if (i == j) {
f = g + h;
}
else {
f = g - h;
}
This would translate to:
bne $s3, $s4, Else
add $s0, $s1, $s2
j Exit
Else: sub $s0, $s1, $s2
Exit: ...
If you changed the C code to just perform the addition if i == j:
if (i == j) {
f = g + h;
}
The compiler would come up with something like this:
bne $s3, $s4, Exit
add $s0, $s1, $s2
Exit: ...
Now, let's think about what the compiled code would look like if we tested the assertion using beq instead. I'm guessing it would look somewhat like this:
beq $s3, $s4, Equal
j Exit
Equal:
add $s0, $s1, $s2
Exit: ...
Which seems a lot less efficient than testing with bne.
EDIT
There's also the argument that checking for inequality is faster for a CPU than checking for equality.
The idea is that to compare two 32-bit numbers for equality, the CPU has to compare every single bit of the two numbers to make sure they are equal, whereas when testing for inequality, if the first bits of the two numbers differ, it doesn't have to test any of the other bits. (Note that this reasoning assumes a serial, bit-by-bit comparator; real hardware compares all the bits of a word in parallel, so in practice beq and bne take the same time.)
Another reason for this is that simple branch predictors usually assume that forward branches are not taken and that backwards branches are taken. This assumption gives better performance for simple loops.
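To see why that heuristic helps, here is a minimal sketch (hypothetical label and trip count, not from any of the sources above): a counted loop ends with a backward branch that is taken on every iteration except the last, so "backward means taken" is right almost every time.

li $t0, 10             # hypothetical trip count
Loop: ...              # loop body
addiu $t0, $t0, -1     # decrement the counter
bne $t0, $zero, Loop   # backward branch: taken 9 times out of 10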
Related
beq $s0, $s1, Lab1
add $s2, $s0, $s1
Lab1: sub $s1, $s1, $s0
When $s0 and $s1 are not equal, line 2 will be executed. Is line 3 going to be executed after line 2?
Or is line 3 executed only when the if condition is satisfied and control is sent to Lab1?
I hope I made my question clear. Thanks in advance.
Every instruction tells the processor what instruction comes next.
Let's take a closer look at the add instruction.
It computes the sum, places that into the target register, and also, in parallel, increments the program counter by 4 — definitively telling the processor that the next instruction is the next one in address order sequence.
A nop instruction is commonly said to do nothing (it even stands for no-operation), but it does increment the pc, so technically it does do something.
As a mentor to assembly language students, I find it useful to emphasize the program counter.
Experienced assembly language programmers often overlook the program counter precisely because it is so fundamental to the operation of the processor. So, let's talk about it for a moment.
Every instruction tells the processor what instruction comes next: each instruction updates the program counter, and, this update to the program counter is the mechanism by which each instruction tells the processor what is next. Each instruction has its own memory address; a given instruction executes because the program counter held its address. Sequential operation isn't magic — each instruction has to tell the processor what is next (i.e. has to update the pc).
Programs can also interact with the program counter: calling (jal) captures pc-next into the $ra register for the subroutine or function to use to return control of the processor to the caller.
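A minimal sketch of that (hypothetical label, not from the question):

jal DoWork    # pc-next is captured into $ra, then pc = address of DoWork
...           # execution resumes here after the return
DoWork: ...
jr $ra        # set pc back to the captured return address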
Only slightly oversimplified: within a subroutine or function, moving the program counter backwards forms a loop (the processor goes back to re-execute something it already did before), while moving it forwards skips something, as needed for if-then or if-then-else.
But each instruction has a well-defined way in which it modifies the program counter, whether that appears to be explicit or implicit.
Assembly                # Operation
beq $s0, $s1, Lab1      # skip 1 instruction on condition $s0 == $s1
add $s2, $s0, $s1       # not skipped if $s0 != $s1: next is pc+4
Lab1:                   # no machine code for this
sub $s1, $s1, $s0       # runs after beq when $s0 == $s1, or else after add
In assembly language, a label informs the assembler how to translate an instruction that uses it. Lab1 will be associated with a location, an address: here, the address of the sub instruction.
The beq is a conditional pc-relative branch. Thus, the value it wants is the delta between the pc-next (of itself) and the branch target, here Lab1. A delta of 0 would cause no instructions to be skipped, and a delta of 1 will cause 1 instruction to be skipped. Here we want to skip 1 instruction, so the delta will be 1. There is literally a 1 in the machine code for that beq.
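To make that concrete, here is the field breakdown the delta lands in (a sketch; the opcode and register numbers are the standard MIPS encodings, with $s0 = register 16 and $s1 = register 17):

beq $s0, $s1, Lab1   # fields: opcode=000100, rs=10000 (16), rt=10001 (17), imm=0000000000000001
                     # assembled word: 000100 10000 10001 0000000000000001 = 0x12110001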
After executing the beq, the processor will have been told (based on the specified eq condition) whether it will branch or not. It does this by adjusting the pc either to pc+4 or to pc+4 + 4*delta (the delta counts instructions, i.e. words).
Both of the other instructions are what we call sequential, so they inform the processor that pc-next is pc+4. Knowing that, you can follow the full sequencing whether the conditional branch is taken or not.
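Tracing it with hypothetical addresses (assuming the beq happens to sit at 0x00400000):

0x00400000  beq $s0, $s1, Lab1   # pc-next = 0x00400004; if taken, pc = 0x00400004 + 4*1 = 0x00400008
0x00400004  add $s2, $s0, $s1    # sequential: pc-next = 0x00400008
0x00400008  sub $s1, $s1, $s0    # Lab1: reached either way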
I've heard this often enough to actually question it - many people say that in an if-else statement, one should put the condition that is most likely to be true first. So, if condition is likely to be false most of the time, put !condition in the if statement; otherwise use condition. A contrived illustration of what I mean:
if (likely) {
// do this
} else {
// do that
}
if (!unlikely) {
// do this
} else {
// do that
}
Some people say that this is more efficient, perhaps due to branch prediction or some other optimization (I've never actually enquired when the topic has been broached), but as far as I can tell there will always be one test, and both paths will result in a jump.
So my question is - is there a convincing reason (where a "convincing reason" may be a tiny efficiency gain) why the condition that is most likely to be true should come first in an if-else statement?
There are 2 reasons why order may matter:
the multiple branches of an if/else-if/else-if/else statement have different probabilities due to a known distribution of the data
Sample - sorting apples: most are good yellow ones (90%), but some are orange (5%) and some are of other colors (5%).
if (apple.isYellow) {...}
else if (apple.isOrange) {....}
else {...}
vs.
if (!apple.isYellow && !apple.isOrange) {...}
else if (apple.isOrange ) {....}
else {...}
In the first sample, 90% of the apples are handled with just one if check and 10% will hit two, but in the second sample only 5% are handled with one check and 95% hit two.
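Put as a worked expectation, using that rough count of one check per tested condition: the first ordering costs about 0.90*1 + 0.10*2 = 1.10 checks per apple, while the second costs about 0.05*1 + 0.95*2 = 1.95 checks per apple, nearly twice as many.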
So if you know there is a significant difference between the chances of each branch being used, it may be useful to move the likeliest one up to be the first condition.
Note that your sample with a single if makes no difference at that level.
low-level CPU optimizations that may favor one of the branches (though this is more about the incoming data consistently hitting the same branch of the condition)
Earlier/simpler CPUs may need to flush the instruction-decoding pipeline when a conditional jump is taken, compared to the case where the code executes sequentially. This may be a reason for such a suggestion.
Sample (fake assembly):
IF R1 > R2 JUMP ElseBranch
ADD R2, R3 -> R4 // CPU decodes this command WHILE executing previous IF
....
JUMP EndCondition
ElseBranch:
ADD 4, R3 -> R4 // CPU will have to decode this command AFTER the jump
// and drop the results of parsing ADD R2, R3 -> R4
....
EndCondition:
....
Modern CPUs should not have this problem, as they decode commands for both branches. They even have branch prediction logic for conditions, so if a condition mostly resolves one way, the CPU will assume that the condition will be resolved that particular way and start executing code in that branch before the check is finished. To my knowledge, it does not matter on current CPUs whether it is the first or the alternative branch of the condition. Check out Why is it faster to process a sorted array than an unsorted array? for good info on that.
Is
if(!test)
faster than
if(test==-1)
I can produce assembly, but there is so much assembly produced that I can never locate the particulars I'm after. I was hoping someone just knows the answer. I would guess they are the same unless most CPU architectures have some sort of compare-to-zero shortcut.
Thanks for any help.
Typically, yes. In typical processors, testing against zero or testing the sign (negative/positive) is a simple condition-code check. This means that instructions can be re-ordered to omit a test instruction. In pseudo assembly, consider this:
Loop:
LOADCC r1, test // load test into register 1, and set condition codes
BCZS Loop // If zero was set, go to Loop
Now consider testing against a nonzero constant (here 1, though the question's -1 behaves the same):
Loop:
LOAD r1, test // load test into register 1
SUBT r1, 1 // Subtract Test instruction, with destination suppressed
BCNE Loop // If not equal to 1, go to Loop
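As a concrete MIPS sketch of that shortcut (hypothetical labels, value in $t0): a test against zero is a single instruction because of the hardwired $zero register, while a test against -1 first has to materialize the constant:

bne $t0, $zero, NonZero    # test against zero: one branch instruction
...
li $t1, -1                 # testing against -1 needs the constant in a register first
beq $t0, $t1, IsMinusOne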
Now for the usual pre-optimization disclaimer: Is your program too slow? Don't optimize, profile it.
It depends.
Of course it's going to depend: not all architectures are equal, not all µarchs are equal, and even compilers aren't equal, but I'll assume they compile this in a reasonable way.
Let's say the platform is 32bit x86, the assembly might look something like
test eax, eax
jnz skip
Vs:
cmp eax, -1
jnz skip
So what's the difference? Not much. The first snippet takes a byte less. The second snippet might be implemented with an inc to make it shorter, but that would make it destructive so it doesn't always apply, and anyway, it's probably slower (but again it depends).
Take any modern Intel CPU. They do "macro fusion", which means they take a comparison and a branch (subject to some limitations), and fuse them. The comparison becomes essentially free in most cases. The same goes for test. Not inc though, but the inc trick only really applied in the first place because we just happened to compare to -1.
Apart from any "weird effects" (due to changed alignment and whatnot), there should be absolutely no difference on that platform. Not even a small difference.
Even if you got lucky and got the test for free as a result of a previous arithmetic instruction, it still wouldn't be any better.
It'll be different on other platforms, of course.
On x86 there won't be any noticeable difference, unless you are doing some math at the same time (e.g. in while(--x), the result of --x automatically sets the condition codes, whereas while(x) necessitates some sort of test on the value in x before we know whether it's zero or not).
Many other processors do have automatic updates of the condition codes on LOAD or MOVE instructions, which means that checking for "positive", "negative" and "zero" is "free" with every movement of data. Of course, you pay for that by not being able to move the compare instruction away from the branch instruction: if you have a comparison, the very next instruction must be the conditional branch, whereas an extra instruction between the two could otherwise help hide any delay in the result of the compare.
In general, these sorts of micro-optimisations are best left to compilers rather than the user - the compiler will quite often convert for(i = 0; i < 1000; i++) into for(i = 1000-1; i >= 0; i--) if it thinks that makes sense [and if the order of the loop isn't important in the compiler's view], as sketched below. Trying to be clever with this sort of thing tends to make the code unreadable, and performance can suffer badly on other systems (because when you start tweaking "natural" code into "unnatural" code, the compiler tends to think that you really meant what you wrote, and not optimise it the same way as the "natural" version).
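Here is a sketch of why the reversed loop can be cheaper on a machine like MIPS (hypothetical labels; counting down lets the exit test compare against zero directly):

Loop1: ...                 # body, counting up in $t0
addiu $t0, $t0, 1
slti $t1, $t0, 1000        # explicit compare: is i < 1000?
bne $t1, $zero, Loop1

Loop2: ...                 # body, counting down in $t0
addiu $t0, $t0, -1
bgtz $t0, Loop2            # the compare against zero is built into the branch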
To check whether an int is within the range [1, ∞) or not, I can use the following ways (I use #1 and #2 a lot):
if (a>=1)
if (a>0)
if (a>1 || a==1)
if (a==1 || a>1)
Is there any difference that I should pay attention to among the four versions?
Functionally there is no difference between the 4 ways you listed. This is mainly an issue of style. I would venture that #1 and #2 are the most common forms, though; if I saw #3 or #4 in a code review, I would suggest a change.
Perf-wise, I suppose it is possible that some compiler out there optimizes one better than the others. But I really doubt it. At best it would be a micro-optimization and nothing I would ever base my coding style on without direct profiler input.
I don't really see why you would use 3 or 4. Apart from being longer to type, they will generate more code. Since in an or condition the second check is skipped if the first is true, there shouldn't be a performance hit, except for version 4 when the value is often not 1 (of course, hardware with branch prediction will mostly negate that).
1. if (a>=1)
2. if (a>0)
3. if (a>1 || a==1)
4. if (a==1 || a>1)
On x86, options 1 and 2 produce a cmp instruction. This sets the condition flags. The cmp is then followed by a conditional branch/jump based on those flags: for the first it branches on greater-or-equal (jge), for the second on greater-than (jg).
Options 3 and 4 in theory require two cmps and two branches, but chances are the compiler will simply optimize them to be the same as 1.
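For comparison, on MIPS (a sketch, with a in $t0 and a hypothetical label) a compiler would typically normalize all four forms of this integer test to the same single branch:

bgtz $t0, Then     # a > 0, which for integers is the same test as a >= 1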
You should generally choose whichever form (a) follows the conventions in the code you are working on and (b) most clearly expresses the algorithm you are implementing.
There are times when the algorithm explicitly is "if a is equal to one, or it has a value greater than 1", and in those times you should write if (a == 1 || a > 1). But if you are just checking that a has a positive, non-zero, integer value, you should write if (a > 0), since that is what that says.
If you find that such a case is a part of a performance bottleneck, you should inspect the assembly instructions and adjust accordingly - e.g. if you find you have two cmps and branches, then write the code to use one compare and one branch.
Nope! They are all the same for an int. However, I would prefer to use if (a>0).
In the GCC (version 4.8.2) manual, the following is stated:
-ftree-loop-if-convert-stores:
Attempt to also if-convert conditional jumps containing memory
writes. This transformation can be unsafe for multi-threaded
programs as it transforms conditional memory writes into
unconditional memory writes. For example,
for (i = 0; i < N; i++)
if (cond)
A[i] = expr;
is transformed to
for (i = 0; i < N; i++)
A[i] = cond ? expr : A[i];
potentially producing data races.
I wonder, however, if there is a performance gain from using the ?: operator versus the if statement.
In the first piece of code, A[i] is set to expr only if the condition is met. If it is not met, then the code inside the statement is skipped.
In the second one, A[i] seems to be written regardless of the condition; the condition only affects the value it is set to.
By using the ?: operator, we are also doing a check; however, we are adding some overhead in the case that the condition is not met. Have I missed something?
What it says is that conditional jumps are converted to conditional move instructions, the cmov family of instructions. They improve speed because they do not stall the processor pipeline the way mispredicted jumps do.
With a jump instruction, you don't know in advance which instructions to load, so a prediction is used and a branch is loaded into the pipeline. If the prediction was correct, all is well: the next instructions are already executing in the pipeline. However, after the jump is evaluated, if the prediction was wrong, all the following instructions already in the pipeline are useless, so the pipeline must be flushed and the correct instructions loaded. Modern processors contain 16-30 pipeline stages, and a branch misprediction degrades performance severely. Conditional moves bypass this because they do not insert branches into the program flow.
But does cmov always write?
From Intel x86 Instruction Set Reference:
The CMOVcc instructions check the state of one or more of the status flags in the EFLAGS register [..] and perform a move operation if the flags are in a specified state (or condition). [..] If the condition is not satisfied, a move is not performed and execution continues with the instruction following the CMOVcc instruction.
Edit
Upon further investigation of the gcc manual, I got confused, because as far as I know the compiler doesn't optimize by transforming C code into other C code, but uses internal data structures like control flow graphs, so I really don't know what they mean by their example. I suppose they mean the C equivalent of the newly generated flow. I am not sure anymore whether this optimization is about generating cmovs.
Edit 2
Since cmov can write only to a register and not to memory, this
if (cond)
A[i] = expr
cannot generate a cmov.
However this
A[i] = cond ? expr : A[i];
can.
Suppose we have the value of expr in ebx.
load A[i] into eax       ; read the old value of A[i]
cmp ...                  ; evaluate cond, setting the flags
cmovcc eax, ebx          ; if cond holds, eax = expr
store eax into &A[i]     ; write back unconditionally, whatever cond was
So in order to use cmov you have to read the value of A[i] and write it back even when cond is false, which is equivalent not to the if statement but to the ternary operator.
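For what it's worth, MIPS32 (the architecture from the start of this page) has analogous conditional moves, movz/movn. A sketch, assuming expr is in $t1, the evaluated cond is in $t2, and $a0 points at A[i]:

lw $t0, 0($a0)        # read the old value of A[i]
movn $t0, $t1, $t2    # if cond ($t2 != 0), $t0 = expr
sw $t0, 0($a0)        # write $t0 back to A[i] unconditionally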