difference between .IF and IF in assembly - if-statement

What is the difference between the .IF and IF directives in assembly?
The documentation for .IF shows:
.IF condition1
statements
[[.ELSEIF condition2
statements]]
[[.ELSE
statements]]
.ENDIF
and for IF :
IF expression1
ifstatements
[[ELSEIF expression2
elseifstatements]]
[[ELSE
elsestatements]]
ENDIF

IF-ELSEIF-ELSE-ENDIF (without the dot) are compile-time directives. The assembler will test the conditions, and based on the results, will only include one of the sequences of statements in the resulting program. They serve the same purpose as the C preprocessor directives #if, #elif, #else, and #endif.
.IF-.ELSEIF-.ELSE-.ENDIF (with the dot) are execution-time directives. The assembler will generate comparison and jumping instructions. They serve the same purpose as C statements in the form if (...) { ... } else if (...) { ... } else { ... }.
Note: I'm not fluent in masm, so there might be errors in the notation of these examples.
something EQU 1
somewhere:
mov ax, 42
IF something == 1
xor bx, 10
ELSE
mov bx, 20
ENDIF
add ax, bx
When the source is assembled, the assembler tests the conditions in IF and ELSEIF statements (without the dot) and selects the one block of code that will end up in the program. The above code is turned into the following:
somewhere:
mov ax, 42
xor bx, 10
add ax, bx
Another example:
something EQU 1
somewhere:
mov ax, 42
mov dx, something
.IF dx == 1
xor bx, 10
.ELSE
mov bx, 20
.ENDIF
add ax, bx
When the source is assembled, the assembler turns .IF statements (with the dot) into actual compare-and-jump instructions. The above code is probably turned into something like the following:
something EQU 1
somewhere:
mov ax, 42
mov dx, 1
cmp dx, 1
jnz else_clause
xor bx, 10
jmp past_endif
else_clause:
mov bx, 20
past_endif:
add ax, bx
The conditions are actually checked at execution time.
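To make the analogy concrete, here is a rough C++ rendering of the two MASM examples above (illustrative only; the macro and variable names are made up for this sketch, and this is not MASM):
#include <cstdio>

#define SOMETHING 1   // plays the role of "something EQU 1"

int main() {
    int ax = 42;
    int bx = 0;

    // Rough analogue of IF/ELSE/ENDIF: the preprocessor picks one branch,
    // and only that branch exists in the compiled program.
#if SOMETHING == 1
    bx ^= 10;
#else
    bx = 20;
#endif

    // Rough analogue of .IF/.ELSE/.ENDIF: an ordinary run-time test;
    // the compiler emits a compare and a conditional jump.
    int dx = SOMETHING;
    if (dx == 1)
        bx ^= 10;
    else
        bx = 20;

    std::printf("%d\n", ax + bx);
    return 0;
}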

Wrong
"IF-ELSEIF-ELSE-ENDIF (without the dot) are compile-time directives. The assembler will test the conditions, and based on the results, will only include one of the sequences of statements in the resulting program. They serve the same purpose as the C preprocessor directives #if, #elif, #else, and #endif.
.IF-.ELSEIF-.ELSE-.ENDIF (with the dot) are execution-time directives. The assembler will generate comparison and jumping instructions. They serve the same purpose as C statements in the form if (...) { ... } else if (...) { ... } else { ... }."
True
.if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. (This is from the GNU assembler (gas) documentation, where the dotted .if is an assembly-time conditional.)
reference:
http://web.mit.edu/gnu/doc/html/as_toc.html#SEC65

Related

Efficient overflow-immune arithmetic mean in C/C++

The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
EDIT2: Great discussion on Microsoft blog
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
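Why this works: a + b always equals 2*(a & b) + (a ^ b), because the AND picks out the bit positions where both operands contribute (which would carry) and the XOR the positions where only one does, so halving the two terms separately can never overflow. A small self-contained check (a hypothetical test harness, not part of the answer):
#include <cstdint>
#include <cstdio>

// Hypothetical helper: the overflow-free mean from the answer above.
static uint32_t mean_no_overflow(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) / 2);
}

int main() {
    const uint32_t samples[] = {0u, 1u, 0x7FFFFFFFu, 0xFFFFFFFEu, 0xFFFFFFFFu};
    for (uint32_t a : samples) {
        for (uint32_t b : samples) {
            // Reference result computed in a wider type, so it cannot overflow.
            uint32_t expected = (uint32_t)(((uint64_t)a + b) / 2);
            if (mean_no_overflow(a, b) != expected)
                std::printf("mismatch at a=%u b=%u\n", (unsigned)a, (unsigned)b);
        }
    }
    std::printf("done\n");
    return 0;
}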
There are a few typical methods to compute the average without overflow; the widening one only works when a wider integer type is available, e.g. for uint32_t on a 64-bit architecture, but not for uint64_t.
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
return (a & b) + ((a ^ b) >> 1);
}
// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
return min + (max - min) / 2;
}
// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
return a + (int32_t)(b - a) / 2;
}
// average increase width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
return ((uint64_t)a + b) >> 1;
}
The corresponding assembly sequences from clang on two architectures (x86-64, then AArch64) are
avg(unsigned int, unsigned int)
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
avg2(unsigned int, unsigned int)
sub esi, edi
shr esi
lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int)
mov ecx, edi
mov eax, esi
add rax, rcx
shr rax
vs.
avg(unsigned int, unsigned int)
and w8, w1, w0
eor w9, w1, w0
add w0, w8, w9, lsr #1
ret
avg2(unsigned int, unsigned int)
sub w8, w1, w0
add w0, w0, w8, lsr #1
ret
avg3(unsigned int, unsigned int):
mov w8, w1
add x8, x8, w0, uxtw
lsr x0, x8, #1
ret
Of these versions, avg2 on ARM64 performs as well as the optimal carry-flag sequence -- and avg3 likely would too, since the mov w8, w1 is only there to clear the top 32 bits, which may be unnecessary when the compiler knows they were already cleared by whichever earlier instruction produced the value.
A similar statement can be made about the Intel version of avg3, which in the optimal case would compile to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH;.

Is there a branchless method to quickly find the min/max of two double-precision floating-point values?

I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.
Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.
Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:
#include <algorithm>
double FindMinimum(double a, double b)
{
return std::min(a, b);
}
double FindMaximum(double a, double b)
{
return std::max(a, b);
}
I bet you've seen this before. In case you don't believe that this is branchless, check out the disassembly:
FindMinimum(double, double):
minsd xmm1, xmm0
movapd xmm0, xmm1
ret
FindMaximum(double, double):
maxsd xmm1, xmm0
movapd xmm0, xmm1
ret
That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most x86 processors without 64-bit support have SSE2, which was introduced in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what about if you did? Well, even there, you get branchless code on most popular compilers:
FindMinimum(double, double):
fld QWORD PTR [esp + 12]
fld QWORD PTR [esp + 4]
fucomi st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
FindMaximum(double, double):
fld QWORD PTR [esp + 4]
fld QWORD PTR [esp + 12]
fucomi st(1)
fxch st(1)
fcmovnbe st(0), st(1)
fstp st(1)
ret
The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.
The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:
double FindMinimum(double, double) PROC
fld QWORD PTR [a]
fld QWORD PTR [b]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(1) ; return "b"
ret
Alt:
fstp st(0) ; return "a"
ret
double FindMinimum(double, double) ENDP
double FindMaximum(double, double) PROC
fld QWORD PTR [b]
fld QWORD PTR [a]
fcom st(1) ; compare "b" to "a"
fnstsw ax ; transfer FPU status word to AX register
test ah, 5 ; check C0 and C2 flags
jp Alt
fstp st(0) ; return "b"
ret
Alt:
fstp st(1) ; return "a"
ret
double FindMaximum(double, double) ENDP
Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bit, and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word will also be relatively slow. This is why the Pentium Pro introduced the FCOMI instructions.
However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:
The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:
Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.
In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.
If you wanted to pursue this strategy anyway—say, to benchmark it—you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. So, since you are assuming that both values are positive:
FindMinimumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [a]
mov rdx, QWORD PTR [b]
sub rax, rdx ; subtract bitwise representation of the two values
shr rax, 63 ; isolate the sign bit to see if the result was negative
ret
FindMaximumOfTwoPositiveDoubles(double a, double b):
mov rax, QWORD PTR [b] ; \ reverse order of parameters
mov rdx, QWORD PTR [a] ; / for the SUB operation
sub rax, rdx
shr rax, 63
ret
Or, to avoid inline assembly:
bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
static_assert(sizeof(a) == sizeof(uint64_t),
"A double must be the same size as a uint64_t for this bit manipulation to work.");
const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
static_assert(sizeof(a) == sizeof(uint64_t),
"A double must be the same size as a uint64_t for this bit manipulation to work.");
const uint64_t aBits = *(reinterpret_cast<uint64_t*>(&a));
const uint64_t bBits = *(reinterpret_cast<uint64_t*>(&b));
return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}
Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.
// ...
// Enforce two's-complement lexicographic ordering.
// (For this fragment, aBits and bBits would need to be mutable, signed
// 64-bit values rather than the const uint64_t used above.)
if (aBits < 0)
{
aBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - aBits);
}
if (bBits < 0)
{
bBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - bBits);
}
// ...
Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.

Assembly: JA and JB work incorrectly

Since my main OS is Linux and the project is on Visual Studio, I decided to use online compilers to achieve it. I found this which was suggested by many. So here is my code:
#include <iostream>
using namespace std;
int main(void) {
float a = 1;
float b = 20.2;
float res = 0;
float res1 = 0;
_asm {
FLD a
FCOM b
JA midi
JMP modi
midi:
FST res
JMP OUT
modi:
FST res1
JMP OUT
}
OUT:
cout << "res = " << res << endl;
cout << "res1 = " << res1 << endl;
return 0;
}
My goal is simple: if a is greater than b then put a in res, otherwise in res1. But it seems like whatever I have in variable a, it always jumps to midi and the result is always res = whatever is in a. Hope you can help. Sorry if my question is silly, I've just started to learn assembly. thanks :)
P.S
same thing goes with JB but the opposite. It always prints res1 = whatever is in b.
From this page:
FCOM compares ST0 with the given operand, and sets the FPU flags
accordingly.
But your JA midi is testing the CPU flags.
It goes on:
FCOMI and FCOMIP work like the corresponding forms of FCOM and FCOMP,
but write their results directly to the CPU flags register rather
than the FPU status word, so they can be immediately followed by
conditional jump or conditional move instructions.
This is a common pitfall, FCOM doesn't set the flags in the flags register, it sets the condition code flags in the FPU status word.
From Intel manual 2:
Compares the contents of register ST(0) and source value and sets condition code flags C0, C2, and C3 in the FPU
status word according to the results (see the table below)
Where you can see that C3 takes the role of ZF and C0 of CF.
Use FCOMI/FUCOMI (and variants) to set the flags accordingly.
Performs an unordered comparison of the contents of registers ST(0) and ST(i) and sets the status flags ZF, PF, and
CF in the EFLAGS register according to the results [Table is identical to the above, where C2 is PF].
The difference between FCOMI and FUCOMI is that the latter allows unordered operands, i.e. qNaNs.
Alternatively you can still use FCOM but then move the condition codes to the flags with:
fnstsw ax ;FP Not-checked STore Status Word in AX
sahf ;Store AH into flags
Intel designed fnstsw to be compatible with sahf.
The former moves C3, C2 and C0 to bit 6, bit 2 and bit 0, respectively, of AH.
The latter sets the flags as RFLAGS(SF:ZF:0:AF:0:PF:1:CF) ← AH.
You can also test ax directly after the fnstsw ax as suggested in Section 8.3.6.1 Branching on the x87 FPU Condition Codes of manual 1.
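For completeness, here is one way the _asm block from the question could be repaired along those lines (a sketch, assuming 32-bit MSVC inline assembly as in the question, untested; FSTP is used instead of FST so the x87 stack stays balanced, and the rest of the program is unchanged):
_asm {
    FLD a          ; ST(0) = a
    FCOM b         ; compare ST(0) with b; sets C0/C2/C3 in the FPU status word
    FNSTSW AX      ; copy the FPU status word into AX
    SAHF           ; AH -> CPU flags: C0 -> CF, C2 -> PF, C3 -> ZF
    JA midi        ; now JA really tests "a > b"
    JMP modi
midi:
    FSTP res       ; store a into res and pop the x87 stack
    JMP OUT
modi:
    FSTP res1      ; store a into res1 and pop the x87 stack
    JMP OUT
}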

Is < faster than <=?

Is if (a < 901) faster than if (a <= 900)?
Not exactly as in this simple example, but there can be slight performance differences in complex loop code. I suppose this has something to do with the generated machine code, if it's even true.
No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:
A test or cmp instruction, which sets EFLAGS
And a Jcc (jump) instruction, depending on the comparison type (and code layout):
jne - Jump if not equal --> ZF = 0
jz - Jump if zero (equal) --> ZF = 1
jg - Jump if greater --> ZF = 0 and SF = OF
(etc...)
Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c
if (a < b) {
// Do something 1
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jge .L2 ; jump if a is >= b
; Do something 1
.L2:
And
if (a <= b) {
// Do something 2
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jg .L5 ; jump if a is > b
; Do something 2
.L5:
So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.
I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.
Latency — The number of clock cycles that are required for the
execution core to complete the execution of all of the μops that form
an instruction.
Throughput — The number of clock cycles required to
wait before the issue ports are free to accept the same instruction
again. For many instructions, the throughput of an instruction can be
significantly less than its latency
The values for Jcc are:
Latency Throughput
Jcc N/A 0.5
with the following footnote on Jcc:
Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.
So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.
If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)
Edit: Floating Point
This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
fstp st(0)
seta al ; Set al if above (CF=0 and ZF=0).
test al, al
je .L2
; Do something 1
.L2:
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; (same thing as above)
fstp st(0)
setae al ; Set al if above or equal (CF=0).
test al, al
je .L5
; Do something 2
.L5:
leave
ret
Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.
Comparison Subtraction
---------- -----------
A < B --> A - B < 0
A = B --> A - B = 0
A > B --> A - B > 0
Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.
There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.
Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.
Comparison Subtraction Carry Bit Zero Bit
---------- ----------- --------- --------
A < B --> A - B < 0 0 0
A = B --> A - B = 0 1 1
A > B --> A - B > 0 1 0
So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, that is,
;; Implementation of "if (A < B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; Branch if Carry is Zero to the new address
But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.
;; Implementation of "if (A <= B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; branch if A < B
bzs address ;; also, Branch if the Zero bit is Set
So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.
If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).
I see that neither is faster. The compiler generates the same machine code in each condition with a different value.
if(a < 901)
cmpl $900, -4(%rbp)
jg .L2
if(a <=901)
cmpl $901, -4(%rbp)
jg .L3
My example if is from GCC on x86_64 platform on Linux.
Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.
I noticed that if it is not a constant, then the same machine code is generated in either case.
int b;
if(a < b)
cmpl -4(%rbp), %eax
jge .L2
if(a <=b)
cmpl -4(%rbp), %eax
jg .L3
For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:
int compare_strict(double a, double b) { return a < b; }
On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.
Now consider this function instead:
int compare_loose(double a, double b) { return a <= b; }
This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.
You might think that the compiler could optimize the second function like so:
int compare_loose(double a, double b) { return ! (a > b); }
However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.
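A quick illustration of why that rewrite is invalid once NaNs are possible (hypothetical snippet, not from the answer above):
#include <cmath>
#include <cstdio>

int main() {
    double nan1 = std::nan("");
    double nan2 = std::nan("");
    // Every ordered comparison involving a NaN is false, so !(a > b)
    // is not equivalent to a <= b once NaNs are possible.
    std::printf("%d %d %d\n",
                (int)(nan1 <= nan2),   // 0
                (int)(nan1 > nan2),    // 0
                (int)!(nan1 > nan2));  // 1
    return 0;
}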
Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.
But that is because a 0 is involved (because CMP can, depending on the architecture, be replaced with e.g. OR) and not because of the <.
At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.
TL;DR answer
For most combinations of architecture, compiler and language, < will not be faster than <=.
Full answer
Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.
As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.
Counterexample
There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.
Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.
Summary
Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.
They have the same speed. Maybe on some special architecture what he/she said is right, but in the x86 family at least I know they are the same. To do the comparison the CPU performs a subtraction (a - b) and then checks the flags in the flags register. Two bits of that register are called ZF (zero flag) and SF (sign flag), and they are produced by the same single operation, so neither form is slower.
This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.
That would be pretty unusual though, as the compiler could work around it, making it irrelevant.
You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.
When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.
For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.
< vs. <= in general, including for runtime-variable conditions
In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.
As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.
Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
void foo(unsigned size) {
unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
for(unsigned i=0 ; i <= upper_bound ; i++)
...
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.
(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)
In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
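For illustration (not from the original answer; the function name is made up), the same loop written with a strict < condition, which the compiler can prove is finite:
void foo_fixed(unsigned size) {
    // With i < size the trip count is exactly `size`: when size == 0 the body
    // never runs, so the loop is provably finite and can be vectorized.
    for (unsigned i = 0; i < size; i++) {
        // ... loop body ...
    }
}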
Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (The question "Iterating over all unsigned integers in a for loop" shows it is possible in C to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.
Example: sum of integers from 1 to n
Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.
unsigned sum_1_to_n_finite(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i < n+1 ; ++i)
total += i;
return total;
}
x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer
# clang7.0 -O3 closed-form
cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
# else fall through into the closed-form calc
mov ecx, edi # zero-extend n into RCX
lea eax, [rdi - 1] # n-1
imul rax, rcx # n * (n-1) # 64-bit
shr rax # n * (n-1) / 2
add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
ret # computed without possible overflow of the product before right shifting
.LBB1_1:
xor eax, eax
ret
But for the naive version, we just get a dumb loop from clang.
unsigned sum_1_to_n_naive(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i<=n ; ++i)
total += i;
return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
xor ecx, ecx # i = 0
xor eax, eax # retval = 0
.LBB0_1: # do {
add eax, ecx # retval += i
add ecx, 1 # ++1
cmp ecx, edi
jbe .LBB0_1 # } while( i<=n );
ret
GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.
# "naive" inner loop
.L3:
add eax, 1 # do {
paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
ja .L3 #, # }while( n > i )
"finite" inner loop
# before the loop:
# xmm0 = 0 = totals
# xmm1 = {0,1,2,3} = i
# xmm2 = set1_epi32(4)
.L13: # do {
add eax, 1 # i++
paddd xmm0, xmm1 # total[0..3] += i[0..3]
paddd xmm1, xmm2 # i[0..3] += 4
cmp eax, edx
jne .L13 # }while( i != upper_limit );
then horizontal sum xmm0
and peeled cleanup for the last n%3 iterations, or something.
It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.
BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.
Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.
Every comparison (>= <= > <) can be done in the same speed.
What every comparison is, is just a subtraction (the difference) and seeing if it's positive/negative.
(If the msb is set, the number is negative)
How to check a >= b? Compute a-b and check that the result is non-negative.
How to check a <= b? Compute b-a and check that the result is non-negative.
How to check a < b? Compute a-b and check that the result is negative.
How to check a > b? Compute b-a and check that the result is negative.
Simply put, the computer can just do this underneath the hood for the given op:
a >= b == msb(a-b)==0
a <= b == msb(b-a)==0
a > b == msb(b-a)==1
a < b == msb(a-b)==1
and of course the computer wouldn't actually need to do the ==0 or ==1 either.
for the ==0 it could just invert the msb from the circuit.
Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b lol
In C and C++, an important rule for the compiler is the “as-if” rule: If doing X has the exact same behavior as if you did Y, then the compiler is free to choose which one it uses.
In your case, “a < 901” and “a <= 900” always have the same result, so the compiler is free to compile either version. If one version was faster, for whatever reason, then any quality compiler would produce code for the version that is faster. So unless your compiler produced exceptionally bad code, both versions would run at equal speed.
Now if you had a situation where two bits of code will always produce the same result, but it is hard to prove for the compiler, and/or it is hard for the compiler to prove which if any version is faster, then you might get different code running at different speeds.
PS The original example might run at different speeds if the processor supports single byte constants (faster) and multi byte constants (slower), so comparing against 255 (1 byte) might be faster than comparing against 256 (two bytes). I’d expect the compiler to do whatever is faster.
Only if computation path depends on data:
a={1,1,1,1,1000,1,1,1,1}
while (i<=4)
{
for(j from 0 to a[i]){ do_work(); }
i++;
}
will compute 250 times more than while(i<4)
A real-world example would be computing the Mandelbrot set. If you include a pixel that iterates 1,000,000 times it will cause a lag, but the probability of that coinciding with the <= usage is too low.

Can I expect float variable values that I set from literal constants to be unchanged after assignment to other variables?

If I do something like this:
float a = 1.5f;
float b = a;
void func(float arg)
{
if (arg == 1.5f) printf("You are teh awresome!");
}
func(b);
Will the text print every time (and on every machine)?
EDIT
I mean, I'm not really sure if the value will pass through the FPU at some point even if I'm not doing any calculations, and if so, whether the FPU will change the binary representation of the value. I've read somewhere that the (approximate) same floating point values can have multiple binary representations in IEEE 754.
First of all 1.5 can be stored accurately in memory so for this specific value, yes it will always be true.
More generally I think that the inaccuracies only pop up when you're doing computations, if you just store a value even if it doesn't have an accurate IEEE representation it will always be mapped to the same value (so 0.3 == 0.3 even though 0.3 * 3 != 0.9).
In the example, the value 1.5F has an exact representation in IEEE 754 (and pretty much any other conceivable binary or decimal floating point representation), so the answer is almost certainly going to be yes. However, there is no guarantee, and there could be compilers which do not manage to achieve the result.
If you change the value to one without an exact binary representation, such as 5.1F, the result is far from guaranteed.
Way, way, way back in their excellent classic book "The Elements of Programming Style", Kernighan & Plauger said:
A wise programmer once said, "Floating point numbers are like sand piles; every time you move one, you lose a little sand and you pick up a little dirt". And after a few computations, things can get pretty dirty.
(It's one of two phrases in the book that I highlighted many years ago1.)
They also observe:
10.0 times 0.1 is hardly ever 1.0.
Don't compare floating point numbers just for equality
Those observations were made in 1978 (for the second edition), but are still fundamentally valid today.
If the question is viewed at its most extremely restricted scope, you may be OK. If the question is varied very much, you are more likely to be bitten than not, and you'll probably be bitten sooner rather later.
1 The other highlighted phrase is (minus bullets):
the subroutine call permits us to summarize the irregularities in the argument list [...]
[t]he subroutine itself summarizes the regularities of the code [...]
If the value passes through the FPU at some point, it could be that, due to optimizations and register handling by the compiler, you end up comparing an FPU register with a value from memory. The former may have higher precision than the latter, and so the comparison gives you false.
This may vary depending on the compiler, compiler options and the platform you run on.
Here's a (not quite comprehensive) proof that (at least in GCC) you are guaranteed equality for floating literals.
Python code to generate file:
print """
#include <stdio.h>
int main()
{
"""
import random
chars = "abcdefghijklmnopqrstuvwxyz"
randoms = [str(random.random()) for _ in xrange(26)]
for c, r in zip(chars, randoms):
    print "float %s = %sf;" % (c, r)
for c, r in zip(chars, randoms):
    print r'if (%s != %sf) { printf("Error!\n"); }' % (c,r)
print """
return 0;
}
"""
Snippet of generated file:
#include <stdio.h>
int main()
{
float a = 0.199698325654f;
float b = 0.402517512357f;
float c = 0.700489844438f;
float d = 0.699640984356f;
if (a != 0.199698325654f) { printf("Error!\n"); }
if (b != 0.402517512357f) { printf("Error!\n"); }
if (c != 0.700489844438f) { printf("Error!\n"); }
if (d != 0.699640984356f) { printf("Error!\n"); }
return 0;
}
And running it correctly does not print anything to the screen:
$ ./a.out
$
But here's the catch: if you don't put the f suffix after the literals in the equality check, it will fail every time! You can leave the f suffix off in the assignment, though, without problems.
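A minimal illustration of that catch, assuming typical IEEE 754 float and double (hypothetical snippet):
#include <cstdio>

int main() {
    float c = 0.7f;                          // 0.7 has no exact binary representation
    std::printf("%d\n", (int)(c == 0.7f));   // 1: both sides are the same float value
    std::printf("%d\n", (int)(c == 0.7));    // typically 0: c is promoted to double,
                                             // which differs from the double 0.7
    return 0;
}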
I took a look at the disassembly from the VS2005 compiler. When running this simple program, I found that after float f=.1;, the condition f==.1 resulted in .... FALSE.
EDIT - this was due to the comparand being a double. When using a float literal (i.e. .1f), the comparison resulted in TRUE. This also holds when comparing double variables with double literals.
I added the source code and disassembly here:
float f=.1f;
0041363E fld dword ptr [__real#3dcccccd (415764h)]
00413644 fstp dword ptr [f]
double d=.1;
00413647 fld qword ptr [__real#3fb999999999999a (415748h)]
0041364D fstp qword ptr [d]
bool ffequal = f == .1f;
00413650 fld dword ptr [f]
00413653 fcomp qword ptr [__real#3fb99999a0000000 (415758h)]
00413659 fnstsw ax
0041365B test ah,44h
0041365E jp main+4Ch (41366Ch)
00413660 mov dword ptr [ebp-0F8h],1
0041366A jmp main+56h (413676h)
0041366C mov dword ptr [ebp-0F8h],0
00413676 mov al,byte ptr [ebp-0F8h]
0041367C mov byte ptr [fequal],al
// here, ffequal is true
bool dfequal = d == .1f;
0041367F fld qword ptr [__real#3fb99999a0000000 (415758h)]
00413685 fcomp qword ptr [d]
00413688 fnstsw ax
0041368A test ah,44h
0041368D jp main+7Bh (41369Bh)
0041368F mov dword ptr [ebp-0F8h],1
00413699 jmp main+85h (4136A5h)
0041369B mov dword ptr [ebp-0F8h],0
004136A5 mov al,byte ptr [ebp-0F8h]
004136AB mov byte ptr [dequal],al
// here, dfequal is false.
bool ddequal = d == .1;
004136AE fld qword ptr [__real#3fb999999999999a (415748h)]
004136B4 fcomp qword ptr [d]
004136B7 fnstsw ax
004136B9 test ah,44h
004136BC jp main+0AAh (4136CAh)
004136BE mov dword ptr [ebp-104h],1
004136C8 jmp main+0B4h (4136D4h)
004136CA mov dword ptr [ebp-104h],0
004136D4 mov al,byte ptr [ebp-104h]
004136DA mov byte ptr [ddequal],al
// ddequal is true
I don't think it will behave the same:
float f1 = .1;
// compiled as
//// 64-bits literal into EAX:EBX
if( f1 == .1 )
//// load .1 into 80 bits of FPU register 1
//// copy EAX:EBX into FPU register 2 (leaving alone the last 16 bits)
//// call FPU compare instruction
////
When the compiler generates code as outlined above, the condition will never be true: 16 bits will differ when comparing the 64-bit version of .1 with its 80-bit version.
Concluding: no, you cannot guarantee equality on each and every machine with this code.
It should print every time and on every machine. You're not using the output of any calculation, so there's no reason to assume that the values would change.
One caveat: certain "magic" numbers don't necessarily come from memory. In x86 we have the following instructions:
FLDZ - load 0.0 into ST(0)
FLD1 - load 1.0 into ST(0)
FLDL2T - load log2(10) into ST(0)
FLDL2E - load log2(e) into ST(0)
FLDPI - load pi into ST(0)
FLDLG2 - load log10(2) into ST(0)
FLDLN2 - load ln(2) into ST(0)
So if you happen to be comparing with one of these values you may possibly not get the exact same value as the const literal you referenced.