Compiler optimizations may cause integer overflow. Is that okay? - c++

I have an int x. For simplicity, say ints occupy the range -2^31 to 2^31-1. I want to compute 2*x-1. I allow x to be any value 0 <= x <= 2^30. If I compute 2*(2^30), I get 2^31, which is an integer overflow.
One solution is to compute 2*(x-1)+1. There's one more subtraction than I want, but this shouldn't overflow. However, the compiler will optimize this to 2*x-1. Is this a problem for the source code? Is this a problem for the executable?
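For reference, a minimal sketch of the two versions (the signature int(int) and the names func_a/func_b are my assumption; each one was compiled separately as the func shown in the godbolt listings below):
int func_a(int x) { return 2*x - 1; }       // overflows (UB) when x == 2^30
int func_b(int x) { return 2*(x - 1) + 1; } // stays within int range for all 0 <= x <= 2^30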
Here is the godbolt output for 2*x-1:
func(int): # #func(int)
lea eax, [rdi + rdi]
dec eax
ret
Here is the godbolt output for 2*(x-1)+1:
func(int): # #func(int)
lea eax, [rdi + rdi]
dec eax
ret

As Miles hinted: the C++ source text is bound by the rules of the C++ language (signed integer overflow = bad), but the compiler is only bound by the rules of the CPU (overflow = OK). It is allowed to make optimizations that your source code isn't allowed to make.
But don't take this as an excuse to get lazy. If you write undefined behavior, the compiler will take that as a hint and do other optimizations that result in your program doing the wrong thing.

Just because signed integer overflow isn't well-defined at the C++ language level doesn't mean that's the case at the assembly level. It's up to the compiler to emit assembly code that is well-defined on the CPU architecture you're targeting.
I'm pretty sure every CPU made in this century has used two's complement signed integers, and overflow is perfectly well defined for those. That means there is no problem simply calculating 2*x, letting the result overflow, then subtracting 1 and letting the result underflow back around.
Many such C++ language-level rules exist to paper over different CPU architectures. In this case, signed integer overflow was made undefined so that compilers targeting CPUs that use e.g. one's complement or sign/magnitude representations of signed integers aren't forced to add extra instructions to conform to the overflow behavior of two's complement.
Don't assume, however, that you can use a construct that is well-defined on your target CPU but undefined in C++ and get the answer you expect. C++ compilers assume undefined behavior cannot happen when performing optimization, and so they can and will emit different code from what you were expecting if your code isn't well-defined C++.
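As a concrete illustration of that last point (my example, not from the answer): because signed overflow is UB, a compiler may legally fold an overflow check like this one down to "return false", even though the wrapped result on a two's complement CPU would make it true.
bool will_overflow(int x) {
    return x + 1 < x;  // UB when x == INT_MAX; the optimizer may assume it never happens
}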

The ISO C++ rules apply to your source code (always, regardless of the target machine). Not to the asm the compiler chooses to make, especially for targets where signed integer wrapping just works.
The "as if" rules requires that the asm implementation of the function produce the same result as the C++ abstract machine, for every input value where the abstract machine doesn't encounter signed integer overflow (or other undefined behaviour). It doesn't matter how the asm produces those results, that's the entire point of the as-if rule. In some cases, like yours, the most efficient implementation would wrap and unwrap for some values that the abstract machine wouldn't. (Or in general, not wrap where the abstract machine does for unsigned or gcc -fwrapv.)
One effect of signed integer overflow being UB in the C++ abstract machine is that it lets the compiler optimize an int loop counter to pointer width, not redoing sign-extension every time through the loop or things like that. Also, compilers can infer value-range restrictions. But that's totally separate from how they implement the logic into asm for some target machine. UB doesn't mean "required to fail", in fact just the opposite, unless you compile with -fsanitize=undefined. It's extra freedom for the optimizer to make asm that doesn't match the source if you interpreted the source with more guarantees than ISO C++ actually gives (plus any guarantees the implementation makes beyond that, like if you use gcc -fwrapv.)
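A small sketch of the loop-counter point (my example, not from the answer): with a signed int counter, the compiler may assume i never wraps, so it can widen i to a pointer-width index register and avoid redoing sign extension every iteration.
long long sum(const int* a, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)  // UB on overflow lets the compiler treat i as a 64-bit index
        s += a[i];
    return s;
}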
For an expression like x/2, every possible int x has well-defined behaviour. For 2*x, the compiler can assume that x >= INT_MIN/2 and x <= INT_MAX/2, because larger magnitudes would involve UB.
2*(x-1)+1 implies a legal value-range for x from (INT_MIN+1)/2 to (INT_MAX+1)/2, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741824 (0x40000000). On the positive side, 2*(0x40000000 - 1) = 2*0x3fffffff doesn't overflow, and the +1 doesn't overflow either because 2*(x-1) is even, so it's at most INT_MAX - 1.
2*x - 1 implies a legal value-range for x from INT_MIN/2 + 1 to INT_MAX/2, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741823 (0x3fffffff). So the largest value the expression can produce is INT_MAX - 2 (2^31 - 3 here), because INT_MAX is odd.
In this case, the more complicated expression's legal value-range is a superset of the simpler expression, but in general that's not always the case.
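Spelled out as compile-time checks (a sketch assuming a 32-bit, 2's complement int; the mathematical (INT_MAX+1)/2 is written as INT_MAX/2 + 1 so nothing overflows in the check itself):
#include <climits>
static_assert(INT_MIN / 2 + 1   == -1073741823, "2*x - 1: smallest legal x");
static_assert(INT_MAX / 2       ==  1073741823, "2*x - 1: largest legal x");
static_assert((INT_MIN + 1) / 2 == -1073741823, "2*(x-1) + 1: smallest legal x");
static_assert(INT_MAX / 2 + 1   ==  1073741824, "2*(x-1) + 1: largest legal x");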
They produce the same result for every x that's a well-defined input for both of them. And x86 asm (where wrapping is well-defined) that works like one or the other can implement either, producing correct results for all non-UB cases. So the compiler would be doing a bad job if it didn't make the same efficient asm for both.
In general, 2's complement and unsigned binary integer math is commutative and associative (for operations where that's mathematically true, like + and *), and compilers can and should take full advantage. e.g. rearranging a+b+c+d to (a+b)+(c+d) to shorten dependency chains. (See an answer on Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for an example of GCC doing it with integer, but not FP.)
Unfortunately, GCC has sometimes been reluctant to do signed-int optimizations like that because its internals were treating signed integer math as non-associative, perhaps because of a misguided application of C++ UB rules to optimizing asm for the target machine. That's a GCC missed optimization; Clang didn't have that problem.
Further reading:
Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined? re: some useful loop optimizations it allows.
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Does undefined behavior apply to asm code? (no)
Is integer overflow undefined in inline x86 assembly?
The whole situation is basically a mess, and the designers of C didn't anticipate the current sophistication of optimizing compilers. Languages like Rust are better suited to it: if you want wrapping, you can (and must) tell the compiler about it on a per-operation basis, for both signed and unsigned types. Like x.wrapping_add(1).
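In C++ the closest portable equivalent is to do the arithmetic in unsigned, which is defined to wrap. A sketch (the conversion back to int32_t is well defined since C++20, and implementation-defined but two's complement in practice before that):
#include <cstdint>
int32_t twice_minus_one_wrapping(int32_t x) {
    // 2u*x - 1u wraps modulo 2^32 instead of being UB
    return static_cast<int32_t>(2u * static_cast<uint32_t>(x) - 1u);
}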
Re: why does clang split up the 2*x and the -1 with lea/dec
Clang is optimizing for latency on Intel CPUs before Ice lake, saving one cycle of latency at the cost of an extra uop of throughput cost. (Compilers often favour latency since modern CPUs are often wide enough to chew through the throughput costs, although it does eat up space in the out-of-order exec window for hiding cache miss latency.)
lea eax, [rdi + rdi - 1] has 3 cycle latency on Skylake, vs. 1 for the LEA it used. (See Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? for some details). On AMD Zen family, it's break-even for latency (a complex LEA only has 2c latency) while still costing an extra uop. On Ice Lake and later Intel, even a 3-component LEA is still only 1 cycle so it's pure downside there. See https://uops.info/, the entry for LEA_B_I_D8 (R32) (Base, Index, 8-bit displacement, with scale-factor = 1.)
This tuning decision is unrelated to integer overflow.

Signed integer overflow/underflow is undefined behavior precisely so that compilers may make optimizations such as this. Because the compiler is allowed to do anything in the case of overflow/underflow, it can do this, or whatever else is more optimal for the use cases it is required to care about.
If the behavior on signed overflow had been specified as “What the DEC PDP-8 did back in 1973,” compilers for other targets would need to insert instructions to check for overflow and, if it occurs, produce that result instead of whatever the CPU does natively.

Related

Compiler optimizations allowed via "int", "least" and "fast" non-fixed width types C/C++

Clearly, fixed-width integral types should be used when the size is important.
However, I read (Insomniac Games style guide) that "int" should be preferred for loop counters / function args / return codes / etc. when the size isn't important - the rationale given was that fixed-width types can preclude certain compiler optimizations.
Now, I'd like to make a distinction between "compiler optimization" and "a more suitable typedef for the target architecture". The latter has global scope, and my guess is that it probably has very limited impact unless the compiler can somehow reason about the global performance of the program parameterized by this typedef. The former has local scope, where the compiler would have the freedom to optimize the number of bytes used, and the operations, based on local register pressure / usage, among other things.
Does the standard permit "compiler optimizations" (as we've defined) for non-fixed-width types? Any good examples of this?
If not, and assuming the CPU can operate on smaller types at least as fast as on larger types, then I see no harm, from a performance standpoint, in using fixed-width integers sized according to local context. At least that gives the possibility of relieving register pressure, and I'd argue it couldn't be worse.
The reason that the rule of thumb is to use an int is that the standard defines this integral type as the natural data type of the CPU (provided that it is sufficiently wide for the range INT_MIN to INT_MAX). That's where the best performance stems from.
There are many things wrong with int_fast types - most notably that they can be slower than int!
#include <stdio.h>
#include <inttypes.h>
int main(void) {
printf("%zu\n", sizeof (int_fast32_t));
}
Run this on x86-64 and it prints 8... but that makes no sense: using 64-bit registers often requires extra instruction prefixes in x86-64, and since "behaviour on overflow is undefined", with a 32-bit int it doesn't matter if the upper 32 bits of the 64-bit register end up set after arithmetic - the behaviour is "still correct".
What is even worse, however, than using the signed fast or least types, is using a small unsigned integer instead of size_t or a signed integer for a loop counter - now the compiler must generate extra code to "ensure the correct wraparound behaviour".
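A sketch of the kind of loop that last paragraph has in mind (whether the compiler actually needs extra masking depends on what it can prove about n, so treat this as an illustration rather than a guarantee):
#include <cstddef>
#include <cstdint>
float sum_u16(const float* a, uint16_t n) {
    float s = 0;
    for (uint16_t i = 0; i < n; ++i)  // the counter must behave modulo 2^16
        s += a[i];
    return s;
}
float sum_size_t(const float* a, std::size_t n) {
    float s = 0;
    for (std::size_t i = 0; i < n; ++i)  // same width as the pointer, no truncation to honour
        s += a[i];
    return s;
}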
I'm not very familiar with the x86 instruction set, but unless you can guarantee that practically every arithmetic and move instruction also allows additional shifts and (sign) extensions, the assumption that smaller types are "at least as fast" as larger ones is not true.
The complexity of x86 makes it pretty hard to come up with simple examples, so let's consider an ARM microcontroller instead.
Let's define two addition functions which differ only in return type: "add32", which returns an integer of full register width, and "add8", which only returns a single byte.
int32_t add32(int32_t a, int32_t b) { return a + b; }
int8_t add8(int32_t a, int32_t b) { return a + b; }
Compiling those functions with -Os gives the following assembly:
add32(int, int):
add r0, r0, r1
bx lr
add8(int, int):
add r0, r0, r1
sxtb r0, r0 // Sign-extend single byte
bx lr
Notice how the function which only returns a byte is one instruction longer. It has to truncate the 32bit addition to a single byte.
Here is a link to the code # compiler explorer:
https://godbolt.org/z/ABFQKe
However, I read (Insomniac Games style guide), that "int" should be preferred for loop counters
You should rather be using size_t, whenever iterating over an array. int has other problems than performance, such as being signed and also problematic when porting.
From a standards point of view, when "n" is the width of an int, there should exist no case where int_fastn_t performs worse than int - if there is, the compiler/standard lib/ABI/system is at fault.
Does the standard permit "compiler optimizations" (as we've defined) for non-fixed-width types? Any good examples of this?
Sure, the compiler might optimize the use of integer types quite wildly, as long as it doesn't affect the outcome of the result. No matter if they are int or int32_t.
For example, an 8 bit CPU compiler might optimize int a=1; int b=1; ... c = a + b; to be performed on 8 bit arithmetic, ignoring integer promotions and the actual size of int. It will however most likely have to allocate 16 bits of memory to store the result.
But if we give it some rotten code like char a = 0x80; int b = a >> 1;, it will have to do the optimization so that the side effects of integer promotion are taken into account. That is, the result could be 0xFFC0 rather than 0x40 as one might have expected (assuming signed char, 2's complement, arithmetic shift). The a >> 1 part isn't possible to optimize down to an 8-bit type because of this - it has to be carried out with 16-bit arithmetic.
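For completeness, the promotion example as a compilable sketch (assuming char is signed, two's complement, and arithmetic right shift of negative values, as in the text above):
#include <stdio.h>
int main(void) {
    char a = (char)0x80;        /* value -128 with a signed 8-bit char */
    int  b = a >> 1;            /* a is promoted to int before the shift */
    printf("%x\n", (unsigned)b & 0xFFFFu);  /* prints ffc0, not 40 */
    return 0;
}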
I think the question you are trying to ask is:
Is the compiler allowed to make additional optimizations for a
non-fixed-width type such as int beyond what it would be allowed for
a fixed width type like int32_t that happens to have the same
length on the current platform?
That is, you are not interested in the part where the size of the non-fixed width type is allowed to be chosen appropriately for the hardware - you are aware of that and are asking if beyond that additional optimizations are available?
The answer, as far as I am aware or have seen, is no. No both in the sense that compilers do not actually optimize int differently than int32_t (on platforms where int is 32 bits), and also no in the sense that there are no optimizations allowed by the standard for int which are not also allowed for int32_t [1] (this second part is wrong - see comments).
The easiest way to see this is that the various fixed-width integers are all typedefs of underlying primitive integer types - so on a platform with 32-bit integers, int32_t will probably be a typedef (perhaps indirectly) of int. So from a behavioral and optimization point of view the types are identical, and once you are in the IR world of the compiler the original type probably isn't even available without jumping through hoops (i.e., int and int32_t will generate the same IR).
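A quick way to see this on your own platform (my sketch; the result is typical for mainstream ABIs but not guaranteed by the standard):
#include <cstdint>
#include <cstdio>
#include <type_traits>
int main() {
    // Prints 1 on the common platforms where int32_t is literally a typedef of int.
    std::printf("%d\n", (int)std::is_same<std::int32_t, int>::value);
}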
So I think the advice you received was wrong, or at best misleading.
[1] Of course the answer to the question "Is it allowed for a compiler to optimize int better than int32_t?" is yes, since there are no particular requirements on optimization, so a compiler could do something weird like that - or the reverse, such as optimizing int32_t better than int. I think that's not very interesting, though.

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (possibly also interested in C++ solutions), which states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
Many compilers and pre-processors may be clever enough to recognise that the "+ 2 - 2" is redundant and optimise it away. Similarly they could recognise that the "+= 2*b" followed by the "+= c" can be combined using an FMA. Even if they don't introduce an FMA, they may switch the order of these operations, etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out-of-order execution and decide it can do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproducibility, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which requires an exactly reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasible solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chunk of simple and largely vanilla C code as protected and untouchable by any (realistically most) optimisations, while allowing the rest of the code to be heavily optimised, covering optimisations from both the CPU and compiler.
This is not a complete answer, but it is informative, partially answers, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
a = t0;
t1 = 2*b;
a += t1;
a += c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the "excess precision is permitted" rule.) There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtraction, multiplication, division, and even square root, this does not happen if the excess precision is sufficiently greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format, and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel's 80-bit format.
So, whether this workaround works depends on the platform.
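For the FMA part specifically there is also a dedicated knob, separate from the one-operation-per-statement trick: the standard FP_CONTRACT pragma (a sketch; Clang honours the pragma in C and C++, while GCC has historically ignored it but accepts -ffp-contract=off on the command line):
double step(double a, double b, double c) {
#pragma STDC FP_CONTRACT OFF
    a += 2 * b;   // with contraction off, this must stay a separate multiply and add
    a += c;
    return a;
}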
Other Issues
Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes) and variations between math library routines (sin, log, and similar functions return different results on different platforms; nobody has fully implemented correctly rounded math library routines with known bounded performance).
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One of the objects has a name (a) and one of them does not, but they are like roses: they smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression1, so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different than the non-volatile object has after assignment, so nothing is accomplished.
Footnote
1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.
Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.
"clever enough to recognise the + 2 - 2 is redundant and optimise this
away"
No! All decent compilers will apply constant propagation, figure out that a is a constant, and optimize all of your statements away into something equivalent to a = 1;. (The original answer linked the resulting assembly on godbolt.)
Now if you make a volatile, the compiler has to assume that any access to a could have an effect outside the C++ program. Constant propagation will still be performed to optimise each of the right-hand sides, but the intermediate assignments are guaranteed to happen. (Again, the original answer linked the resulting assembly.)
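A sketch of the volatile variant being described (my example, based on the question's snippet): the stores to a are observable side effects, so each assignment must actually happen even though the right-hand sides can still be folded at compile time.
double f() {
    volatile double a;
    double b = -2, c = -3;
    a = 1 + 2 - 2 + 3 + 4;  // must store 8.0
    a += 2 * b;             // must reload a, then store again
    a += c;                 // ditto
    return a;
}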
If you don't want constant propagation to happen, you need to deactivate optimizations. In that case, the best approach would be to keep this code in a separate translation unit so that the rest can be compiled with all optimizations on.
However this is not ideal: the optimizer could outperform you, and with this approach you'll lose global optimisation across function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

Optimized code in VC++ and ASM

Good evening. Sorry, I used Google Translate.
I use NASM with VC++ on x86 and I'm learning how to use MASM on x64.
Is there any way to specify where each argument goes and where the return value of an assembly function comes back, in such a way that the compiler manages to place the data there in the fastest way? Can we also specify which registers will be used, so that the compiler knows which data is still live and can make the best use of it?
For example, since there is no intrinsic function that applies exactly the IDIV r/m64 instruction (64-bit signed integer division in assembly language), we may need to implement it. IDIV requires that the low part of the dividend/numerator be in RAX, the high part in RDX, and the divisor/denominator in any register or in a region of memory. At the end, the quotient is in RAX and the remainder in RDX. We may therefore want to write functions like this (I've included some useless operands just to illustrate):
void DivLongLongInt( long long NumLow , long long NumHigh , long long Den , long long *Quo , long long *Rem ){
__asm(
// Specify used register: [rax], specify pre location: NumLow --> [rax]
reg(rax)=NumLow ,
// Specify used register: [rdx], specify pre location: NumHigh --> [rdx]
reg(rdx)=NumHigh ,
// Specify required memory: memory64bits [den], specify pre location: Den --> [den]
mem[64](den)=Den ,
// Specify used register: [st0], specify pre location: Const(12.5) --> [st0]
reg(st0)=25*0.5 ,
// Specify used register: [bh]
reg(bh) ,
// Specify required memory: memory64bits [nothing]
mem[64](nothing) ,
// Specify used register: [st1]
reg(st1)
){
// Specify code
IDIV [den]
}(
// Specify pos location: [rax] --> *Quo
*Quo=reg(rax) ,
// Specify pos location: [rdx] --> *Rem
*Rem=reg(rdx)
) ;
}
Is it possible to do something at least close to that?
Thanks for all help.
If there is no way to do this, it's a shame, because it would certainly be a great way to implement high-level functions with assembly-level features. I think an interface like this between C++ and ASM ought to exist already: it would let assembly code be embedded inline at a high level, almost like plain C++ code.
As others have mentioned, MSVC does not support any form of inline assembly when targeting x86-64.
Inline assembly is supported only in x86-32 builds, and even there, it is rather limited in what you can do. In particular, you can't specify inputs and outputs, so the use of inline assembly necessarily entails a lot of shuffling of values back and forth between registers and memory, which is precisely the opposite of what you want when writing high-performance code. Unless there is something that you cannot possibly do any other way except by causing the manual emission of machine code, you should avoid the inline assembler. Its original purpose was to do things like generate OUT instructions and call ROM BIOS interrupts in obsolete 8-bit and 16-bit programming environments. It made it into the 32-bit compiler for compatibility purposes, but the team drew the line with 64-bit.
Intrinsics are now the recommended solution, because these play much better with the optimizer. Virtually any SIMD code that you need the compiler to generate can be accomplished using intrinsics, just as you would on most any other compiler targeting x86, so not only are you getting better code, but you're also getting slightly more portable code.
Even on Gnu-style compilers that support extended asm blocks, which give you the type of input/output operand power that you are looking for, there are still lots of good reasons to avoid the use of inline asm. Intrinsics are still a better solution there, as is finding a way to represent what you want in C and persuading the compiler to generate the assembly code that you wish it to emit.
The only exception is cases where there are no intrinsics available. The IDIV instruction is, unfortunately, one of those cases. (There are intrinsics available for 128-bit multiplication. They go by various names: either Windows-specific or compiler-specific.)
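For example, the 128-bit multiply mentioned above looks like this with the MSVC x64 intrinsic (my sketch; GCC and Clang would use the __int128 extension instead):
#include <intrin.h>
#include <cstdint>
void mul_full(int64_t a, int64_t b, int64_t* hi, int64_t* lo) {
    *lo = _mul128(a, b, hi);  // single IMUL producing the full 128-bit product (x64 only)
}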
On Gnu compilers that support 128-bit integer types as an extension on 64-bit targets, you can get the compiler to generate the code for you:
__int128_t dividend = 1234;
int64_t divisor = 64;
int64_t quotient = (dividend / divisor);
Now, this is generally compiled as a call to their library function that does 128-bit division, rather than an inline IDIV instruction that returns a 64-bit quotient. Presumably, this is because of the need to handle overflows, as David mentioned. Actually, it's worse than that. No C or C++ implementation can use the DIV/IDIV instructions because they are non-conforming. These instructions will result in overflow exceptions, whereas the standard says that the result should be truncated. (With multiplication, you do get inline IMUL/MUL instruction(s) because these don't have the overflow problem, since they return 128-bit results.)
This isn't actually as big of a loss as you might think. You seem to be assuming that the 64-bit IDIV instruction is really fast. It isn't. Although the actual numbers vary depending on the number of significant bits in the absolute value of the dividend, your values probably are quite large if you actually need the range of a 128-bit integer. Looking at Agner Fog's instruction tables will give you some idea of the performance you can expect on various architectures. It's getting faster on newer architectures (especially on the newer AMD processors; it's still sluggish on Intel), but it still has pretty substantial latencies. Just because it's one instruction doesn't mean that it runs in one cycle or anything like that. A single instruction might be good for code density when you're optimizing for size and worried about a call to a library function evicting other instructions from your cache, but division is a slow enough operation that this usually doesn't matter. In fact, division is so slow that compilers try very hard not to use it—whenever possible, they will do multiplication by the reciprocal, which is significantly faster. And if you're really needing to do multiplications quickly, you should look into parallelizing them with SIMD instructions, which all have intrinsics available.
Back to MSVC (although everything I said in the last paragraph still applies, of course), there are no 128-bit integer types, so if you need to implement this type of division, you will need to write the code in an external assembly module and link it in. The code is pretty simple, and Visual Studio has excellent, built-in support for assembling code with MASM and linking it directly into your project:
; Windows 64-bit calling convention passes parameters as follows:
; RCX == first 64-bit integer parameter (low bits of dividend)
; RDX == second 64-bit integer parameter (high bits of dividend)
; R8 == third 64-bit integer parameter (divisor)
; R9 == fourth 64-bit integer parameter (pointer to remainder)
Div128x64 PROC
mov rax, rcx
idiv r8 ; 128-bit divide (RDX:RAX / R8)
mov [r9], rdx ; store remainder
ret ; return, with quotient in RAX
Div128x64 ENDP
Then you just prototype that in your C++ code as:
extern int64_t Div128x64(int64_t loDividend,
int64_t hiDividend,
int64_t divisor,
int64_t* pRemainder);
and you're done. Call it as desired.
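A hypothetical usage sketch (note that the prototype probably also wants extern "C" so the name matches the unmangled MASM symbol):
#include <cstdint>
#include <cstdio>
extern "C" int64_t Div128x64(int64_t loDividend, int64_t hiDividend,
                             int64_t divisor, int64_t* pRemainder);
int main() {
    int64_t rem = 0;
    // Dividend = 2^64 (hi = 1, lo = 0); the quotient must fit in 64 bits or IDIV faults.
    int64_t quo = Div128x64(0, 1, 3, &rem);
    std::printf("%lld rem %lld\n", (long long)quo, (long long)rem);  // 6148914691236517205 rem 1
}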
The equivalent can be written for unsigned division, using the DIV instruction.
No, you don't get intelligent register allocation, but this isn't really a big deal with register renaming in the front end that can often elide register-register moves entirely (in other words, MOVs become zero-latency operations). Plus, the IDIV instruction is so restrictive anyway in terms of its operands, since they are hardcoded to RAX and RDX, that it's pretty unlikely a scheduler would be able to keep the values in those registers anyway, at least for any non-trivial piece of code.
Beware that once you write the necessary code to check for the possibility of overflows, or worse—the code to handle exceptions—this will very likely end up performing the same or worse as a library function that does a proper 128-bit division, so you should arguably just write and use that (until such time as Microsoft sees fit to provide one). That can be written in C (also see implementation of __divti3 library function for Gnu compilers), which makes it a candidate for inlining and otherwise plays better with the optimizer.
No, it is not possible to do this. MSVC doesn't support inline assembly for x64 builds. Instead, you should use intrinsics; almost everything is available. The sad thing is, as far as I know, 128-bit idiv is missing from the intrinsics.
A note: you can solve your issue with two movs (to put the inputs into the correct registers). And you should not worry about that; current CPUs handle mov very well. Putting a mov into the code may not slow it down at all. And div is very expensive compared to a mov, so it doesn't matter much.

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original calculation      Replacement calculation
y = x / 8                 y = x >> 3
y = x * 64                y = x << 6
y = x * 2                 y = x << 1
y = x * 15                y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shift operations, the compiler is generally taught to perform even smarter transformations that you would probably not think of on your own and which are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
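If, like me, you don't immediately see that last identity, a quick sanity check (my sketch, unsigned so nothing can overflow into UB):
#include <cassert>
int main() {
    unsigned samples[] = {0u, 1u, 7u, 1000u, 123456u};
    for (unsigned x : samples)
        assert(x * 15u == (x << 4) - x);  // x*15 == x*16 - x
}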
This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take less, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 specifically - when the divisor is a runtime-variable value and so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs - the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations plus additional logic to handle the irregularity. This may depend on the optimization level and on which of the operands are constants.
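A small sketch of that difference (my example): for x = -3, x % 4 is -3 (truncation toward zero) while x & 3 is 1, so the compiler can only use the plain AND when it knows the value is non-negative.
int mod4_unsigned(unsigned x) { return (int)(x % 4u); }  // compiles to a single AND
int mod4_signed(int x)        { return x % 4; }          // needs extra instructions to fix up negative x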
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.
C operators cannot be meaningfully compared in terms of "performance". There's no such thing as "faster" or "slower" operators at the language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.
Here is the code generated by the compiler (GCC 4.6) with -O3 for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions... which are nothing in a production environment. This would be a premature optimization that just introduces complexity.
Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program calls modulus(135, 16), nowhere in the compiled code will there be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved, and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.
Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

Do modern compilers optimize the x * 2 operation to x << 1?

Does the C++ compiler optimize the multiply by two operation x*2 to a bitshift operation x<<1?
I would love to believe that yes.
Actually VS2008 optimizes this to x+x:
01391000 push ecx
int x = 0;
scanf("%d", &x);
01391001 lea eax,[esp]
01391004 push eax
01391005 push offset string "%d" (13920F4h)
0139100A mov dword ptr [esp+8],0
01391012 call dword ptr [__imp__scanf (13920A4h)]
int y = x * 2;
01391018 mov ecx,dword ptr [esp+8]
0139101C lea edx,[ecx+ecx]
In an x64 build it is even more explicit and uses:
int y = x * 2;
000000013FB9101E mov edx,dword ptr [x]
printf("%d", y);
000000013FB91022 lea rcx,[string "%d" (13FB921B0h)]
000000013FB91029 add edx,edx
This was with the optimization settings on 'Maximize speed' (/O2).
This article from Raymond Chen could be interesting:
When is x/2 different from x>>1? :
http://blogs.msdn.com/oldnewthing/archive/2005/05/27/422551.aspx
Quoting Raymond:
Of course, the compiler is free to recognize this and rewrite your multiplication or shift operation. In fact, it is very likely to do this, because x+x is more easily pairable than a multiplication or shift. Your shift or multiply-by-two is probably going to be rewritten as something closer to an add eax, eax instruction.
[...]
Even if you assume that the shift fills with the sign bit, the result of the shift and the divide are different if x is negative.
(-1) / 2 ≡ 0
(-1) >> 1 ≡ -1
[...]
The moral of the story is to write what you mean. If you want to divide by two, then write "/2", not ">>1".
We can only assume it is wise to tell the compiler what you want, not how you want it done: the compiler is better than a human at optimizing small-scale code (thanks to Daemin for pointing out this subtle point). If you really want optimization, use a profiler and study your algorithms' efficiency.
VS 2008 optimized mine to x << 1.
x = x * 2;
004013E7 mov eax,dword ptr [x]
004013EA shl eax,1
004013EC mov dword ptr [x],eax
EDIT: This was using VS's default "Debug" configuration with optimization disabled (/Od). Using any of the optimization switches (/O1, /O2 (VS "Retail"), or /Ox) results in the add-self code Rob posted. Also, just for good measure, I verified that x = x << 1 is indeed treated the same way as x = x * 2 by the cl compiler in both /Od and /Ox. So, in summary, cl.exe version 15.00.30729.01 for x86 treats * 2 and << 1 identically, and I expect nearly all other recent compilers do the same.
Not if x is a float it won't.
Yes. They also optimize other similar operations, such as multiplying by non-powers of two that can be rewritten as the sums of some shifts. They will also optimize divisions by powers of 2 into right-shifts, but beware that when working with signed integers, the two operations are different! The compiler has to emit some extra bit twiddling instructions to make sure the results are the same for positive and negative numbers, but it's still faster than doing a division. It also similarly optimizes moduli by powers of 2.
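The signed-division fix-up mentioned there, as a sketch (on a typical two's complement target the compiler biases negative values before the arithmetic shift so that the result still rounds toward zero):
int div4(int x) {
    return x / 4;  // roughly (x + ((x >> 31) & 3)) >> 2 on a 32-bit target, not just x >> 2
}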
The answer is "if it is faster" (or smaller). This depends on the target architecture heavily as well as the register usage model for a given compiler. In general, the answer is "yes, always" as this is a very simple peephole optimization to implement and is usually a decent win.
That's only the start of what optimizers can do. To see what your compiler does, look for the switch that causes it to emit assembler source. For the Digital Mars compilers, the output assembler can be examined with the OBJ2ASM tool. If you want to learn how your compiler works, looking at the assembler output can be very illuminating.
I'm sure they all do these kind of optimizations, but I wonder if they are still relevant. Older processors did multiplication by shifting and adding, which could take a number of cycles to complete. Modern processors, on the other hand, have a set of barrel-shifters which can do all the necessary shifts and additions simultaneously in one clock cycle or less. Has anyone actually benchmarked whether these optimizations really help?
Yes, they will.
Unless something is specified in a languages standard you'll never get a guaranteed answer to such a question. When in doubt have your compiler spit out assemble code and check. That's going to be the only way to really know.
@Ferruccio Barletta
That's a good question. I went Googling to try to find the answer.
I couldn't find answers for Intel processors directly, but this page has someone who tried to time things. It shows shifts to be more than twice as fast as adds and multiplies. Bit shifts are so simple (where a multiply could be a shift and an addition) that this makes sense.
So then I Googled AMD, and found an old optimization guide for the Athlon from 2002 that lists the fastest ways to multiply numbers by constants between 2 and 32. Interestingly, it depends on the number. Some are adds, some shifts. It's on page 122.
A guide for the Athlon 64 shows the same thing (page 164 or so). It says multiplies are 3 (in 32-bit) or 4 (in 64-bit) cycle operations, where shifts are 1 and adds are 2.
It seems it is still useful as an optimization.
Ignoring cycle counts though, this kind of method would prevent you from tying up the multiplication execution units (possibly), so if you were doing lots of multiplications in a tight loop where some use constants and some don't the extra scheduling room might be useful.
But that's speculation.
Why do you think that's an optimization?
Why not 2*x → x+x? Or maybe the multiplication operation is as fast as the left-shift operation (maybe only if only one bit is set in the multiplicand)? If you never use the result, why not leave it out from the compiled output? If the compiler already loaded 2 to some register, maybe the multiplication instruction will be faster e.g. if we'd have to load the shift count first. Maybe the shift operation is larger, and your inner loop would no longer fit into the prefetch buffer of the CPU thus penalizing performance? Maybe the compiler can prove that the only time you call your function x will have the value 37 and x*2 can be replaced by 74? Maybe you're doing 2*x where x is a loop count (very common, though implicit, when looping over two-byte objects)? Then the compiler can change the loop from
for(int x = 0; x < count; ++x) ...2*x...
to the equivalent (leaving aside pathologies)
int count2 = count * 2;
for(int x = 0; x < count2; x += 2) ...x...
which replaces count multiplications with a single multiplication, or it might be able to leverage the lea instruction, which combines the multiplication with an address calculation.
My point is: there are millions of factors deciding whether replacing x*2 by x<<1 yields a faster binary. An optimizing compiler will try to generate the fastest code for the program it is given, not for an isolated operation. Therefore optimization results for the same code can vary largely depending on the surrounding code, and they may not be trivial at all.
Generally, there are very few benchmarks that show large differences between compilers. It is therefore a fair assumption that compilers are doing a good job because if there were cheap optimizations left, at least one of the compilers would implement them -- and all the others would follow in their next release.
It depends on what compiler you have. Visual C++ for example is notoriously poor in optimizing. If you edit your post to say what compiler you are using, it would be easier to answer.