Signed overflow in C++ and undefined behaviour (UB) - c++

I'm wondering about the use of code like the following
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
If the loop is iterated over n times, then factor is multiplied by 10 exactly n times. However, factor is only ever used after having been multiplied by 10 a total of n-1 times. If we assume that factor never overflows except on the last iteration of the loop, but may overflow on the last iteration of the loop, then should such code be acceptable? In this case, the value of factor would provably never be used after the overflow has happened.
I'm having a debate on whether code like this should be accepted. It would be possible to put the multiplication inside an if-statement and just not do the multiplication on the last iteration of the loop when it can overflow. The downside is that it clutters the code and adds an unnecessary branch that would need to check for on all the previous loop iterations. I could also iterate over the loop one fewer time and replicate the loop body once after the loop, again, this complicates the code.
The actual code in question is used in a tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application.

Compilers do assume that a valid C++ program does not contain UB. Consider for example:
if (x == nullptr) {
*x = 3;
} else {
*x = 5;
}
If x == nullptr then dereferencing it and assigning a value is UB. Hence the only way this could end in a valid program is when x == nullptr will never yield true and the compiler can assume under the as if rule, the above is equivalent to:
*x = 5;
Now in your code
int result = 0;
int factor = 1;
for (...) { // Loop until factor overflows but not more
result = ...
factor *= 10;
}
return result;
The last multiplication of factor cannot happen in a valid program (signed overflow is undefined). Hence also the assignment to result cannot happen. As there is no way to branch before the last iteration also the previous iteration cannot happen. Eventually, the part of code that is correct (i.e., no undefined behaviour ever happens) is:
// nothing :(

The behaviour of int overflow is undefined.
It doesn't matter if you read factor outside the loop body; if it has overflowed by then then the behaviour of your code on, after, and somewhat paradoxically before the overflow is undefined.
One issue that might arise in keeping this code is that compilers are getting more and more aggressive when it comes to optimisation. In particular they are developing a habit where they assume that undefined behaviour never happens. For this to be the case, they may remove the for loop altogether.
Can't you use an unsigned type for factor although then you'd need to worry about unwanted conversion of int to unsigned in expressions containing both?

It might be insightful to consider real-world optimizers. Loop unrolling is a known technique. The basic idea of loop unrolling is that
for (int i = 0; i != 3; ++i)
foo()
might be better implemented behind the scenes as
foo()
foo()
foo()
This is the easy case, with a fixed bound. But modern compilers can also do this
for variable bounds:
for (int i = 0; i != N; ++i)
foo();
becomes
__RELATIVE_JUMP(3-N)
foo();
foo();
foo();
Obviously this only works if the compiler knows that N<=3. And that's where we get back to the original question:
int result = 0;
int factor = 1;
for (...) {
result = ...
factor *= 10;
}
return result;
Because the compiler knows that signed overflow does not occur, it knows that the loop can execute a maximum of 9 times on 32 bits architectures. 10^10 > 2^32. It can therefore do a 9 iteration loop unroll. But the intended maximum was 10 iterations !.
What might happen is that you get a relative jump to a assembly instruction (9-N) with N==10, so an offset of -1, which is the jump instruction itself. Oops. This is a perfectly valid loop optimization for well-defined C++, but the example given turns into a tight infinite loop.

Any signed integer overflow results in undefined behaviour, regardless of whether or not the overflowed value is or might be read.
Maybe in your use-case you can to lift the first iteration out of the loop, turning this
int result = 0;
int factor = 1;
for (int n = 0; n < 10; ++n) {
result += n + factor;
factor *= 10;
}
// factor "is" 10^10 > INT_MAX, UB
into this
int factor = 1;
int result = 0 + factor; // first iteration
for (int n = 1; n < 10; ++n) {
factor *= 10;
result += n + factor;
}
// factor is 10^9 < INT_MAX
With optimization enabled, the compiler might unroll the second loop above into one conditional jump.

This is UB; in ISO C++ terms the entire behaviour of the entire program is completely unspecified for an execution that eventually hits UB. The classic example is as far as the C++ standard cares, it can make demons fly out of your nose. (I recommend against using an implementation where nasal demons are a real possibility). See other answers for more details.
Compilers can "cause trouble" at compile time for paths of execution they can see leading to compile-time-visible UB, e.g. assume those basic blocks are never reached.
See also What Every C Programmer Should Know About Undefined Behavior (LLVM blog). As explained there, signed-overflow UB lets compilers prove that for(... i <= n ...) loops are not infinite loops, even for unknown n. It also lets them "promote" int loop counters to pointer width instead of redoing sign-extension. (So the consequence of UB in that case could be accessing outside the low 64k or 4G elements of an array, if you were expecting signed wrapping of i into its value range.)
In some cases compilers will emit an illegal instruction like x86 ud2 for a block that provably causes UB if ever executed. (Note that a function might not ever be called, so compilers can't in general go berserk and break other functions, or even possible paths through a function that don't hit UB. i.e. the machine code it compiles to must still work for all inputs that don't lead to UB.)
Probably the most efficient solution is to manually peel the last iteration so the unneeded factor*=10 can be avoided.
int result = 0;
int factor = 1;
for (... i < n-1) { // stop 1 iteration early
result = ...
factor *= 10;
}
result = ... // another copy of the loop body, using the last factor
// factor *= 10; // and optimize away this dead operation.
return result;
Or if the loop body is large, consider simply using an unsigned type for factor. Then you can let the unsigned multiply overflow and it will just do well-defined wrapping to some power of 2 (the number of value bits in the unsigned type).
This is fine even if you use it with signed types, especially if your unsigned->signed conversion never overflows.
Conversion between unsigned and 2's complement signed is free (same bit-pattern for all values); the modulo wrapping for int -> unsigned specified by the C++ standard simplifies to just using the same bit-pattern, unlike for one's complement or sign/magnitude.
And unsigned->signed is similarly trivial, although it is implementation-defined for values larger than INT_MAX. If you aren't using the huge unsigned result from the last iteration, you have nothing to worry about. But if you are, see Is conversion from unsigned to signed undefined?. The value-doesn't-fit case is implementation-defined, which means that an implementation must pick some behaviour; sane ones just truncate (if necessary) the unsigned bit pattern and use it as signed, because that works for in-range values the same way with no extra work. And it's definitely not UB. So big unsigned values can become negative signed integers. e.g. after int x = u; gcc and clang don't optimize away x>=0 as always being true, even without -fwrapv, because they defined the behaviour.

If you can tolerate a few additional assembly instructions in the loop, instead of
int factor = 1;
for (int j = 0; j < n; ++j) {
...
factor *= 10;
}
you can write:
int factor = 0;
for (...) {
factor = 10 * factor + !factor;
...
}
to avoid the last multiplication. !factor will not introduce a branch:
xor ebx, ebx
L1:
xor eax, eax
test ebx, ebx
lea edx, [rbx+rbx*4]
sete al
add ebp, 1
lea ebx, [rax+rdx*2]
mov edi, ebx
call consume(int)
cmp r12d, ebp
jne .L1
This code
int factor = 0;
for (...) {
factor = factor ? 10 * factor : 1;
...
}
also results in branchless assembly after optimization:
mov ebx, 1
jmp .L1
.L2:
lea ebx, [rbx+rbx*4]
add ebx, ebx
.L1:
mov edi, ebx
add ebp, 1
call consume(int)
cmp r12d, ebp
jne .L2
(Compiled with GCC 8.3.0 -O3)

You didn't show what's in the parentheses of the for statement, but I'm going to assume it's something like this:
for (int n = 0; n < 10; ++n) {
result = ...
factor *= 10;
}
You can simply move the counter increment and loop termination check into the body:
for (int n = 0; ; ) {
result = ...
if (++n >= 10) break;
factor *= 10;
}
The number of assembly instructions in the loop will remain the same.
Inspired by Andrei Alexandrescu's presentation "Speed Is Found In The Minds of People".

Consider the function:
unsigned mul_mod_65536(unsigned short a, unsigned short b)
{
return (a*b) & 0xFFFFu;
}
According to the published Rationale, the authors of the Standard would have expected that if this function were invoked on (e.g.) a commonplace 32-bit computer with arguments of 0xC000 and 0xC000, promoting the operands of * to signed int would cause the computation to yield -0x10000000, which when converted to unsigned would yield 0x90000000u--the same answer as if they had made unsigned short promote to unsigned. Nonetheless, gcc will sometimes optimize that function in ways that would behave nonsensically if an overflow occurs. Any code where some combination of inputs could cause an overflow must be processed with -fwrapv option unless it would be acceptable to allow creators of deliberately-malformed input to execute arbitrary code of their choosing.

Why not this:
int result = 0;
int factor = 10;
for (...) {
factor *= 10;
result = ...
}
return result;

There are many different faces of Undefined Behavior, and what's acceptable depends on the usage.
tight inner-loop that consumes a large chunk of the total CPU time in a real-time graphics application
That, by itself, is a bit of an unusual thing, but be that as it may... if this is indeed the case, then the UB is most probably within the realm "allowable, acceptable". Graphics programming is notorious for hacks and ugly stuff. As long as it "works" and it doesn't take longer than 16.6ms to produce a frame, usually, nobody cares. But still, be aware of what it means to invoke UB.
First, there is the standard. From that point of view, there's nothing to discuss and no way to justify, your code is simply invalid. There are no ifs or whens, it just isn't a valid code. You might as well say that's middle-finger-up from your point of view, and 95-99% of the time you'll be good to go anyway.
Next, there's the hardware side. There are some uncommon, weird architectures where this is a problem. I'm saying "uncommon, weird" because on the one architecture that makes up 80% of all computers (or the two architectures that together make up 95% of all computers) overflow is a "yeah, whatever, don't care" thing on the hardware level. You sure do get a garbage (although still predictable) result, but no evil things happen.
That is not the case on every architecture, you might very well get a trap on overflow (though seeing how you speak of a graphics application, the chances of being on such an odd architecture are rather small). Is portability an issue? If it is, you may want to abstain.
Last, there is the compiler/optimizer side. One reason why overflow is undefined is that simply leaving it at that was easiest to cope with hardware once upon a time. But another reason is that e.g. x+1 is guaranteed to always be larger than x, and the compiler/optimizer can exploit this knowledge. Now, for the previously mentioned case, compilers are indeed known to act this way and simply strip out complete blocks (there existed a Linux exploit some years ago which was based on the compiler having dead-stripped some validation code because of exactly this).
For your case, I would seriously doubt that the compiler does some special, odd, optimizations. However, what do you know, what do I know. When in doubt, try it out. If it works, you are good to go.
(And finally, there's of course code audit, you might have to waste your time discussing this with an auditor if you're unlucky.)

Related

How to let GCC compiler turn variable-division into mul(if faster)

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV turn into MUL and use an array to store the mulvalue? If so, how to let the compiler do the optimization?
int main() {
uint a, s=0, i, t;
scanf("%d", &a);
diviuint aa = a;
t = clock();
for (i=0; i<1000000000; i++)
s += i/a;
printf("Result:%10u\n", s);
printf("Time:%12u\n", clock()-t);
return 0;
}
where diviuint(a) make a memory of 1/a and use multiple instead
Using s+=i/aa makes the speed 2 times of s+=i/a
You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is with the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: Look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing faster. Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form simple constant time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a;
uint32_t s = a * n*(n+1)/2 + 1000000000 % a;
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
// use i_over_a
++remainder;
if (remainder == a) { // if you don't need the remainder in the loop, it could save an insn or two to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you need a clever variable name other than remainder.
remainder = 0;
++i_over_a;
}
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture. Branchless asm is probably possible but not worth it. e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition. The branchless asm would be simplified by decrementing from a to 0. (don't need a zeroed reg for cmov, and could have a in a reg instead of a-1))
Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know what's the range, so all usual tricks will not work.
Basically you need to handle both a and b between INT_MAX and INT_MIN. There's no space left for scaling them up/down. Even if you wanted to extend them to larger types, it would probably take longer time just to invert b and check that the result will be consistent.
The only way to KNOW if div or mul is faster is by testing both in a benchmark [obviously, if you use your above code, you'd mostly measure the time of read/write of the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster, on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them - and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations - modern CPU's are pretty good at "hiding" latency as long as there are other instructions to execute [so you should use this in an "as realistic scenario as possible"].
There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.

Binary Addition without overflow wrap-around in C/C++

I know that when overflow occurs in C/C++, normal behavior is to wrap-around. For example, INT_MAX+1 is an overflow.
Is possible to modify this behavior, so binary addition takes place as normal addition and there is no wraparound at the end of addition operation ?
Some Code so this would make sense. Basically, this is one bit (full) added, it adds bit by bit in 32
int adder(int x, int y)
{
int sum;
for (int i = 0; i < 31; i++)
{
sum = x ^ y;
int carry = x & y;
x = sum;
y = carry << 1;
}
return sum;
}
If I try to adder(INT_MAX, 1); it actually overflows, even though, I amn't using + operator.
Thanks !
Overflow means that the result of an addition would exceed std::numeric_limits<int>::max() (back in C days, we used INT_MAX). Performing such an addition results in undefined behavior. The machine could crash and still comply with the C++ standard. Although you're more likely to get INT_MIN as a result, there's really no advantage to depending on any result at all.
The solution is to perform subtraction instead of addition, to prevent overflow and take a special case:
if ( number > std::numeric_limits< int >::max() - 1 ) { // ie number + 1 > max
// fix things so "normal" math happens, in this case saturation.
} else {
++ number;
}
Without knowing the desired result, I can't be more specific about the it. The performance impact should be minimal, as a rarely-taken branch can usually be retired in parallel with subsequent instructions without delaying them.
Edit: To simply do math without worrying about overflow or handling it yourself, use a bignum library such as GMP. It's quite portable, and usually the best on any given platform. It has C and C++ interfaces. Do not write your own assembly. The result would be unportable, suboptimal, and the interface would be your responsibility!
No, you have to add them manually to check for overflow.
What do you want the result of INT_MAX + 1 to be? You can only fit INT_MAX into an int, so if you add one to it, the result is not going to be one greater. (Edit: On common platforms such as x86 it is going to wrap to the largest negative number: -(INT_MAX+1). The only way to get bigger numbers is to use a larger variable.
Assuming int is 4-bytes (as is typical on x86 compilers) and you are executing an add instruction (in 32-bit mode), the destination register simply does overflow -- it is out of bits and can't hold a larger value. It is a limitation of the hardware.
To get around this, you can hand-code, or use an aribitrarily-sized integer library that does the following:
First perform a normal add instruction on the lowest-order words. If overflow occurs, the Carry flag is set.
For each increasingly-higher-order word, use the adc instruction, which adds the two operands as usual, but takes into account the value of the Carry flag (as a value of 1.)
You can see this for a 64-bit value here.

Is dividing by zero accompanied with a runtime error ever useful in C++?

According to C++ Standard (5/5) dividing by zero is undefined behavior. Now consider this code (lots of useless statements are there to prevent the compiler from optimizing code out):
int main()
{
char buffer[1] = {};
int len = strlen( buffer );
if( len / 0 ) {
rand();
}
}
Visual C++ compiles the if-statement like this:
sub eax,edx
cdq
xor ecx,ecx
idiv eax,ecx
test eax,eax
je wmain+2Ah (40102Ah)
call rand
Clearly the compiler sees that the code is to divide by zero - it uses xor x,x pattern to zero out ecx which then serves the second operand in integer division. This code will definitely trigger an "integer division by zero" error at runtime.
IMO such cases (when the compiler knows that the code will divide by zero at all times) are worth a compile-time error - the Standard doesn't prohibit that. That would help diagnose such cases at compile time instead of at runtime.
However I talked to several other developers and they seem to disagree - their objection is "what if the author wanted to divide by zero to... emm... test error handling?"
Intentionally dividing by zero without compiler awareness is not that hard - using __declspec(noinline) Visual C++ specific function decorator:
__declspec(noinline)
void divide( int what, int byWhat )
{
if( what/byWhat ) {
rand();
}
}
void divideByZero()
{
divide( 0, 0 );
}
which is much more readable and maintainable. One can use that function when he "needs to test error handling" and have a nice compile-time error in all other cases.
Am I missing something? Is it necessary to allow emission of code that the compiler knows divides by zero?
There is probably code out there which has accidental division by zero in functions which are never called (e.g. because of some platform-specific macro expansion), and these would no longer compile with your compiler, making your compiler less useful.
Also, most division by zero errors that I've seen in real code are input-dependent, or at least are not really amenable to static analysis. Maybe it's not worth the effort of performing the check.
Dividing by 0 is undefined behavior because it might trigger, on certain platforms, a hardware exception. We could all wish for a better behaved hardware, but since nobody ever saw fit to have integers with -INF/+INF and NaN values, it's quite pointeless.
Now, because it's undefined behavior, interesting things may happen. I encourage you to read Chris Lattner's articles on undefined behavior and optimizations, I'll just give a quick example here:
int foo(char* buf, int i) {
if (5 / i == 3) {
return 1;
}
if (buf != buf + i) {
return 2;
}
return 0;
}
Because i is used as a divisor, then it is not 0. Therefore, the second if is trivially true and can be optimized away.
In the face of such transformations, anyone hoping for a sane behavior of a division by 0... will be harshly disappointed.
In the case of integral types (int, short, long, etc.) I can't think of any uses for intentional divide by zero offhand.
However, for floating point types on IEEE-compliant hardware, explicit divide by zero is tremendously useful. You can use it to produce positive & negative infinity (+/- 1/0), and not a number (NaN, 0/0) values, which can be quite helpful.
In the case of sorting algorithms, you can use the infinities as initial values representing greater or less than all possible values.
For data analysis purposes, you can use NaNs to indicate missing or invalid data, which can then be handled gracefully. Matlab, for example, uses explicit NaN values to suppress missing data in plots, etc.
Although you can access these values through macros and std::numeric_limits (in C++), it is useful to be able to create them on your own (and allows you to avoid lots of "special case" code). It also allows implementors of the standard library to avoid resorting to hackery (such as manual assembly of the correct FP bit sequence) to provide these values.
If the compiler detects a division-by-0, there is absolutely nothing wrong with a compiler error. The developers you talked to are wrong - you could apply that logic to every single compile error. There is no point in ever dividing by 0.
Detecting divisions by zero at compile-time is the sort of thing that you'd want to have be a compiler warning. That's definitely a nice idea.
I don't keep no company with Microsoft Visual C++, but G++ 4.2.1 does do such checking. Try compiling:
#include <iostream>
int main() {
int x = 1;
int y = x / 0;
std::cout << y;
return 0;
}
And it will tell you:
test.cpp: In function ‘int main()’:
test.cpp:5: warning: division by zero in ‘x / 0’
But considering it an error is a slippery slope that the savvy know not to spend too much of their spare time climbing. Consider why G++ doesn't have anything to say when I write:
int main() {
while (true) {
}
return 0;
}
Do you think it should compile that, or give an error? Should it always give a warning? If you think it must intervene on all such cases, I eagerly await your copy of the compiler you've written that only compiles programs that guarantee successful termination! :-)

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplication or division by 2 (or powers of 2) of my values. I would really like these values to be int so that I could access the bitshift operators
int a = 1;
int b = a<<24
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is going to be fixed (the software will work on machines that will always have 64 bits long doubles)
P.S : And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit : Or am I completely mistaken and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
while(true)
{
QElapsedTimer timer;
timer.start();
int n=100000000;
volatile double d=12.4;
volatile double D;
for(unsigned int i=0; i<n; ++i)
{
//D = d*32; // 200 ms
//D = d*(1<<5); // 200 ms
D = ldexp (d,5); // 6000 ms
}
qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
}
return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms) whereas D = ldexp (d,5); is much slower (6000 ms). I know that this is a micro benchmark, and that suddenly, my RAM has exploded because Chrome has suddenly asked to compute Pi in my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other had, I'm having trouble doing reinterpret_cast<uint64_t *> because there's a const violation (seems the volatile keyword interferes)
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
union{
uint64 i;
double f;
};
f = 123.;
i += 0x0010000000000000ull;
// Check for zero. And if it matters, denormals as well.
Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing a multiplication outright. This is especially the case for pre-SSE era where the value needs to be stored from the x87 FPU into memory and then read back into the integer registers.
In the SSE era, the Integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According the Agner Fog, there's a 1 to 2 cycle penalty for moving data between the Integer SSE and FP SSE execution units. So the cost is much better than the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now with 256-bit AVX instructions that only support FP instructions, there's even less of an incentive to play tricks like this.
How about ldexp?
Any half-decent compiler will generate optimal code on your platform.
But as #Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarley (esp. when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;
void unsafe_shl(double* d, int shift) {
unsigned long long* i = (unsigned long long*)d;
if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
*i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
} else if (*i) {
*d *= (1 << shift);
}
}
EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped to the minimum executed code:
double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
ds[i] = 1.2;
clock_t t = clock();
for (int j = 0; j != 1000000; j++)
for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
ds[i] *= 1 << 4;
#else
((unsigned int*)&ds[i])[1] += 4 << 20;
#endif
clock_t e = clock();
printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
In the DOUBLE_SHIFT completes in 1.6 seconds, with an inner loop of
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes for VC11 is now it always vectorizes floating point loops, effectively forcing /arch:SSE2, though VC10, even with /arch:SSE2 is still worse with 3.0 seconds with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) is 5.3 seconds... with 1/100th of the iterations!!, inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was aweful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, i.e. matrix ops to SSE or int hacks, since this is the worst case loading into the FP stack, doing one op, and storing from it, but it's a good example for why x87 is not the way to go for anything perf. related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be done by calling an machine instruction to add p to the exponent. Telling the compiler to instead extract the some bits with a mask and doing something manually to it will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift assembly operation, not does using multiplication necessarily compile to multiplication. There's all sorts of weird and wonderful things going on like what registers are being used and what instructions can be run simultaneously which I'm not smart enough to understand. But your compiler, with many man years of knowledge and experience and lots of computational power, is much better at making these judgements.
p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiple 2, or even 4 doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since c++17 you can also use hexadecimal floating literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
Although there is little/no practical benefit to treating powers of two specially for float of double types there is a case for this for double-double types. Double-double multiplication and division is complicated in general but is trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2
In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.
//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.

C and C++: Array element access pointer vs int

Is there a performance difference if you either do myarray[ i ] or store the adress of myarray[ i ] in a pointer?
Edit: The pointers are all calculated during an unimportant step in my program where performance is no criteria. During the critical parts the pointers remain static and are not modified. Now the question is if these static pointers are faster than using myarray[ i ] all the time.
For this code:
int main() {
int a[100], b[100];
int * p = b;
for ( unsigned int i = 0; i < 100; i++ ) {
a[i] = i;
*p++ = i;
}
return a[1] + b[2];
}
when built with -O3 optimisation in g++, the statement:
a[i] = i;
produced the assembly output:
mov %eax,(%ecx,%eax,4)
and this statement:
*p++ = i;
produced:
mov %eax,(%edx,%eax,4)
So in this case there was no difference between the two. However, this is not and cannot be a general rule - the optimiser might well generate completely different code for even a slightly different input.
It will probably make no difference at all. The compiler will usually be smart enough to know when you are using an expression more than once and create a temporary itself, if appropriate.
Compilers can do surprising optimizations; the only way to know is to read the generated assembly code.
With GCC, use -S, with -masm=intel for Intel syntax.
With VC++, use /FA (IIRC).
You should also enable optimizations: -O2 or -O3 with GCC, and /O2 with VC++.
I prefer using myarray[ i ] since it is more clear and the compiler has easier time compiling this to optimized code.
When using pointers it is more complex for the compiler to optimize this code since it's harder to know exactly what you're doing with the pointer.
There should not be much different but by using indexing you avoid all types of different pitfalls that the compiler's optimizer is prone to (aliasing being the most important one) and thus I'd say the indexing case should be easier to handle for the compiler. This doesn't mean that you should take care of aforementioned things before the loop, but pointers in a loop generally just adds to the complexity.
Yes. Having a pointer the address won't be calculated by using the initial address of the array. It will accessed directly. So you have a little performance improve if you save the address in a pointer.
But the compiler will usually optimize the code and use the pointer in both cases (if you have statical arrays)
For dynamic arrays (created with new) the pointer will offer you more performance as the compiler cannot optimize array accesses at compile time.
There will be no substantial difference. Premature optimization is the root of all evil - get a profiler before checking micro-optimizations like this. Also, the myarray[i] is more portable to custom types, such as a std::vector.
Okay so your questions is, whats faster:
int main(int argc, char **argv)
{
int array[20];
array[0] = 0;
array[1] = 1;
int *value_1 = &array[1];
printf("%d", *value_1);
printf("%d", array[1]);
printf("%d", *(array + 1));
}
Like someone else already pointed out, compilers can do clever optimization. Of course this is depending on where an expression is used, but normally you shouldn't care about those subtle differences. All your assumption can be proven wrong by the compiler. Today you shouldn't need to care about such differences.
For example the above code produces the following (only snippet):
mov [ebp+var_54], 1 #store 1
lea eax, [ebp+var_58] # load the address of array[0]
add eax, 4 # add 4 (size of int)
mov [ebp+var_5C], eax
mov eax, [ebp+var_5C]
mov eax, [eax]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000 # points to %d
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
Short answer: the only way to know for sure is to code up both versions and compare performance. I would personally be surprised if there was a measureable difference unless you were doing a lot of array accesses in a really tight loop. If this is something that happens once or twice over the lifetime of the program, or depends on user input, it's not worth worrying about.
Remember that the expression a[i] is evaluated as *(a+i), which is an addition plus a dereference, whereas *p is just a dereference. Depending on how the code is structured, though, it may not make a difference. Assume the following:
int a[N]; // for any arbitrary N > 1
int *p = a;
size_t i;
for (i = 0; i < N; i++)
printf("a[%d] = %d\n", i, a[i]);
for (i = 0; i < N; i++)
printf("*(%p) = %d\n", (void*) p, *p++);
Now we're comparing a[i] to *p++, which is a dereference plus a postincrement (in addition to the i++ in the loop control); that may turn out to be a more expensive operation than the array subscript. Not to mention we've introduced another variable that's not strictly necessary; we're trading a little space for what may or may not be an improvement in speed. It really depends on the compiler, the structure of the code, optimization settings, OS, and CPU.
Worry about correctness first, then worry about readability/maintainability, then worry about safety/reliability, then worry about performance. Unless you're failing to meet a hard performance requirement, focus on making your intent clear and easy to understand. It doesn't matter how fast your code is if it gives you the wrong answer or performs the wrong action, or if it crashes horribly at the first hint of bad input, or if you can't fix bugs or add new features without breaking something.
Yes.. when storing myarray[i] pointer it will perform better (if used on large scale...)
Why??
It will save you an addition and may be a multiplication (or a shift..)
Many compilers may optimize that for you in case of static memory allocation.
If you are using dynamic memory allocation, the compiler will not optimize it, because it is in runtime!