How to get GCC to turn a variable division into a multiplication (if faster) - c++

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Would it be faster to turn the DIV into a MUL and use an array to store the multiplier values? If so, how can I get the compiler to do that optimization?
int main() {
uint a, s=0, i, t;
scanf("%d", &a);
diviuint aa = a;
t = clock();
for (i=0; i<1000000000; i++)
s += i/a;
printf("Result:%10u\n", s);
printf("Time:%12u\n", clock()-t);
return 0;
}
where diviuint(a) stores a precomputed value of 1/a and uses multiplication instead of division.
Using s+=i/aa runs about twice as fast as s+=i/a.

You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is worth the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full multiply, and some constants need more shifts than others (another way of saying that some of the shift counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts.)
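As a rough illustration of doing for a run-time divisor what the compiler does for a compile-time one, here is a minimal sketch for 32-bit unsigned operands. It uses a round-up 64-bit reciprocal and keeps only the high half of the product, which is a simpler variant than the general multiply-and-shift sequence described above. It assumes a compiler that provides unsigned __int128 (gcc or clang on a 64-bit target) and a divisor greater than 1. The names make_inverse_u32, fast_div_u32 and sum_quotients are made up for this example.

```cpp
#include <cstdint>

// Hypothetical helper: precompute ceil(2^64 / d) once per divisor.
// Assumption: d > 1 (d == 1 would wrap the 64-bit inverse to 0).
inline uint64_t make_inverse_u32(uint32_t d) {
    return UINT64_MAX / d + 1;
}

// Hypothetical helper: n / d computed as the high 64 bits of inverse * n.
// Assumption: unsigned __int128 is available (gcc/clang extension).
inline uint32_t fast_div_u32(uint32_t n, uint64_t inverse) {
    return static_cast<uint32_t>(
        (static_cast<unsigned __int128>(inverse) * n) >> 64);
}

// The loop from the question with the division replaced by a multiply.
inline uint32_t sum_quotients(uint32_t a, uint32_t iterations) {
    const uint64_t inv = make_inverse_u32(a);  // computed once, outside the loop
    uint32_t s = 0;
    for (uint32_t i = 0; i < iterations; ++i)
        s += fast_div_u32(i, inv);
    return s;
}
```

Whether this beats a hardware div depends on the microarchitecture, so it is worth benchmarking on the actual target before adopting it.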
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: look for big optimizations by restructuring code or taking advantage of some math before trying to do the exact same thing faster.
Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. Every quotient below n = 1000000000 / a occurs for exactly a values of i, and the last 1000000000 % a iterations each add n. So a closed-form constant-time calculation should give the same answer as the loop for any a, as long as it is computed in 64 bits and then truncated to match the loop's wrapping 32-bit accumulator:
uint32_t n = 1000000000 / a;
uint64_t s = (uint64_t)a * n * (n - 1) / 2 + (uint64_t)n * (1000000000 % a); // truncate to uint32_t to match the loop
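A closed form like this is easy to get subtly wrong, so it is worth checking against the plain loop on small inputs. The sketch below is one way to do that, assuming a >= 1: the quotients 0..n-1 each occur a times, and the remaining count % a iterations each contribute n. The function names are made up for illustration.

```cpp
#include <cstdint>

// Reference: the loop from the question, with a wrapping 32-bit accumulator.
inline uint32_t sum_by_loop(uint32_t a, uint32_t count) {
    uint32_t s = 0;
    for (uint32_t i = 0; i < count; ++i)
        s += i / a;
    return s;
}

// Closed form for the same sum, computed in 64 bits and then truncated.
// Assumption: a >= 1.
inline uint32_t sum_closed_form(uint32_t a, uint32_t count) {
    const uint64_t n = count / a;  // largest quotient that is reached
    const uint64_t r = count % a;  // leftover iterations whose quotient is n
    // Quotients 0..n-1 each occur a times: a * (0 + 1 + ... + (n-1)).
    const uint64_t full_blocks = a * (n * (n - 1) / 2);
    return static_cast<uint32_t>(full_blocks + n * r);
}
```

For example, with a = 3 and count = 10 the quotients are 0,0,0,1,1,1,2,2,2,3, and both functions return 12.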
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
// use i_over_a
++remainder;
if (remainder == a) { // if you don't need the remainder in the loop, it could save an insn or two to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you need a clever variable name other than remainder.
remainder = 0;
++i_over_a;
}
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture. Branchless asm is probably possible but not worth it. e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition. The branchless asm would be simplified by decrementing from a to 0. (don't need a zeroed reg for cmov, and could have a in a reg instead of a-1))

Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know what's the range, so all usual tricks will not work.
Basically you need to handle both a and b between INT_MAX and INT_MIN. There's no space left for scaling them up/down. Even if you wanted to extend them to larger types, it would probably take longer time just to invert b and check that the result will be consistent.

The only way to KNOW if div or mul is faster is by testing both in a benchmark [obviously, if you use your above code, you'd mostly measure the time of read/write of the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster, on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them - and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations - modern CPU's are pretty good at "hiding" latency as long as there are other instructions to execute [so you should use this in an "as realistic scenario as possible"].

There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.
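To make the scaled-table idea concrete, here is a small sketch with a 2^16 scale factor. The names scaled_inverse and approx_div are made up, and the inverse is truncated (rounded down), which is only one possible choice. With that choice some quotients come out one too low, which is the kind of mismatch described above.

```cpp
#include <cstdint>

// Hypothetical table entry: floor(2^16 / d), for d >= 1.
inline uint32_t scaled_inverse(uint32_t d) {
    return (1u << 16) / d;
}

// Approximate n / d: multiply by the scaled inverse, then shift right by 16.
// The product is formed in 64 bits so it cannot overflow.
inline uint32_t approx_div(uint32_t n, uint32_t d) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(n) * scaled_inverse(d)) >> 16);
}
```

With these definitions approx_div(0x10, 9) and approx_div(0x30, 9) agree with real division, but approx_div(9, 9) yields 0 instead of 1.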

Related

Getting the high half and low half of a full integer multiply

I start with three values A,B,C (unsigned 32bit integer). And i have to obtain two values D,E (unsigned 32 bit integer also). Where
D = high(A*C);
E = low(A*C) + high(B*C);
I expect that multiplying two 32-bit uints produces a 64-bit result. "high" and "low" are just my convention for the first 32 bits and the last 32 bits of the 64-bit product.
I am trying to optimize some code that already works. A short part of the code, just a few lines inside a huge loop, consumes almost all of the computation time (a physical simulation that runs for a couple of hours). That's why I am trying to optimize this little part, while the rest of the code can remain more "user-well-arranged".
There are some SSE instructions that fit this computation. The gcc compiler probably does a good job of optimizing. However, I do not rule out writing some piece of code in SSE instructions directly, if necessary.
Please be patient with my limited SSE experience. I will try to write an algorithm for SSE just symbolically. There will probably be some mistakes with ordering masks or understanding the structure.
1. Store four 32-bit integers into one 128-bit register in the order A,B,C,C.
2. Apply an instruction (probably pmuludq) to that 128-bit register which multiplies pairs of 32-bit integers and returns pairs of 64-bit integers as the result. So it should calculate A*C and B*C simultaneously and return two 64-bit values.
3. I expect to have new 128-bit register values P,Q,R,S (four 32-bit blocks) where P,Q is the 64-bit result of A*C and R,S is the 64-bit result of B*C. Then I rearrange the values in the register into the order P,Q,0,R.
4. Take the first 64 bits P,Q and add the second 64 bits 0,R. The result is a new 64-bit value.
5. Read the first 32 bits of the result as D and the last 32 bits of the result as E.
This algorithm should return correct values for E and D.
My question:
Is there plain C++ code that generates an SSE routine similar to the SSE algorithm in steps 1-5 above? I prefer solutions with higher performance. If the algorithm is problematic for standard C++, is there a way to write it in SSE directly?
I use TDM-GCC 4.9.2 64-bit compiler.
(note: Question was modified after advice)
(note2: I have an inspiration in this http://sci.tuomastonteri.fi/programming/sse for using SSE for obtain better performance)
You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.
They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.
Both of these functions compile to fairly good code (one mul or imul per multiply). gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.
That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.
I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)
When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)
// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>
struct DE { uint32_t D,E; };
struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
uint64_t AC = A * (uint64_t)C;
uint64_t BC = B * (uint64_t)C;
uint32_t D = AC >> 32; // high half
uint32_t E = AC + (BC >> 32); // We could cast to uint32_t before adding, but don't need to
struct DE retval = { D, E };
return retval;
}
#ifdef __SIZEOF_INT128__ // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D,E; };
struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
__uint128_t AC = A * (__uint128_t)C;
__uint128_t BC = B * (__uint128_t)C;
uint64_t D = AC >> 64; // high half
uint64_t E = AC + (BC >> 64);
struct DE64 retval = { D, E };
return retval;
}
#endif
If I understand it correctly, you want to compute number of potential overflows in A*B. If yes then you have 2 good options - the "use twice as big variable" (write 128bit math function for uint64 - it's not that hard (or wait for me to post it tomorrow)), and the "use floating point type":
(float(A)*float(B))/float(C)
as the loss of precision is minimal (assuming float is 4 bytes, double 8 bytes, and long double 16 bytes long) , and both float and uint32 require 4 bytes of memory (use double for uint64_t as it should be 8 bytes long):
#include <iostream>
#include <conio.h>
#include <stdint.h>
using namespace std;
int main()
{
uint32_t a(-1), b(-1);
uint64_t result1;
float result2;
result1 = uint64_t(a)*uint64_t(b)/4294967296ull; // >>32 would be faster and less memory consuming
result2 = float(a)*float(b)/4294967296.0f;
cout.precision(20);
cout<<result1<<'\n'<<result2;
getch();
return 0;
}
Produces:
4294967294
4294967296
But if you want really precise and correct answer I'd suggest using twice as big type for computing
Now that I think of it - you could use long double for uint64 and double for uint32 instead of writing function for uint64, but I don't think it's guaranteed that long double will be 128bit, and you'll have to check it. I'd go for more universal option.
EDIT:
You can write function to calculate that without using anything more
than A, B and result variable which would be of the same type as A.
Just add rightmost bit of (where Z equals B*(A>>pass_number&1)) Z<<0,
Z<<1, Z<<2 (...) Z<<X in first pass, Z<<-1, Z<<0, Z<<1 (...) Z<<(X-1)
for second (there should be X passes), while right shifting the result
by 1 (the just computed bit becomes irrelevant to us after it's
computed as it won't participate in calculation anymore, and it would
be erased anyway after dividing by 2^X (doing >>X)
(had to place in the "code" as I'm new here and couldn't find another way to prevent formatting script from eating half of it)
It's just a quick idea. You'll have to check its correctness (sorry, but I'm really tired right now - but the result shouldn't overflow at any point of calculation, as the maximum carry would have value of 2X if I'm correct, and the algorithm itself seems to be good).
I will write code for that tomorrow if you'll still be in need of help.

Karatsuba multiplication improvement

I have implemented the Karatsuba multiplication algorithm for educational purposes. Now I am looking for further improvements. I have implemented a kind of long arithmetic, and it works well as long as the base of the integer representation is no more than 100.
With base 10 and compiling with clang++ -O3 multiplication of two random integers in range [10^50000, 10^50001] takes:
Naive algorithm took me 1967 cycles (1.967 seconds)
Karatsuba algorithm took me 400 cycles (0.4 seconds)
And the same numbers with base 100:
Naive algorithm took me 409 cycles (0.409 seconds)
Karatsuba algorithm took me 140 cycles (0.14 seconds)
Is there a way to improve these results?
I currently use this function to finalize my result:
void finalize(vector<int>& res) {
// Stop one short of the end so that res[i + 1] stays in bounds; assumes the top digit needs no further carry.
for (size_t i = 0; i + 1 < res.size(); ++i) {
res[i + 1] += res[i] / base;
res[i] %= base;
}
}
As you can see, each step calculates the carry and pushes it to the next digit. If I take base >= 1000, the result overflows.
If you look at my code, I use vectors of int to represent a long integer. Depending on the base, a number is split into separate elements of the vector.
Now I see several options:
to use long long type for vector, but it might also be overflowed for vast length integers
implement representation of carry in long arithmetic
After seeing some comments I decided to expand the question. Assume that we want to represent our long integer as a vector of ints. For instance:
ULLONG_MAX = 18446744073709551615
As input we pass the 210th Fibonacci number 34507973060837282187130139035400899082304280, which does not fit in any standard type. If we represent it in a vector of int with base 10000000, it looks like:
v[0]: 2304280
v[1]: 89908
v[2]: 1390354
v[3]: 2187130
v[4]: 6083728
v[5]: 5079730
v[6]: 34
And when we do multiplication we may get (for simplicity let it be two identical numbers)
(34507973060837282187130139035400899082304280)^2:
v[0] * v[0] = 5309706318400
...
v[0] * v[4] = 14018612755840
...
It was only the first row and we have to do the six steps like that. Certainly, some step will cause overflow during multiplication or after carry calculation.
If I missed something, please, let me know and I will change it.
If you want to see full version, it is on my github
Base 2^64 and base 2^32 are the most popular bases for doing high precision arithmetic. Usually, the digits are stored in an unsigned integral type, because they have well-behaved semantics with regard to overflow.
For example, one can detect the carry from an addition as follows:
uint64_t x, y; // initialize somehow
uint64_t sum = x + y;
uint64_t carry = sum < x; // 1 if true, 0 if false
Also, assembly languages usually have a few "add with carry" instructions; if you can write inline assembly (or have access to intrinsics) you can take advantage of these.
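The same comparison trick extends to whole numbers stored as vectors of digits. The following is a minimal sketch assuming little-endian digits in base 2^64; add_limbs is an illustrative name, not something from the question's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two little-endian numbers whose digits ("limbs") are in base 2^64.
inline std::vector<uint64_t> add_limbs(const std::vector<uint64_t>& a,
                                       const std::vector<uint64_t>& b) {
    const std::size_t n = a.size() > b.size() ? a.size() : b.size();
    std::vector<uint64_t> result;
    result.reserve(n + 1);
    uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const uint64_t x = i < a.size() ? a[i] : 0;
        const uint64_t y = i < b.size() ? b[i] : 0;
        uint64_t sum = x + y;            // wraps modulo 2^64
        uint64_t carry_out = sum < x;    // 1 if x + y wrapped
        sum += carry;
        carry_out += sum < carry;        // 1 if adding the incoming carry wrapped
        result.push_back(sum);
        carry = carry_out;               // at most 1: both wraps cannot happen together
    }
    if (carry != 0)
        result.push_back(carry);
    return result;
}
```

Because unsigned overflow is well defined, no digit ever needs a wider type just to hold the carry.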
For multiplication, most computers have machine instructions that can compute a one machine word -> two machine word product; sometimes, the instructions to get the two halves are called "multiply hi" and "multiply low". You need to write assembly to get them, although many compilers offer larger integer types whose use would let you access these instructions: e.g. in gcc you can implement multiply hi as
uint64_t mulhi(uint64_t x, uint64_t y)
{
return ((__uint128_t) x * y) >> 64;
}
When people can't use this, they do multiplication in base 2^32 instead, so that they can use the same approach to implement a portable mulhi, using uint64_t as the double-digit type.
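If a 64-bit mulhi is still wanted on a compiler without a 128-bit type, it can be assembled from four 32x32->64 products. A minimal sketch follows; mulhi_portable is an illustrative name.

```cpp
#include <cstdint>

// High 64 bits of the 128-bit product x * y, using only 64-bit arithmetic.
inline uint64_t mulhi_portable(uint64_t x, uint64_t y) {
    const uint64_t x_lo = static_cast<uint32_t>(x), x_hi = x >> 32;
    const uint64_t y_lo = static_cast<uint32_t>(y), y_hi = y >> 32;

    const uint64_t lo_lo = x_lo * y_lo;
    const uint64_t hi_lo = x_hi * y_lo;
    const uint64_t lo_hi = x_lo * y_hi;
    const uint64_t hi_hi = x_hi * y_hi;

    // Sum the terms that land in bits 32..95. This cannot overflow:
    // (2^32 - 1) + (2^32 - 1) + (2^32 - 1)^2 == 2^64 - 1.
    const uint64_t cross =
        (lo_lo >> 32) + static_cast<uint32_t>(hi_lo) + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```

On gcc and clang the __uint128_t version is usually the better choice, since it compiles to a single multiply instruction.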
If you want to write efficient code, you really need to take advantage of these bigger multiply instructions. Multiplying digits in base 2^32 is more than ninety times more powerful than multiplying digits in base 10. Multiplying digits in base 2^64 is four times more powerful than that. And your computer can probably do these more quickly than whatever you implement for base 10 multiplication.
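As a sketch of what this looks like for the question's vector representation, here is a schoolbook multiply with digits in base 2^32 and a uint64_t intermediate. It is the naive O(n*m) algorithm rather than Karatsuba, and mul_limbs is a made-up name. The point is that the digit product plus both carries always fits in 64 bits, so the overflow problem from the question does not arise.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Multiplies two little-endian numbers whose digits are in base 2^32.
inline std::vector<uint32_t> mul_limbs(const std::vector<uint32_t>& a,
                                       const std::vector<uint32_t>& b) {
    std::vector<uint32_t> res(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            // Worst case: (2^32 - 1) + (2^32 - 1)^2 + (2^32 - 1) == 2^64 - 1.
            const uint64_t cur =
                res[i + j] + static_cast<uint64_t>(a[i]) * b[j] + carry;
            res[i + j] = static_cast<uint32_t>(cur);  // low digit stays in place
            carry = cur >> 32;                        // high digit moves on
        }
        res[i + b.size()] = static_cast<uint32_t>(carry);
    }
    return res;
}
```

The result always has a.size() + b.size() digits, so the top digit may be zero.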

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplication or division by 2 (or powers of 2) of my values. I would really like these values to be int so that I could access the bitshift operators
int a = 1;
int b = a<<24
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is going to be fixed (the software will work on machines that will always have 64 bits long doubles)
P.S : And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit : Or am I completely mistaken and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
while(true)
{
QElapsedTimer timer;
timer.start();
int n=100000000;
volatile double d=12.4;
volatile double D;
for(unsigned int i=0; i<n; ++i)
{
//D = d*32; // 200 ms
//D = d*(1<<5); // 200 ms
D = ldexp (d,5); // 6000 ms
}
qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
}
return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms) whereas D = ldexp (d,5); is much slower (6000 ms). I know that this is a micro benchmark, and that suddenly, my RAM has exploded because Chrome has suddenly asked to compute Pi in my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other hand, I'm having trouble doing reinterpret_cast<uint64_t *> because there's a const violation (seems the volatile keyword interferes).
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
union{
uint64 i;
double f;
};
f = 123.;
i += 0x0010000000000000ull;
// Check for zero. And if it matters, denormals as well.
Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
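For experimenting with the idea in portable C++, the bits can be copied with std::memcpy instead of going through a union or intrinsics. This is only a sketch: it assumes an IEEE 754 64-bit double, and it does no checking for zero, denormals, infinities, NaN or exponent overflow. The name scale_by_pow2_bits is made up.

```cpp
#include <cstdint>
#include <cstring>

// Multiplies x by 2^p by adding p to the exponent field.
// Assumptions: IEEE 754 binary64; x and the result are normal, finite, nonzero.
inline double scale_by_pow2_bits(double x, int p) {
    static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);     // well-defined way to read the bits
    // The exponent field starts at bit 52. A negative p wraps around as an
    // unsigned value, so the addition then acts as a subtraction.
    bits += static_cast<uint64_t>(p) << 52;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Compilers typically turn the memcpy calls into plain register moves, so the cost that remains is the FP-to-integer transfer discussed next.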
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing a multiplication outright. This is especially the case for pre-SSE era where the value needs to be stored from the x87 FPU into memory and then read back into the integer registers.
In the SSE era, the Integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1 to 2 cycle penalty for moving data between the Integer SSE and FP SSE execution units. So the cost is much better than the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now with 256-bit AVX instructions that only support FP instructions, there's even less of an incentive to play tricks like this.
How about ldexp?
Any half-decent compiler will generate optimal code on your platform.
But as @Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
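One caveat with 1 << p is that it overflows a 32-bit int once p reaches 31, and it cannot express negative powers. A possible workaround, sketched below under the assumption that the same power is applied to many values, is to build the factor once with std::ldexp outside the hot loop and use a plain multiply inside it. The name scaled_by_pow2 is made up.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns a copy of values with every element multiplied by 2^p.
inline std::vector<double> scaled_by_pow2(std::vector<double> values, int p) {
    const double factor = std::ldexp(1.0, p);  // 2^p, computed once
    for (std::size_t i = 0; i < values.size(); ++i)
        values[i] *= factor;                   // plain multiply in the loop
    return values;
}
```

This keeps the one slow library call out of the loop while leaving the compiler free to vectorize the multiplications.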
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (esp. when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;
void unsafe_shl(double* d, int shift) {
unsigned long long* i = (unsigned long long*)d;
if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
*i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
} else if (*i) {
*d *= (1 << shift);
}
}
EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped to the minimum executed code:
double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
ds[i] = 1.2;
clock_t t = clock();
for (int j = 0; j != 1000000; j++)
for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
ds[i] *= 1 << 4;
#else
((unsigned int*)&ds[i])[1] += 4 << 20;
#endif
clock_t e = clock();
printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
The DOUBLE_SHIFT version completes in 1.6 seconds, with an inner loop of
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes for VC11 is now it always vectorizes floating point loops, effectively forcing /arch:SSE2, though VC10, even with /arch:SSE2 is still worse with 3.0 seconds with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) is 5.3 seconds... with 1/100th of the iterations!!, inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was awful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading into the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything perf-related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be done by issuing a machine instruction that adds p to the exponent. Telling the compiler to instead extract some bits with a mask and manipulate them manually will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift assembly operation, nor does using multiplication necessarily compile to multiplication. There are all sorts of weird and wonderful things going on, like which registers are being used and which instructions can run simultaneously, that I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.
p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply 2, or even 4 doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since c++17 you can also use hexadecimal floating literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
Although there is little/no practical benefit to treating powers of two specially for float or double types, there is a case for this for double-double types. Double-double multiplication and division is complicated in general but is trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2
In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.
//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
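The overloads themselves are not shown, so the following is only a guess at what they might look like. It scales both halves by the same power of two, built with std::ldexp, and ignores overflow and underflow of either half.

```cpp
#include <cmath>

struct doubledouble {
    double hi;
    double lo;
};

// Hypothetical overload: multiply x by 2^p.
inline doubledouble operator<<(doubledouble x, int p) {
    const double s = std::ldexp(1.0, p);
    return doubledouble{x.hi * s, x.lo * s};
}

// Hypothetical overload: divide x by 2^p.
inline doubledouble operator>>(doubledouble x, int p) {
    const double s = std::ldexp(1.0, -p);
    return doubledouble{x.hi * s, x.lo * s};
}
```

Scaling by a power of two is exact for each half (barring overflow or underflow), so no renormalization step is needed afterwards.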
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.

How efficient is an if statement compared to a test that doesn't use an if? (C++)

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"
int a, b, low;
if (a < b) low = a;
else low = b;
is more or less efficient than this:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)
One advantage of eliminating an if statement is that you avoid branch prediction penalties.
Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like
a = random() % 10
if (a < 5)
print "Less"
else
print "Greater"
then this branch is not easily predicted, and will frequently incur the misprediction penalty of flushing the pipeline and discarding instructions that were speculatively executed on the wrong side of the branch.
One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.
So
int a, b, low;
if (a < b) low = a;
else low = b;
becomes
int a, b, low;
low = (a < b) ? a : b;
and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.
Of course, this is a micro-optimization which is unlikely to have significant impact on your code.
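For anyone who wants to compare the generated code, here are the three variants side by side as small functions. This is just a sketch: the function names are made up, and the bit-twiddling version assumes that a - b does not overflow and that >> on a negative int is an arithmetic shift (true on mainstream compilers, but only guaranteed by the standard from C++20).

```cpp
#include <climits>

// Plain if/else.
inline int min_if(int a, int b) {
    if (a < b) return a;
    return b;
}

// Ternary operator; compilers often emit a conditional move for this.
inline int min_ternary(int a, int b) {
    return (a < b) ? a : b;
}

// Bit-twiddling version from the question, with the shift count derived
// from the width of int instead of a hard-coded 31.
inline int min_bits(int a, int b) {
    const int delta = a - b;  // assumption: no overflow
    const int sign_mask = delta >> (sizeof(int) * CHAR_BIT - 1);  // all ones if a < b
    return b + (delta & sign_mask);
}
```

Compiling these with optimization enabled and reading the assembly is a quick way to see whether the first two end up identical on a given compiler.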
Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.
Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.
Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:
int foo1(int a, int b)
{
int low;
if (a < b) low = a;
else low = b;
return low;
}
int foo2(int a, int b)
{
int low;
low = b + ((a - b) & ((a - b) >> 31));
return low;
}
I get:
foo1:
cmpl %edi, %esi
cmovle %esi, %edi
movl %edi, %eax
ret
foo2:
subl %esi, %edi
movl %edi, %eax
sarl $31, %eax
andl %edi, %eax
addl %esi, %eax
ret
...which I'm pretty sure won't count for branch predictions, since the code doesn't jump. Also, the non if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.
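One more thing worth checking before trusting the second form: it carries hidden assumptions that the plain comparison does not. Below is a small sketch in C with hypothetical names (min_branch, min_bits, difference_fits) that spells them out. It assumes a 32-bit int and an arithmetic right shift of negative values, which C leaves implementation-defined.

```c
#include <limits.h>

/* Readable minimum; compilers commonly turn this into a conditional move. */
int min_branch(int a, int b)
{
    return (a < b) ? a : b;
}

/* Bit-twiddling minimum from the question.
   Assumes a 32-bit int, an arithmetic right shift of negative values,
   and that a - b does not overflow. */
int min_bits(int a, int b)
{
    int diff = a - b;
    return b + (diff & (diff >> 31));
}

/* Returns 1 if a - b is representable in an int, so min_bits is safe to call. */
int difference_fits(int a, int b)
{
    if (b > 0)
        return a >= INT_MIN + b;   /* a - b cannot go below INT_MIN */
    return a <= INT_MAX + b;       /* a - b cannot go above INT_MAX */
}
```

For inputs such as a = INT_MAX, b = -1 the subtraction overflows, so the bit trick is not a drop-in replacement for the comparison.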
Like with any low-level optimization, test it on the target CPU/board setup.
On my compiler (gcc 4.5.1 on x86_64), the first example becomes
cmpl %ebx, %eax
cmovle %eax, %esi
The second example becomes
subl %eax, %ebx
movl %ebx, %edx
sarl $31, %edx
andl %ebx, %edx
leal (%rdx,%rax), %esi
Not sure if the first one is faster in all cases, but I would bet it is.
The biggest problem is that your second example hard-codes a shift of 31, so it won't work on machines where int is not 32 bits wide (for example, where int is 64 bits).
However, even neglecting that, modern compilers are smart enough to consider branchless code generation in every case possible, and compare the estimated speeds. So your second example will most likely actually be slower.
There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.
[Edit] Because I think this is such an interesting topic, I've written a blog post on it.
Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.
I would profile the application and concentrate your optimization efforts on something more worthwhile.
Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.
For simple statements like this, I find the ternary operator very intuitive:
low = (a < b) ? a : b;
Clear and concise.
For something as simple as this, why not just experiment and try it out?
Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.
I wrote a simple program that compares both techniques passing in random numbers (so that we don't see perfect branch prediction) with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.
One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure
int foo (int a, int b) {
return ((a < b) ? a : b);
}
is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes
int bar = foo (x, x+3);
After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get
int bar = x;
It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.
One thing I will point out that I haven't seen mentioned is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can use multiple ALUs to evaluate parts of the expression in parallel. Non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions, which allow the first code to be made branchless too.)
On embedded CPUs branching is often less expensive (relative to everything else), and they don't have many spare ALUs to evaluate operations out of order (if they support out-of-order execution at all). Less code/data is better, since caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of data.)
Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.
But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.
The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.
Why low = a; in the if and low = a; in the else? And, why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of different size?
The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.
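On the "why 31" point: the shift count is just the index of the sign bit of a 32-bit int. A sketch of how the same trick could be written without hard-coding the width follows; the macro and function names are made up, and it assumes int has no padding bits, that right-shifting a negative int is arithmetic, and that a - b does not overflow.

```c
#include <limits.h>

/* Index of the sign bit for whatever width int has on this platform.
   Assumes int has no padding bits. */
#define SIGN_SHIFT ((int)(sizeof(int) * CHAR_BIT) - 1)

int min_bits_portable(int a, int b)
{
    int diff = a - b;                          /* assumes no overflow */
    return b + (diff & (diff >> SIGN_SHIFT));  /* assumes arithmetic shift */
}
```

This removes the dependency on a 32-bit int but not the other assumptions, which is another argument for the readable if..else.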
profile results with gcc -o foo -g -p -O0, Solaris 9 v240
%Time  Seconds  Cumsecs    #Calls  msec/call  Name
 36.8     0.21     0.21   8424829     0.0000  foo2
 28.1     0.16     0.37         1   160.      main
 17.5     0.10     0.47  16850667     0.0000  _mcount
 17.5     0.10     0.57   8424829     0.0000  foo1
  0.0     0.00     0.57         4     0.      atexit
  0.0     0.00     0.57         1     0.      _fpsetsticky
  0.0     0.00     0.57         1     0.      _exithandle
  0.0     0.00     0.57         1     0.      _profil
  0.0     0.00     0.57      1000     0.000   rand
  0.0     0.00     0.57         1     0.      exit
code:
int
foo1 (int a, int b, int low)
{
if (a < b)
low = a;
else
low = b;
return low;
}
int
foo2 (int a, int b, int low)
{
low = (a < b) ? a : b;
return low;
}
int main()
{
int low=0;
int a=0;
int b=0;
int i=500;
while (i--)
{
for(a=rand(), b=rand(); a; a--)
{
low=foo1(a,b,low);
low=foo2(a,b,low);
}
}
return 0;
}
Based on this data, in the above environment, several beliefs stated here turned out to be the opposite of true. Note the 'in this environment': here the if construct was faster than the ternary ?: construct.
I wrote a ternary logic simulator not so long ago, and this question was relevant to me, as it directly affects my interpreter's execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. The most significant bit means negative one and the least significant bit means positive one. Case "11" should not occur, but it must be handled properly and treated as 0.
Consider the function inline int bct_decoder( unsigned bctData ), which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's investigate them.
The first is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because it relies on the branch predictor. Even worse, it varies, because it is not known a priori whether there will be one branch or two. Here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most likely case first -> make zero the fastest branch
( lsB > msB ) ? 1 : -1;
}
This is the slowest version; it can involve 2 branches in the worst case, and this is where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and after, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)
The next one relies on the modulo operator. Its speed is somewhere between the first and the third, but it is definitely faster than the first - 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
The next one is a branchless approach that involves only arithmetic, thus math; it needs no jump instructions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what it should, and behaves really well. For comparison, the performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.
If this is not enough, we can go further with something special. Next is the lookup table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so the lookup table has only 4 entries (16 bytes with 4-byte ints) and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS; this is where my measurement accuracy starts to go down. That is a 1.5x speedup over the fast math approach, because you just have a precalculated result and a single AND instruction. Sadly, caches are small, and if your index is longer than several bits you simply cannot use this approach.
So I think I answered your question about what branched/branchless code could look like. The answer is much better with detailed samples, a real-world application and real performance measurements.
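If you want to convince yourself that the four variants really are interchangeable, a quick check along these lines may help. This is only a sketch: the shortened names and the decoders_agree helper are mine, and the functions are static so the snippet also links as plain C. Note that agreement is only checked for the four valid 2-bit inputs; the lut version masks its input while the others do not.

```c
/* The four decoders from above, renamed and made static for a standalone C file. */
static int decode_cond(unsigned d)
{
    unsigned lsB = d & 1;
    unsigned msB = d >> 1;
    return (lsB == msB) ? 0 : (lsB > msB) ? 1 : -1;
}

static int decode_mod(unsigned d)
{
    return (int)((d + 1) % 3) - 1;
}

static int decode_math(unsigned d)
{
    return (int)(d & 1) - (int)(d >> 1);
}

static int decode_lut(unsigned d)
{
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[d & 0x3];
}

/* Hypothetical helper: returns 1 if all decoders agree on inputs 0..3. */
int decoders_agree(void)
{
    for (unsigned d = 0; d < 4; d++) {
        int expected = decode_lut(d);
        if (decode_cond(d) != expected ||
            decode_mod(d) != expected ||
            decode_math(d) != expected)
            return 0;
    }
    return 1;
}
```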
Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.
TLDR summary: avoiding ifs can help with vectorization.
Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that are doing this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization AFAIK).
Using the OP's provided example:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
Many compilers can vectorize this to be something approximately equivalent to:
__m128i low128i(__m128i a, __m128i b){
__m128i diff, tmp;
diff = _mm_sub_epi32(a,b);
tmp = _mm_srai_epi32(diff, 31);
tmp = _mm_and_si128(tmp,diff);
return _mm_add_epi32(tmp,b);
}
This optimization would require the data to be laid out in a fashion that would allow for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and even unroll loops further to take advantage of additional registers) or other SIMD instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5, so really fast relative to non-vectorized code).
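For reference, the kind of scalar source a compiler would start from might look like the sketch below. The function name min_arrays and the array layout are assumptions of mine, not something from the question. It assumes a 32-bit int, an arithmetic right shift, and that a[i] - b[i] never overflows; whether it is actually vectorized depends on the compiler and flags.

```c
#include <stddef.h>

/* Element-wise minimum of two arrays with no if in the loop body.
   Contiguous arrays give the compiler a layout it can process several
   elements at a time. */
void min_arrays(const int *a, const int *b, int *low, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int diff = a[i] - b[i];                 /* assumes no overflow */
        low[i] = b[i] + (diff & (diff >> 31));  /* assumes arithmetic shift */
    }
}
```

Qualifying the pointers with restrict may help the compiler prove that low does not alias a or b.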
I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput and other architectures like arm may lack it entirely) but it could be done by doing something like:
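void lowhi128i(__m128i *a, __m128i *b){ // leaves the low values in *a and the high values in *b
__m128i _a=*a, _b=*b;
__m128i swapmask = _mm_cmpgt_epi32(_a,_b); // lanes where a > b have to be swapped
_mm_maskmoveu_si128(_b,swapmask,(char*)a); // write b over a in those lanes
_mm_maskmoveu_si128(_a,swapmask,(char*)b); // write the original a over b in those lanes
}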
However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.
Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.
My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.
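If it is for GNU C++, try this (note that <? was a GNU extension that was deprecated and then removed in GCC 4.2, so it only compiles with older g++ versions):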
int min = i <? j;
I have not profiled it but I think it is definitely the one to beat.

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
The MSVC++ 2008 compiler is being used, for reference... though I'd assume the compiler's sqrt is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
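As a rough illustration of the first point, here is a sketch in C of a Newton-Raphson square root where the caller picks the number of iterations. The function name and the crude starting guess are my own assumptions. Each step contains a division, so this is not guaranteed to beat a hardware sqrt; measure before using anything like it.

```c
/* Approximate sqrt(x) with a fixed number of Newton-Raphson steps:
   y <- 0.5 * (y + x / y). Fewer iterations means less accuracy. */
float approx_sqrt(float x, int iterations)
{
    if (x <= 0.0f)
        return 0.0f;                         /* treat non-positive input as 0 */
    float y = (x > 1.0f) ? x * 0.5f : 1.0f;  /* crude starting point (an assumption) */
    for (int i = 0; i < iterations; i++)
        y = 0.5f * (y + x / y);
    return y;
}
```

A lookup table, as in the second point, would replace the crude starting guess so that fewer iterations are needed.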
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE's sqrtss is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, except if they use a watch to measure.
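The arithmetic behind that 2.5% can be written down as a one-liner. The function name is made up, and the model assumes the rest of the program is unaffected by the change.

```c
/* Fraction of total run time saved when a part that accounts for `share`
   of the run time is replaced by a version that needs `relative_cost`
   of its old time. */
double total_time_saved(double share, double relative_cost)
{
    return share * (1.0 - relative_cost);
}
```

With share = 0.10 and relative_cost = 0.75 this gives 0.10 * 0.25 = 0.025. Even an infinitely fast sqrt (relative_cost = 0) would save at most the 10% share.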
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
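For readers who have not seen the technique, here is a sketch of the general idea in C. This is my own rewrite, not the Quake3 source: the function names are made up, it uses memcpy instead of pointer casts, and it assumes x > 0 and IEEE 754 single-precision floats. The constant 0x5f3759df is the commonly quoted magic number for this trick.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0: an integer trick on the float's bit
   pattern gives a first guess, then one Newton-Raphson step refines it. */
float approx_rsqrt(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess for 1/sqrt(x) */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);   /* one refinement step */
    return y;
}

/* sqrt(x) = x * (1/sqrt(x)) for x > 0. */
float approx_sqrt_via_rsqrt(float x)
{
    return x * approx_rsqrt(x);
}
```

The result is only accurate to a fraction of a percent. Modern CPUs also offer approximate reciprocal square root instructions (rsqrtss on x86, for instance), so it is worth measuring whether a trick like this still pays off.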
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
// LUT for fast sqrt of floats. The table consists of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only the 11 most significant bits of the float's 23 mantissa bits. We could use 15 bits instead; that gives less error but takes more memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Once initialized will be true
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
unsigned short i;
float f;
UINT32 *fi = (UINT32*)&f;
if (is_sqrttab_initialized)
return;
const int halfTableSize = (tableSize>>1);
for (i=0; i < halfTableSize; i++){
*fi = 0;
*fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
// Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
f = sqrtf(f);
sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
// Repeat the process, this time with an exponent of 1, stored as 128
*fi = 0;
*fi = (i << nUnusedBits) | (128 << 23);
f = sqrtf(f);
sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
}
is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
if (n <= 0.f)
return 0.f; // On 0 or negative return 0.
UINT32 *num = (UINT32*)&n;
short e; // Exponent
e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
*num &= 0x7fffff; // leave only the mantissa
// If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit.
const int halfTableSize = (tableSize>>1);
const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
if (e & 0x01)
*num |= secondHalphTableIdBit;
e >>= 1; // Divide the exponent by two (note: this relies on the right shift being sign-preserving for signed operands, which is implementation-defined in C)
// Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
*num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
return n;
}
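The pointer casts above read a float's bits through a UINT32 pointer, which breaks C's aliasing rules even though it works on many compilers. A sketch of an alternative using memcpy follows. The helper names are hypothetical and it assumes IEEE 754 single precision, the same layout the table code relies on.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a float into an integer and back. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Exponent with the bias of 127 removed (normal numbers only). */
int unbiased_exponent(float f)
{
    return (int)((float_bits(f) >> 23) & 0xff) - 127;
}

/* The 23 stored mantissa bits. */
uint32_t mantissa_bits(float f)
{
    return float_bits(f) & 0x7fffffu;
}
```

Compilers typically turn these small memcpy calls into plain register moves, so this form should not cost anything compared with the cast.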