Alternative to using % operator and / Operator in C++

Alternative to using % operator and / Operator in C++ - c++

It is told that modulo operator "%" and divide operator "/" are very inefficient in embedded C++.
How can I alternatively achieve the following expression:
a = b % c;
I understand that this can be achieved using the following logic:
a = b - c;
while (a >= c) {
a = a - c;
}
But my question is, is this code involving while loops efficient enough, compared to % operator?
Thanks,
Kirti

Division and modulus are indeed costly hardware operations, whatever you do (this is more related to hardware architecture than to languages or compilers), perhaps ten times slower than addition.
However, on current laptops or servers, and on high-end microcontrollers, cache misses are often much slower than divisions!
The GCC compiler is often able to optimize them, when the divisor is a constant.
Your naive loop is usually much more slower than using the hardware division instruction (or the library routine doing it, if not provided by hardware). I believe you are wrong in avoiding the division & replacing it with your loop.
You might tune your algorithms -e.g. by having power of twos- but I don't recommend using your code. Remember that premature optimization is evil so first try to get your program correct, then profile it to find the trouble spots.

Nothing is going to be considerably more efficient than the % operator. If there was a better way to do it, then any reasonable compiler would automatically convert it. When you're told that % and / are inefficient, that's just because those are difficult operations - if you need to perform a modulo, then do that.
There may be special cases when there are better ways - for example, mod a power of two can be written as a binary or - but those are probably optimized by your compiler.

That code will almost certainly be slower than however your processor/compiler decides to perform the divide/mod. Generally, shortcuts are pretty hard to come by for basic arithmetic operators, since the mcu/cpu designers and compiler programmers are pretty good at optimizing this for almost all applications.
One common shortcut in embedded devices (where every cycle/byte can make a difference) is to keep everything in terms of base-2 to use the bit shift operators to perform multiplication and division, and the bitwise and (&) to perform modulo.
Examples:
unsigned int x = 100;
unsigned int y1 = x << 4; // same as x * 2^4 = x*16
unsigned int y2 = x >> 6; // same as x / 2^6 = x/64
unsigned int y3 = x & 0x07; // same as x % 8

If the divisor is known at compile time, the operation can be transformed into a multiplication by a reciprocal, with some shifts, adds, and other fast operations. This will be faster on any modern processor, even if it implements division in hardware. Embedded targets usually have highly optimized routines for divide / modulo, since these operations are required by the standard.

If you have profiled your code carefully and found that a modulo operator is the major cost in an inner loop then there is an optimisation that might help. You might be already familiar with the trick for determining the sign of an integer using arithmetic left shifts (for 32 bit values):
sign = ( x >> 31 ) | 1;
This extends the sign bit across the word, so negative values yield -1 and positive values 0. Then bit 0 is set so that positive values result in 1.
If we're only incrementing values by a quantity that is less than the modulo then this same trick can be used to wrap the result:
val += inc;
val -= modulo & ( static_cast< int32_t >( ( ( modulo - 1 ) - val ) ) >> 31 );
Alternatively, if you are decrementing by values less than the modulo then the relevant code is:
int32_t signedVal = static_cast< int32_t >( val - dec );
val = signedVal + ( modulo & ( signedVal >> 31 ) );
I've added the static_cast operators because I was passing in uint32_t, but you might not find them necessary.
Does this help much as opposed to a simple % operator? That depends on your compiler and CPU architecture. I found a simple loop ran 60% faster on my i3 processor when compiled under VS2012, however on the ARM11 chip in the Raspberry Pi and compiling with GCC I only got a 20% improvement.

Division by a constant can be achieved by a shift if a power of 2 or a mul add shift combination for others.
http:// masm32.com/board/index.php?topic=9937.0 has x86 assembly version as well as C source in download from first post. that generates this code for you.

Related

How to let GCC compiler turn variable-division into mul(if faster)

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV turn into MUL and use an array to store the mulvalue? If so, how to let the compiler do the optimization?
int main() {
uint a, s=0, i, t;
scanf("%d", &a);
diviuint aa = a;
t = clock();
for (i=0; i<1000000000; i++)
s += i/a;
printf("Result:%10u\n", s);
printf("Time:%12u\n", clock()-t);
return 0;
}
where diviuint(a) make a memory of 1/a and use multiple instead
Using s+=i/aa makes the speed 2 times of s+=i/a

You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is with the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: Look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing faster. Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form simple constant time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a;
uint32_t s = a * n*(n+1)/2 + 1000000000 % a;
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
// use i_over_a
++remainder;
if (remainder == a) { // if you don't need the remainder in the loop, it could save an insn or two to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you need a clever variable name other than remainder.
remainder = 0;
++i_over_a;
}
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture. Branchless asm is probably possible but not worth it. e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition. The branchless asm would be simplified by decrementing from a to 0. (don't need a zeroed reg for cmov, and could have a in a reg instead of a-1))

Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know what's the range, so all usual tricks will not work.
Basically you need to handle both a and b between INT_MAX and INT_MIN. There's no space left for scaling them up/down. Even if you wanted to extend them to larger types, it would probably take longer time just to invert b and check that the result will be consistent.

The only way to KNOW if div or mul is faster is by testing both in a benchmark [obviously, if you use your above code, you'd mostly measure the time of read/write of the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster, on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them - and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations - modern CPU's are pretty good at "hiding" latency as long as there are other instructions to execute [so you should use this in an "as realistic scenario as possible"].

There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.

Safest and most efficient way to compute an integer operation that may overflow

Suppose we have 2 constants A & B and a variable i, all 64 bits integers. And we want to compute a simple common arithmetic operation such as:
i * A / B (1)
To simplify the problem, let's assume that variable i is always in the range [INT64_MIN*B/A, INT64_MAX*B/A], so that the final result of the arithmetic operation (1) does not overflow (i.e.: fits in the range [INT64_MIN, INT64_MAX]).
In addition, i is assumed to be more likely in the friendly range Range1 = [INT64_MIN/A, INT64_MAX/A] (i.e.: close to 0), however i may be (less likely) outside this range. In the first case, a trivial integer computation of i * A would not overflow (that's why we called the range friendly); and in the latter case, a trivial integer computation of i * A would overflow, leading to an erroneous result in computation of (1).
What would be the "safest" and "most efficient" way to compute operation (1) (where "safest" means: preserving exactness or at least a decent precision, and where "most efficient" means: lowest average computation time), provided i is more likely in the friendly range Range1.
At now, the solution currently implemented in the code is the following one :
(int64_t)((double)A / B * i)
which solution is quite safe (no overflow) though inaccurate (precision loss due to double significand 53 bits limitation) and quite fast because double division (double)A / B is precomputed at compile time, letting only a double multiplication to be computed at runtime.

If you cannot get better bounds on the ranges involved then you're best off following iammilind's advice to use __int128.
The reason is that otherwise you would have to implement the full logic of word to double-word multiplication and double-word by word division. The Intel and AMD processor manuals contain helpful information and ready-made code, but it gets quite involved, and using C/C++ instead of in assembler makes things doubly complicated.
All good compilers expose useful primitives as intrinsics. Microsoft's list doesn't seem to include a muldiv-like primitive but the __mul128 intrinsic gives you the two halves of the 128-bit product as two 64-bit integers. Based on that you can perform long division of two digits by one digit, where one 'digit' would be a 64-bit integer (usually called 'limb' because bigger than a digit but still only part of the whole). Still quite involved but lots better than using pure C/C++. However, portability-wise it is no better than using __int128 directly. At least that way the compiler implementers have already done all the hard work for you.
If your application domain can give you useful bounds, like that (u % d) * v will not overflow then you can use the identity
(u * v) / d = (u / d) * v + ((u % d) * v) / d
where / signifies integer division, as long as u is non-negative and d is positive (otherwise you might run afoul of the leeway allowed for the semantics of operator %).
In any case you may have to separate out the signs of the operands and use unsigned operations in order to find more useful mechanisms that you can exploit - or to circumvent sabotage by the compiler, like the saturating multiplication that you mentioned. Overflow of signed integer operations invokes undefined behaviour, compilers are free to do whatever they please. By contrast, overflow for unsigned types is well-defined.
Also, with unsigned types you can fall back on rules like that with s = a (+) b (where (+) is possibly-overflowing unsigned addition) you will have either s == a + b or s < a && s < b, which lets you detect overflow after the fact with cheap operations.
However, it is unlikely that you will get much farther on this road because the effort required quickly approaches - or even exceeds - the effort of implementing the double-limb operations I alluded to earlier. Only a thorough analysis of the application domain could give the information required for planning/deploying such shortcuts. In the general case and with the bounds you have given you're pretty much out of luck.

In order to provide a quantified answer to the question, I made a benchmark of different solutions as part of the ones proposed here in this post (thanks to comments and answers).
The benchmark measures computation time of different implementations, when i is inside the friendly range Range1 = [INT64_MIN/A, INT64_MAX/A], and when i is outside the friendly range (yet in the safe range Range2 = [INT64_MIN*B/A, INT64_MAX*B/A]).
Each implementation performs a "safe" (i.e. with no overflow) computation of the operation: i * A / B (except the 1st implementation, given as reference computation time). However, some implementations may return infrequent inaccurate computation result (which behavior is notified).
Some solutions proposed have not been tested or are not listed hereafter; these are: solution using __int128 (unsupported by ms vc compiler), yet boost int128_t has been used instead; solutions using extended 80 bits long double (unsupported by ms vc compiler); solution using InfInt (working and tested though too slow to be a decent competitor).
Time measurements are specified in ps/op (picoseconds per operation). Benchmark platform is an Intel Q6600#3GHz under Windows 7 x64, executable compiled with MS vc14, x64/Release target. The variables, constants and function referenced hereafter are defined as:
int64_t i;
const int64_t A = 1234567891;
const int64_t B = 4321987;
inline bool in_safe_range(int64_t i) { return (INT64_MIN/A <= i) && (i <= INT64_MAX/A); }
(i * A / B) [reference]
i in Range1: 1469 ps/op, i outside Range1: irrelevant (overflows)
((int64_t)((double)i * A / B))
i in Range1: 10613 ps/op, i outside Range1: 10606 ps/op
Note: infrequent inaccurate result (max error = 1 bit) in the whole range Range2
((int64_t)((double)A / B * i))
i in Range1: 1073 ps/op, i outside Range1: 1071 ps/op
Note: infrequent inaccurate result (max error = 1 bit) in the whole range Range2
Note: compiler likely precomputes (double)A / B resulting in the observed performance boost vs previous solution.
(!in_safe_range(i) ? (int64_t)((double)A / B * i) : (i * A / B))
i in Range1: 2009 ps/op, i outside Range1: 1606 ps/op
Note: rare inaccurate result (max error = 1 bit) outside Range1
((int64_t)((int128_t)i * A / B)) [boost int128_t]
i in Range1: 89924 ps/op, i outside Range1: 89289 ps/op
Note: boost int128_t performs dramatically bad on the bench platform (have no idea why)
((i / B) * A + ((i % B) * A) / B)
i in Range1: 5876 ps/op, i outside Range1: 5879 ps/op
(!in_safe_range(i) ? ((i / B) * A + ((i % B) * A) / B) : (i * A / B))
i in Range1: 1999 ps/op, i outside Range1: 6135 ps/op
Conclusion
a) If slight computation errors are acceptable in the whole range Range2, then solution (3) is the fastest one, even faster than the direct integer computation given as reference.
b) If computation errors are unacceptable in the friendly range Range1, yet acceptable outside this range, then solution (4) is the fastest one.
c) If computation errors are unacceptable in the whole range Range2, then solution (7) performs as well as solution (4) in the friendly range Range1, and remains decently fast outside this range.

I think you can detect the overflow before it happens. In your case of i * A / B, you are only worried about the i * A part because the division cannot overflow.
You can detect the overflow by performing test of bool overflow = i > INT64_MAX / A. You will have to do modify this depending on the sign of operands and result.

Some implementations permit __int128_t. Check if your implementation allows it, so that you can you may use it as placeholder instead of double. Refer below post:
Why isn't there int128_t?
If not very concerned about "fast"-ness, then for good portability I would suggest to use header only C++ library "InfInt".
It is pretty straight forward to use the library. Just create an
instance of InfInt class and start using it:
InfInt myint1 = "15432154865413186646848435184100510168404641560358";
InfInt myint2 = 156341300544608LL;
myint1 *= --myint2 - 3;
std::cout << myint1 << std::endl;

Not sure about value bounds, will (i / B) * A + (i % B) * A / B help?

Do most compilers transform % 2 into bit comparison? Is it really faster?

In programming, one often needs to check if a number is odd or even. For that, we usually use:
n % 2 == 0
However, my understanding is that the '%' operator actually performs a division and returns its remainder; therefore, for the case above, it would be faster to simply check the last bit instead. Let's say n = 5;
5 = 00000101
In order to check if the number is odd or even, we just need to check the last bit. If it's 1, the number is odd; otherwise, it is even. In programming, it would be expressed like this:
n & 1 == 0
In my understanding this would be faster than % 2 as no division is performed. A mere bit comparison is needed.
I have 2 questions then:
1) Is the second way really faster than the first (in all cases)?
2) If the answer for 1 is yes, are compilers (in all languages) smart enough to convert % 2 into a simple bit comparison? Or do we have to explicitly use the second way if we want the best performance?

Yes, a bit-test is much faster than integer division, by about a factor of 10 to 20, or even 100 for 128bit / 64bit = 64bit idiv on Intel. Esp. since x86 at least has a test instruction that sets condition flags based on the result of a bitwise AND, so you don't have to divide and then compare; the bitwise AND is the compare.
I decided to actually check the compiler output on Godbolt, and got a surprise:
It turns out that using n % 2 as a signed integer value (e.g. a return n % 2 from a function that return signed int) instead of just testing it for non-zero (if (n % 2)) sometimes produces slower code than return n & 1. This is because (-1 % 2) == -1, while (-1 & 1) == 1, so the compiler can't use a bitwise AND. Compilers still avoid integer division, though, and use some clever shift / and / add / sub sequence instead, because that's still cheaper than an integer division. (gcc and clang use different sequences.)
So if you want to return a truth value based on n % 2, your best bet is to do it with an unsigned type. This lets the compiler always optimize it to a single AND instruction. (On godbolt, you can flip to other architectures, like ARM and PowerPC, and see that the unsigned even (%) function and the int even_bit (bitwise &) function have the same asm code.)
Using a bool (which must be 0 or 1, not just any non-zero value) is another option, but the compiler will have to do extra work to return (bool) (n % 4) (or any test other than n%2). The bitwise-and version of that will be 0, 1, 2, or 3, so the compiler has to turn any non-zero value into a 1. (x86 has an efficient setcc instruction that sets a register to 0 or 1, depending on the flags, so it's still only 2 instructions instead of 1. clang/gcc use this, see aligned4_bool in the godbolt asm output.)
With any optimization level higher than -O0, gcc and clang optimize if (n%2) to what we expect. The other huge surprise is that icc 13 doesn't. I don't understand WTF icc thinks it's doing with all those branches.

The speed is equivalent.
The modulo version is generally guaranteed to work whether the integer is positive, negative or zero, regardless of the implementing language. The bitwise version is not.
Use what you feel is most readable.

Binary Addition without overflow wrap-around in C/C++

I know that when overflow occurs in C/C++, normal behavior is to wrap-around. For example, INT_MAX+1 is an overflow.
Is possible to modify this behavior, so binary addition takes place as normal addition and there is no wraparound at the end of addition operation ?
Some Code so this would make sense. Basically, this is one bit (full) added, it adds bit by bit in 32
int adder(int x, int y)
{
int sum;
for (int i = 0; i < 31; i++)
{
sum = x ^ y;
int carry = x & y;
x = sum;
y = carry << 1;
}
return sum;
}
If I try to adder(INT_MAX, 1); it actually overflows, even though, I amn't using + operator.
Thanks !

Overflow means that the result of an addition would exceed std::numeric_limits<int>::max() (back in C days, we used INT_MAX). Performing such an addition results in undefined behavior. The machine could crash and still comply with the C++ standard. Although you're more likely to get INT_MIN as a result, there's really no advantage to depending on any result at all.
The solution is to perform subtraction instead of addition, to prevent overflow and take a special case:
if ( number > std::numeric_limits< int >::max() - 1 ) { // ie number + 1 > max
// fix things so "normal" math happens, in this case saturation.
} else {
++ number;
}
Without knowing the desired result, I can't be more specific about the it. The performance impact should be minimal, as a rarely-taken branch can usually be retired in parallel with subsequent instructions without delaying them.
Edit: To simply do math without worrying about overflow or handling it yourself, use a bignum library such as GMP. It's quite portable, and usually the best on any given platform. It has C and C++ interfaces. Do not write your own assembly. The result would be unportable, suboptimal, and the interface would be your responsibility!

No, you have to add them manually to check for overflow.

What do you want the result of INT_MAX + 1 to be? You can only fit INT_MAX into an int, so if you add one to it, the result is not going to be one greater. (Edit: On common platforms such as x86 it is going to wrap to the largest negative number: -(INT_MAX+1). The only way to get bigger numbers is to use a larger variable.
Assuming int is 4-bytes (as is typical on x86 compilers) and you are executing an add instruction (in 32-bit mode), the destination register simply does overflow -- it is out of bits and can't hold a larger value. It is a limitation of the hardware.
To get around this, you can hand-code, or use an aribitrarily-sized integer library that does the following:
First perform a normal add instruction on the lowest-order words. If overflow occurs, the Carry flag is set.
For each increasingly-higher-order word, use the adc instruction, which adds the two operands as usual, but takes into account the value of the Carry flag (as a value of 1.)
You can see this for a 64-bit value here.

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead though.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.

There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's ssqrts is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.

You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2,5%. For most applications users wouldn't even notice this, except if they use a watch to measure.

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).

Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.

Here is a fast way with a look up table of only 8KB. Mistake is ~0.5% of the result. You can easily enlarge the table, thus reducing the mistake. Runs about 5 times faster than the regular sqrt()
// LUT for fast sqrt of floats. Table will be consist of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only 11 most sagnificant bits from the 23 of float. We can use 15 bits instead. It will produce less error but take more place in a memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Once initialized will be true
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
unsigned short i;
float f;
UINT32 *fi = (UINT32*)&f;
if (is_sqrttab_initialized)
return;
const int halfTableSize = (tableSize>>1);
for (i=0; i < halfTableSize; i++){
*fi = 0;
*fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
// Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
f = sqrtf(f);
sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
// Repeat the process, this time with an exponent of 1, stored as 128
*fi = 0;
*fi = (i << nUnusedBits) | (128 << 23);
f = sqrtf(f);
sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
}
is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
if (n <= 0.f)
return 0.f; // On 0 or negative return 0.
UINT32 *num = (UINT32*)&n;
short e; // Exponent
e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
*num &= 0x7fffff; // leave only the mantissa
// If the exponent is odd so we have to look it up in the second half of the lookup table, so we set the high bit.
const int halfTableSize = (tableSize>>1);
const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
if (e & 0x01)
*num |= secondHalphTableIdBit;
e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign preserving for signed operands
// Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
*num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
return n;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js