Karatsuba multiplication improvement - c++

I have implemented the Karatsuba multiplication algorithm for educational purposes. Now I am looking for further improvements. I have implemented a kind of long arithmetic, and it works well as long as I do not use a base of integer representation greater than 100.
With base 10, compiling with clang++ -O3, multiplication of two random integers in the range [10^50000, 10^50001] takes:
Naive algorithm took 1967 cycles (1.967 seconds)
Karatsuba algorithm took 400 cycles (0.4 seconds)
And the same numbers with base 100:
Naive algorithm took 409 cycles (0.409 seconds)
Karatsuba algorithm took 140 cycles (0.14 seconds)
Is there a way to improve these results?
Now I use this function to finalize my result:
void finalize(vector<int>& res) {
    // note: res must keep a spare slot at the end, or res[i + 1] runs past it
    for (size_t i = 0; i + 1 < res.size(); ++i) {
        res[i + 1] += res[i] / base; // push the carry into the next digit
        res[i] %= base;
    }
}
As you can see, at each step it computes the carry and pushes it to the next digit. And if I take base >= 1000, the intermediate values overflow.
As you can see from my code, I use a vector of int to represent a long integer; according to the base, a number is split into separate elements of the vector.
Now I see several options:
use long long as the vector's element type, but it might also overflow for integers of vast length (a sketch of this option follows after this list)
implement a representation of the carry in the long arithmetic
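A minimal sketch of what the first option could look like, assuming the digits are accumulated in a wide-enough type before normalizing (the name finalize64 is made up for this illustration, not my current code):

#include <cstddef>
#include <vector>

// Normalize digits using 64-bit carry arithmetic, so intermediate sums
// cannot overflow int even with base = 10^9. Assumes res has enough
// slots that the final carry is zero.
void finalize64(std::vector<long long>& res, long long base) {
    long long carry = 0;
    for (std::size_t i = 0; i < res.size(); ++i) {
        long long cur = res[i] + carry;
        carry = cur / base;
        res[i] = cur % base;
    }
}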
After I saw some comments, I decided to expand on the issue. Assume that we want to represent our long integer as a vector of ints. For instance:
ULLONG_MAX = 18446744073709551615
And for input we pass the 210th Fibonacci number, 34507973060837282187130139035400899082304280, which does not fit into any standard type. If we represent it in a vector of int with base 10000000, it will look like:
v[0]: 2304280
v[1]: 89908
v[2]: 1390354
v[3]: 2187130
v[4]: 6083728
v[5]: 5079730
v[6]: 34
And when we do the multiplication we may get (for simplicity, let it be two identical numbers)
(34507973060837282187130139035400899082304280)^2:
v[0] * v[0] = 5309706318400
...
v[0] * v[4] = 14018612755840
...
That was only the first row, and we have to do six more rows like it. Certainly, some step will overflow, either during the multiplication or after the carry calculation.
If I missed something, please let me know and I will change it.
If you want to see the full version, it is on my github.

Base 2^64 and base 2^32 are the most popular bases for doing high precision arithmetic. Usually, the digits are stored in an unsigned integral type, because they have well-behaved semantics with regard to overflow.
For example, one can detect the carry from an addition as follows:
uint64_t x, y; // initialize somehow
uint64_t sum = x + y;
uint64_t carry = sum < x; // 1 if true, 0 if false
Also, assembly languages usually have a few "add with carry" instructions; if you can write inline assembly (or have access to intrinsics) you can take advantage of these.
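For instance (an x86-64-specific example; GCC and Clang expose this intrinsic in <immintrin.h>, MSVC in <intrin.h>; the name add256 is made up for this sketch):
#include <immintrin.h>

// Add two 256-bit numbers held as four base-2^64 digits, least significant
// first. The intrinsic compiles down to an adc chain.
void add256(const unsigned long long a[4], const unsigned long long b[4],
            unsigned long long out[4]) {
    unsigned char carry = 0;
    for (int i = 0; i < 4; ++i)
        carry = _addcarry_u64(carry, a[i], b[i], &out[i]);
    // carry now holds the carry-out of the whole addition
}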
For multiplication, most computers have machine instructions that can compute a one machine word -> two machine word product; sometimes, the instructions to get the two halves are called "multiply hi" and "multiply low". You need to write assembly to get them, although many compilers offer larger integer types whose use would let you access these instructions: e.g. in gcc you can implement multiply hi as
uint64_t mulhi(uint64_t x, uint64_t y)
{
    return ((__uint128_t) x * y) >> 64;
}
When people can't use this, they do their multiplication in base 2^32 instead, so that they can use the same approach to implement a portable mulhi, using uint64_t as the double-digit type.
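A sketch of that portable approach, with uint64_t as the double-digit type (illustrative; the name mul_digits is an assumption, and real bignum code would likely inline this):
#include <cstdint>

// Full 32x32 -> 64-bit product of two base-2^32 digits, split into the
// "multiply hi" and "multiply low" halves.
void mul_digits(std::uint32_t x, std::uint32_t y,
                std::uint32_t& hi, std::uint32_t& lo) {
    std::uint64_t p = (std::uint64_t)x * y;
    hi = (std::uint32_t)(p >> 32);
    lo = (std::uint32_t)p;
}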
If you want to write efficient code, you really need to take advantage of these bigger multiply instructions. Multiplying digits in base 2^32 is more than ninety times more powerful than multiplying digits in base 10. Multiplying digits in base 2^64 is four times more powerful than that. And your computer can probably do these more quickly than whatever you implement for base 10 multiplication.


How to let GCC compiler turn variable-division into mul(if faster)

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV is turned into a MUL, using an array to store the multiplier values? If so, how do I get the compiler to do that optimization?
int main() {
    uint a, s=0, i, t;
    scanf("%d", &a);
    diviuint aa = a;
    t = clock();
    for (i=0; i<1000000000; i++)
        s += i/a;
    printf("Result:%10u\n", s);
    printf("Time:%12u\n", clock()-t);
    return 0;
}
where diviuint(a) stores a precomputed form of 1/a and uses multiplication instead.
Using s += i/aa makes it twice as fast as s += i/a.
You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is worth the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
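To make the idea concrete, here is a minimal sketch of one simple scheme (an illustration, not the exact magic-number math compilers emit; they avoid the correction step with more careful constants, and the name RuntimeDivider is made up here): precompute floor(2^32 / d) once, then divide with one multiply, one shift, and at most one fix-up.

#include <cstdint>

struct RuntimeDivider {
    std::uint64_t m;   // floor(2^32 / d); requires d != 0
    std::uint32_t d;
    explicit RuntimeDivider(std::uint32_t d_) : m((std::uint64_t(1) << 32) / d_), d(d_) {}
    std::uint32_t divide(std::uint32_t n) const {
        // q underestimates n/d by at most 1, so a single correction suffices
        std::uint32_t q = (std::uint32_t)((n * m) >> 32);
        if (n - q * d >= d) ++q;
        return q;
    }
};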
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible.

It's off-topic for this question, but worth saying anyway: look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing faster.

Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form, simple, constant-time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a;
uint32_t s = a * n*(n+1)/2 + 1000000000 % a;
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
    // use i_over_a
    ++remainder;
    if (remainder == a) { // if you don't need the remainder in the loop, it could save an insn or two to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you need a clever variable name other than remainder.
        remainder = 0;
        ++i_over_a;
    }
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large, so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture.

Branchless asm is probably possible but not worth it, e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing (x86 pseudo-code: cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0; the b (below) condition is just CF==1, same as the c (carry) condition). The branchless asm would be simplified by decrementing from a to 0: no zeroed register is needed for the cmov, and a could stay in a register instead of a-1.
Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know the range, so the usual tricks will not work.
Basically you need to handle both a and b anywhere in the range from INT_MIN to INT_MAX. There's no room left for scaling them up or down. Even if you wanted to extend them to larger types, it would probably take longer just to invert b and to check that the result is consistent.
The only way to KNOW whether div or mul is faster is to benchmark both [obviously, if you use your code above, you'd mostly measure the time of reading/writing the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster; on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's feasible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them, and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations; modern CPUs are pretty good at "hiding" latency as long as there are other instructions to execute [so you should test this in as realistic a scenario as possible].
There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table, for instance one whose entries are the multiplicative inverses scaled by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming, and it requires a 1024-byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.
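A sketch of that scaled-table idea, and of why it is only approximate (the names inv_tab and approx_div are made up for this illustration):
#include <cstdint>

// 256 entries * 4 bytes = the 1024-byte table mentioned above.
// inv_tab[d] approximates 2^16 / d, so "n / d" becomes multiply + shift.
static std::uint32_t inv_tab[256];

void build_inv_tab(void) {
    for (std::uint32_t d = 1; d < 256; ++d)
        inv_tab[d] = ((1u << 16) + d - 1) / d; // ceil(2^16 / d)
}

std::uint32_t approx_div(std::uint32_t n, std::uint8_t d) {
    // NOT the same as n / d in general, as argued above.
    return (std::uint32_t)(((std::uint64_t)n * inv_tab[d]) >> 16);
}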

Consistent hashing SHA1 modulo operation

I hope some guru here can help me out.
I am writing C/C++ code to implement consistent hashing, using SHA1 as the hashing algorithm.
I need to implement the modulo operation as follows:
0100 0011....0110 100 mod 0010 1001 = ?
If the divisor (0000 1111) is a power of 2, pow(2,n), then it would be easy, as the last n bits of the dividend are the result.
A SHA1 value is 160 bits long (or 40 hex digits). My question is: how do I implement a modulo operation of a long bit string by another arbitrary bit string?
Thank you.
As user stark points out (more rudely than necessary, I think) in comments, this is just long division in base 2^32. Or in base 2, with more digits. And you can use the algorithm for long division that you learned in school for base 10.
There are fancier algorithms for doing division on much bigger numbers, but for your application I suspect you can afford to do it very simply and inefficiently, along the following lines (this is the base-2 version, which just amounts to subtracting off left-shifted versions of the denominator until you can no longer do so):
// Treat x,y as 160-bit numbers, where [0] is least significant and
// [4] is most significant. Compute x mod y, and put the result in out.
void mod160(unsigned int out[5], const unsigned int x[5], const unsigned int y[5]) {
    unsigned int temp[5];
    copy160(out, x);
    copy160(temp, y);
    int n=0;
    // Find first 1-bit in quotient.
    while (lessthanhalf160(temp, x)) { lshift160(temp); ++n; }
    while (n>=0) {
        if (!less160(out, temp)) sub160(out, temp); // quotient bit is 1
        rshift160(temp); --n; // proceed to next bit of quotient
    }
}
For the avoidance of doubt, the above is only a crude sketch:
It may be full of bugs.
I haven't written the implementations of the building-block functions like less160 (rough sketches of a few of them follow below).
Actually you'd probably just put the code for those inline rather than having separate functions. E.g., copy160 is just five assignments, or a short loop, or a memcpy.
It could surely be made more efficient; e.g., that first step might do better to count leading 0-bits and then do a single shift, instead of shifting one place at a time. (The right-shifting probably doesn't want to do that, though, because half the time you will be doing only a single shift.)
The "base-2^32" version may well be faster, but the implementation will be a bit more complicated.

Represent Integers with 2000 or more digits [duplicate]

I would like to write a program which could compute integers having more than 2000 or 20000 digits (for Pi's decimals). I would like to do it in C++, without any libraries! (No big integer, boost, ...). Can anyone suggest a way of doing it? Here are my thoughts:
using const char*, for holding the integer's digits;
representing the number like
( (1 * 10 + x) * 10 + x )...
The obvious answer works along these lines:
class integer {
    bool negative;
    std::vector<std::uint64_t> data;
};
Where the number is represented as a sign bit and an (unsigned) base 2**64 value.
This means the absolute value of your number is:
data[0] + (data[1] << 64) + (data[2] << 128) + ....
Or, in other terms you represent your number as a little-endian bitstring with words as large as your target machine can reasonably work with. I chose 64 bit integers, as you can minimize the number of individual word operations this way (on a x64 machine).
To implement Addition, you use a concept you have learned in elementary school:
a b
+ x y
------------------
(a+x+carry) (b+y reduced to one digit length)
The reduction (modulo 2**64) happens automatically, and the carry can only ever be either zero or one. All that remains is to detect a carry, which is simple:
bool next_carry = false;
if((x += y) < y) next_carry = true;       // unsigned wrap-around means a carry occurred
if(prev_carry && !++x) next_carry = true; // propagating the previous carry can also wrap
Subtraction can be implemented similarly using a borrow instead.
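Putting the carry logic together, a full-number addition might look like this sketch (assuming, for brevity, that both operands have the same number of words):
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint64_t> add(std::vector<std::uint64_t> a,
                               const std::vector<std::uint64_t>& b) {
    bool carry = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        bool next_carry = false;
        if ((a[i] += b[i]) < b[i]) next_carry = true; // wrapped on the add
        if (carry && !++a[i]) next_carry = true;      // wrapped on the carry-in
        carry = next_carry;
    }
    if (carry) a.push_back(1); // the number grew by one word
    return a;
}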
Note that getting anywhere close to the performance of e.g. libgmp is... unlikely.
A long integer is usually represented by a sequence of digits (see positional notation). For convenience, use the little-endian convention: A[0] is the lowest digit, A[n-1] is the highest one. In the general case your number is equal to sum(A[i] * base^i) for some value of base.
The simplest value for base is ten, but it is not efficient. If you want to print your answer to the user often, you'd better use a power of ten as the base. For instance, you can use base = 10^9 and store each digit in an int32 type. If you want maximal speed, then better use a power-of-two base. For instance, base = 2^32 is the best possible base for a 32-bit compiler (however, you'll need assembly to make it work optimally).
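The power-of-ten convenience in a sketch: with base = 10^9 every limb maps to exactly 9 decimal digits, so printing is just zero-padding (print_base1e9 is a hypothetical helper, assuming the little-endian layout above and no leading zero limbs):

#include <cstdio>
#include <vector>

void print_base1e9(const std::vector<int>& a) {
    std::printf("%d", a.back());          // top limb: no padding
    for (int i = (int)a.size() - 2; i >= 0; --i)
        std::printf("%09d", a[i]);        // lower limbs: pad to 9 digits
    std::printf("\n");
}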
There are two ways to represent negative integers. The first one is to store the integer as a sign + a digit sequence; in this case you'll have to handle all the different-sign cases yourself. The other option is to use a complement form; it can be used for both power-of-two and power-of-ten bases.
Since the length of the sequence may vary, you'd better store the digit sequence in a std::vector. Do not forget to remove leading zeroes in this case. An alternative solution would be to always store a fixed number of digits (a fixed-size array).
The operations are implemented in pretty straightforward way: just as you did them in school =)
P.S. Alternatively, each integer (of bounded length) can be represented by its remainders modulo a set of different primes, thanks to the CRT. Such a representation supports only a limited set of operations, and requires a nontrivial conversion if you want to print it.

Sorting floating numbers

I have a list of normal vectors and I am calculating the scalar triple product and sorting them. I have compared the sorting in three different cases:
Using Matlab sort to find the largest absolute triple product values
Using the std::sort function in C++ to sort the product values stored in a std::vector, using doubles for the triple products.
Using radix sort in OpenCL C, converting the absolute floating-point values to unsigned integers and converting them back, using cl_float for the triple products.
All of them give different values, along with different indices, which causes problems in my algorithm. What is the problem in this case, and how can I keep them consistent?
The problem at hand:
Calculate the scalar triple product of three 3-dimensional vectors, with each component of each vector represented as a binary32 float.
Be able to tell whether one result of that calculation is greater than another result of that calculation.
So far so good, but if we directly apply the formulae to the vectors, some bits may be lost in the operations and we will be unable to distinguish two results. As @rcgldr pointed out, sorting is not the problem; the precision is.
One solution to floating point roundoff problems is to increase the number of bits, that is, use double. You said you have no double, so let's do it ourselves: perform the whole calculation in an array of unsigned char as long as we need.
Okay, okay, practical considerations:
The input is made of normalized vectors, so the length is no greater than one, which implies no component is greater than one.
The exponent of a binary32 float ranges from -127 (zero, denormals) to 127 (or 128 for infinity), but all components will have exponents from -127 to 0 (or else they would be greater than one).
The maximum precision of the input is 24 bits.
The scalar triple product involves a vector product and a scalar product. In the vector product (which happens first) there is subtraction of results of multiplications, and in the scalar product there is a sum of results of multiplications.
Considerations 2 and 3 tell us that the whole family of input components can fit in a fixed-point format of 127 bits for offsetting plus 24 bits for the mantissa; that's 19 bytes. Let's make it 24.
To fully represent all possible sums and subtractions, one extra bit suffices (in the event of a carry-over), but to fully represent all possible multiplication results, we need double the number of bits. So it follows that doubling the size is enough to represent the vector multiplication, and tripling it makes it enough for the next multiplication in the scalar product.
Here is a draft of a class that loads a float into that very huge fixed-point format, keeping the sign as a bool flag (there is a helper function rollArrayRight() that I'll post separately, but hopefully its name explains itself):
const size_t initialSize=24;
const size_t sizeForMult1=initialSize+initialSize;
const size_t finalSize=sizeForMult1+initialSize;
class imSoHuge{
public:
    bool isNegative;
    uint8_t r[finalSize];
    void load(float v){
        isNegative=false;
        for(size_t p=0;p<finalSize;p++)r[p]=0;
        union{
            uint8_t b[4];
            uint32_t u;
            float f;
        } reunion;
        reunion.f=v;
        if((reunion.b[3]&0x80) != 0x00)isNegative=true;
        uint32_t m, eu;
        eu=reunion.u<<1; //get rid of the sign
        eu>>=24;
        m=reunion.u&(0x007fffff);
        if(eu==0){ //zero or denormal
            if(m==0)return; //zero
        }else{
            m|=(0x00800000); //implicit leading one if it's not denormal
        }
        int32_t e=(int32_t)eu-127; //exponent is now in [e], debiased
        reunion.u=m;
        r[finalSize-1]=reunion.b[3];
        r[finalSize-2]=reunion.b[2];
        r[finalSize-3]=reunion.b[1];
        r[finalSize-4]=reunion.b[0];
        rollArrayRight(r, finalSize, e-(sizeForMult1*8)); //correct position for fixed-point
    }
    explicit imSoHuge(float v){
        load(v);
    }
};
When the class is constructed with the number 1.0f, for example, the array r holds something like 00 00 00 00 80 00; notice that it is loaded into the lower part of the array, and the multiplications will ~roll~ the number into the upper bytes accordingly; we can then recover our float.
To make it useful, we need to implement the equivalent of sum and multiplication; very straightforward, as long as we remember we can only sum arrays that have been multiplied the same number of times (as in the triple product), or else their magnitudes would not match.
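To give an idea of the shape of it, a schoolbook multiply over the byte arrays could look like the sketch below. This is illustrative only: it skips the fixed-point repositioning that the real implementation must do with rollArrayRight():
imSoHuge operator*(const imSoHuge& a, const imSoHuge& b){
    imSoHuge out(0.0f);
    out.isNegative=(a.isNegative!=b.isNegative);
    for(size_t i=0;i<finalSize;i++){
        uint16_t carry=0;
        for(size_t j=0;i+j<finalSize;j++){
            //255*255 + 255 + 255 = 65535, so uint16_t can never overflow here
            uint16_t cur=(uint16_t)(a.r[i]*b.r[j])+out.r[i+j]+carry;
            out.r[i+j]=(uint8_t)cur;
            carry=(uint16_t)(cur>>8);
        }
    }
    return out;
}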
One example where such class would make a difference:
Consider the following 3 vectors:
float a[]={0.0097905760, 0.0223784577, 0.9997016787};
float b[]={0.8248013854, 0.4413521587, 0.3534274995};
float c[]={0.4152690768, 0.3959976136, 0.8189856410};
And the following function that calculates the triple product: (hope I've got it right haha)
float fTripleProduct(float*a, float*b, float*c){
    float crossAB[3];
    crossAB[0]=(a[1]*b[2])-(a[2]*b[1]);
    crossAB[1]=(a[2]*b[0])-(a[0]*b[2]);
    crossAB[2]=(a[0]*b[1])-(a[1]*b[0]);
    float tripleP=(crossAB[0]*c[0])+(crossAB[1]*c[1])+(crossAB[2]*c[2]);
    return tripleP;
}
The result of fTripleProduct(a,b,c); is 0.1336331.
If we change the last digit of the first component of a from 0 to 6, making it 0.0097905766 (which has a different hexadecimal representation), and call the function again, the result is the same, but we know it should be greater.
Now consider we have implemented the multiplication, sum, and subtraction for the imSoHuge class and have a function to calculate the triple product using it
imSoHuge tripleProduct(float*a, float*b, float*c){
    imSoHuge crossAB[3];
    crossAB[0]=(imSoHuge(a[1])*imSoHuge(b[2]))-(imSoHuge(a[2])*imSoHuge(b[1]));
    crossAB[1]=(imSoHuge(a[2])*imSoHuge(b[0]))-(imSoHuge(a[0])*imSoHuge(b[2]));
    crossAB[2]=(imSoHuge(a[0])*imSoHuge(b[1]))-(imSoHuge(a[1])*imSoHuge(b[0]));
    imSoHuge tripleP=(crossAB[0]*imSoHuge(c[0]))+(crossAB[1]*imSoHuge(c[1]))+(crossAB[2]*imSoHuge(c[2]));
    return tripleP;
}
If we call that function for the two above versions of the vectors, the results in the array differ:
0 0 0 4 46 b9 4 69 39 3f 53 b8 19 e0 ...
0 0 0 4 46 b9 4 85 93 82 df ba 7d 80 ...
And they differ beyond the precision of a binary32 float indeed, meaning that if we cast those arrays back to float we get the same float, but if we compare the arrays we can tell which one is greater.
To put that reasoning to the test, I've made a fully working example that you can compile and run right away with -O3 -Wall -std=c++11 in GCC, or the equivalent on another compiler; it will output:
Using class: second result is greater
casting to float:
first result: 1.336331e-001
second result: 1.336331e-001
as floats, the results are the same: 1.336331e-001
The source code is here (working fine on Ideone):
Source Code on IDEONE C++11 code
If you have not migrated to C++11, the code compiles and runs in C++98 if you define the exact-width types uint8_t, uint16_t, uint32_t, int32_t yourself.
How to use it?
Simply call the function tripleProduct with your inputs and compare the results using the provided overloaded comparison operators; you can also cast the class imSoHuge to float (after the calculation of the triple product) using the provided overloaded cast operator.
You can provide an array of that class and comparators to any sorting algorithm.
Conclusions and considerations:
Notice that a float multiplication is now performed as a multiplication of two 70+ byte long arrays; that means hundreds of times more clock cycles, plus the sums, comparisons, etc. This shall be thousands of times slower but, hey, it's exact.
The whole design of the algorithm is to work with normalized vectors (there is some room here, as I don't know the precision of your normalization procedure), but it will all overflow and be meaningless with most greater-than-one vectors.
You can easily cap the array of the result at as many bytes as you wish, if keeping the whole array in memory is too much. Very few cases will produce results that diverge after ~12 bytes.
I haven't stress-tested everything, like denormals and corner cases; there are comments in the code at the critical points.
and of course:
You can easily improve everything, I was just willing to share the reasoning =)
Source code again
Main reference:
Single-precision floating-point format (Wikipedia)

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
The MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt itself is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles does SQRT take these days, anyway?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative; re-implement it with fewer iterations (see the sketch after this list).
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
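As a sketch of the first point, here is the classic Babylonian (Newton-Raphson) iteration with a deliberately small, fixed iteration count; the name approx_sqrt and the starting guess are assumptions, and accuracy depends heavily on both:
// Crude: tune the iteration count to your accuracy needs.
float approx_sqrt(float x, int iterations){
    float guess = x * 0.5f + 0.5f;           // rough, nonzero starting point
    for (int i = 0; i < iterations; ++i)
        guess = 0.5f * (guess + x / guess);  // one Newton-Raphson step
    return guess;
}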
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this: the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray. It is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.
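For example, with plain SSE, which any x86-64 machine supports (sqrt4 is a made-up name for this sketch):
#include <xmmintrin.h>

// Four single-precision square roots in one instruction.
void sqrt4(const float in[4], float out[4]){
    __m128 v = _mm_loadu_ps(in);         // unaligned load of 4 floats
    _mm_storeu_ps(out, _mm_sqrt_ps(v));  // sqrtps
}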
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's sqrtss is about 2x faster than the FPU's fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: try to call sqrt() less often instead of making the calls faster. (And if you think this isn't possible: the improvements for sqrt() you mention are just that, improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications, users wouldn't even notice this, except if they use a stopwatch to measure.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
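The technique itself (a magic-constant initial estimate plus one Newton-Raphson step) is well documented, so here is a clean-room sketch of the idea, avoiding the GPL'ed code; fast_rsqrt is a made-up name:
#include <cstdint>
#include <cstring>

float fast_rsqrt(float x){
    float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);      // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);          // magic initial estimate
    std::memcpy(&x, &i, sizeof x);
    return x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement
}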
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is to replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
#include <math.h> /* for sqrtf */
/* Assumes UINT32, TRUE, FALSE are defined (e.g. Windows-style typedefs). */

// LUT for fast sqrt of floats. The table consists of 2 halves: one for sqrt(X) and one for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only the 11 most significant bits of the float's 23. Using 15 bits instead produces less error but takes more space in memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits * 2, because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Set to TRUE once initialized
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;
    if (is_sqrttab_initialized)
        return;
    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
        *fi = 0;
        *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
        // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
        f = sqrtf(f);
        sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
        // Repeat the process, this time with an exponent of 1, stored as 128
        *fi = 0;
        *fi = (i << nUnusedBits) | (128 << 23);
        f = sqrtf(f);
        sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of the float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f)
        return 0.f; // On 0 or negative return 0.
    UINT32 *num = (UINT32*)&n;
    short e; // Exponent
    e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;       // Leave only the mantissa
    // If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalfTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01)
        *num |= secondHalfTableIdBit;
    e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign-preserving for signed operands)
    // Do the table lookup, based on the quantized mantissa, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}