Hash function: Is there a way to optimize my code further? - c++

The hash function is the usual polynomial hash: hash(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], reduced mod 2^32.
I wrote the code below, but I am not sure whether there is another clever way to make it more efficient. My understanding is that I do not need to do the mod explicitly, since unsigned int overflow takes care of it.
int myHash(string s)
{
    unsigned int hash = 0;
    long long int multiplier = 1;
    for (int i = s.size() - 1; i > -1; i--)
    {
        hash += (multiplier * s[i]);
        multiplier *= 31;
    }
    return hash;
}

I would avoid using long long for multiplier, at least unless you know for certain that your processor does 64-bit multiplies as fast as 32-bit ones. Really modern top-of-the-range processors probably do; older and smaller processors almost certainly take longer for 64-bit mul operations than for 32-bit ones.
Multiplying by 31 can actually be quite fast even on processors that aren't good at multiplying, because x *= 31 can be converted to x = x * 32 - x, i.e. x = (x << 5) - x. In fact it may be worth trying that [if you haven't compiled the code to assembler and seen that the compiler already does it].
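For illustration, a minimal sketch of the same loop with a plain 32-bit multiplier and the shift-and-subtract written out explicitly (it produces the same hash values as the original; check the generated assembly first, since the compiler may well do this transformation on its own):
int myHash(string s)
{
    unsigned int hash = 0;
    unsigned int multiplier = 1;                      // 32-bit is enough: unsigned overflow wraps mod 2^32
    for (int i = s.size() - 1; i > -1; i--)
    {
        hash += multiplier * s[i];
        multiplier = (multiplier << 5) - multiplier;  // same as multiplier *= 31
    }
    return hash;
}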
Beyond that, it is processor- or compiler-specific optimisations that come to mind. Loop unrolling, for example. Or using inline assembler or intrinsics to make use of vector instructions (subject to availability on different processor architectures and generations). Modern compilers like recent versions of gcc or clang will probably vectorize this code, provided they are given the "right" options.
As with all optimisation projects: measure the time with a representative workload, and keep records of what you changed. Look at the generated code and try to figure out if there's a better way to do it. And don't lose track of the fact that it's the OVERALL program's performance that matters. If you spend 80% of the time in this function, by all means optimize the heck out of it. If you spend 20% of the time in it, optimize it a bit; if you spend 2% of the time in it, then unless there are OBVIOUS things you can do to improve it, it's not going to gain you much. I've seen people write code to save a few clock cycles right next to a loop that takes several million cycles two lines further on, and use bit-fiddling tricks to save 2 bytes in something that occupies half a megabyte. It just creates a mess; it's not really worth doing.

I guess you could make the argument not to copy the string for the function call: make the parameter const string &s instead, or use std::string_view if you happen to be using C++17. Otherwise it looks fast to the point where you should leave the rest to the compiler. Try compiling with -O2 or your compiler's equivalent.
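A minimal sketch of those two suggestions, assuming C++17 for the string_view variant (a const std::string& parameter is the pre-C++17 equivalent); the computation itself is unchanged:
#include <string_view>

unsigned int myHash(std::string_view s)   // no copy of the argument is made
{
    unsigned int hash = 0, multiplier = 1;
    for (auto i = s.size(); i-- > 0; )    // walk the string from the back, as before
    {
        hash += multiplier * s[i];
        multiplier *= 31;
    }
    return hash;
}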

Let me preface this by saying it's probably not worth doing -- it's unlikely that your hash function is the bottleneck in your program, so making it more elaborate in an attempt to speed it up will probably just make it harder to understand and maintain without making your program measurably faster. So don't do this unless you've actually determined that your program spends a significant percentage of its time computing string hashes, and make sure you have a good benchmark that you can run "before" and "after" the change to verify that it actually sped things up significantly; otherwise you might just be chasing rainbows.
That said, one potential way to hash long strings more quickly would be to process the string a word at a time rather than a character at a time, something like this:
unsigned int aSlightlyFasterHash(const string & s)
{
    const unsigned int numWordsInString      = s.size() / sizeof(unsigned int);
    const unsigned int numExtraBytesInString = s.size() % sizeof(unsigned int);

    // Compute the bulk of the hash by reading the string a word at a time
    unsigned int hash = 0;
    const unsigned int * iptr = reinterpret_cast<const unsigned int *>(s.c_str());
    for (unsigned int i = 0; i < numWordsInString; i++)
    {
        hash += *iptr;
        iptr++;
    }

    // Then any "leftover" bytes at the end we will mix in to the hash the old way
    const unsigned char * cptr = reinterpret_cast<const unsigned char *>(iptr);
    unsigned int multiplier = 1;
    for (unsigned int i = 0; i < numExtraBytesInString; i++)
    {
        hash += (multiplier * *cptr);
        cptr++;
        multiplier *= 31;
    }
    return hash;
}
Note that the above function will return different hash values than the hash function you provided.
That cuts the number of loop iterations down by a factor of four; of course it's likely that the function is limited by RAM bandwidth rather than CPU cycles anyway, so don't be too surprised if this doesn't go noticeably faster on a modern CPU. If RAM bandwidth is indeed the bottleneck, then there's not much you can do about it, since you have to read the contents of the string in order to compute a hash code for it; there's no getting around that (except perhaps by precomputing the hash code in advance and storing it somewhere, but that only works if you know all the strings you are going to use in advance).

How to force pow(float, int) to return float

The overloaded function float pow(float base, int iexp) was removed in C++11 and now pow returns a double. In my program I am computing lots of these (in single precision) and I am interested in the most efficient way to do it.
Is there some special function (in standard libraries or any other) with the above signature?
If not, is it better (in terms of single-precision performance) to explicitly cast the result of pow to float before any other operations (otherwise everything else would be promoted to double), or to cast iexp to float and use the overload float pow(float base, float exp)?
EDIT: Why I need float and do not use double?
The primary reason is RAM -- I need tens or hundreds of GB, so this reduction is a huge advantage. So from a float I need to get a float. And now I need the most efficient way to achieve that (fewer casts, use of already optimized algorithms, etc.).
You could easily write your own fpow using exponentiation by squaring.
float my_fpow(float base, unsigned exp)
{
    float result = 1.f;
    while (exp)
    {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
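A short usage sketch, assuming my_fpow from above is in scope, comparing it against std::pow:
#include <cmath>
#include <iostream>

int main()
{
    float base = 1.5f;
    unsigned exp = 12;
    std::cout << "my_fpow:  " << my_fpow(base, exp) << '\n'                          // exponentiation by squaring in float
              << "std::pow: " << std::pow(base, static_cast<int>(exp)) << '\n';      // library version (computed in double)
}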
Boring part:
This algorithm gives the best accuracy that can be achieved with the float type when |base| > 1.
Proof:
Suppose we want to calculate pow(a, n), where a is the base and n is the exponent.
Let's define b_1 = a^1, b_2 = a^2, b_3 = a^4, b_4 = a^8, and so on.
Then a^n is the product of all those b_i for which the i-th bit is set in n.
So we have an ordered set B = {b_k1, b_k2, ..., b_km}, where for every j the bit k_j is set in n.
The following obvious algorithm A can be used to minimize the rounding error:
If B contains a single element, it is the result.
Pick the two elements p and q from B with the smallest modulus.
Remove them from B.
Calculate the product s = p*q and put it back into B.
Go to the first step.
Now, let's prove that the elements of B can simply be multiplied from left to right without losing accuracy. It comes from the fact that
b_j > b_1 * b_2 * ... * b_(j-1),
because b_j = b_(j-1) * b_(j-1) = b_(j-1) * b_(j-2) * b_(j-2) = ... = b_(j-1) * b_(j-2) * ... * b_1 * b_1.
Since b_1 = a^1 = a and its modulus is greater than one, it follows that
b_j > b_1 * b_2 * ... * b_(j-1).
Hence we may conclude that, during multiplication from left to right, the accumulator variable is smaller than any remaining element of B.
Then the expression result *= base; (except on the very first iteration, of course) multiplies the two smallest numbers currently in B, so the rounding error is minimal. So the code does employ algorithm A.
Another question that can only be honestly answered with "wrong question". Or at least: "Are you really willing to go there?". float theoretically needs ca. 80% less die space (for the same number of cycles) and so can be much cheaper for bulk processing. GPUs love float for this reason.
However, let's look at x86 (admittedly, you didn't say what architecture you're on, so I picked the most common). The price in die space has already been paid. You literally gain nothing by using float for calculations. Actually, you may even lose throughput because additional extensions from float to double are required, and additional rounding to intermediate float precision. In other words, you pay extra to have a less accurate result. This is typically something to avoid except maybe when you need maximum compatibility with some other program.
See Jens' comment as well: the compiler options discussed there give the compiler permission to disregard some language rules to achieve higher performance. Needless to say, this can sometimes backfire.
There are two scenarios where float might be more efficient, on x86:
GPU (including GPGPU), in fact many GPUs don't even support double and if they do, it's usually much slower. Yet, you will only notice when doing very many calculations of this sort.
CPU SIMD aka vectorization
You'd know if you did GPGPU. Explicit vectorization using compiler intrinsics is also a choice - one you could make, for sure, but this requires quite a cost-benefit analysis. Possibly your compiler is able to auto-vectorize some loops, but this is usually limited to "obvious" applications, such as multiplying each number in a vector<float> by another float, and this case is not so obvious IMO. Even if you pow each number in such a vector by the same int, the compiler may not be smart enough to vectorize this effectively, especially if pow resides in another translation unit and there is no effective link-time code generation.
If you are not ready to consider changing the whole structure of your program to allow effective use of SIMD (including GPGPU), and you're not on an architecture where float is indeed much cheaper by default, I suggest you stick with double by all means, and consider float at best a storage format that may be useful to conserve RAM, or to improve cache locality (when you have a lot of them). Even then, measuring is an excellent idea.
That said, you could try ivaigult's algorithm (only with double for the intermediate and for the result), which is related to a classical algorithm called Egyptian multiplication (and a variety of other names), only that the operands are multiplied and not added. I don't know how pow(double, double) works exactly, but it is conceivable that this algorithm could be faster in some cases. Again, you should be OCD about benchmarking.
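A hedged sketch of that idea: the same exponentiation-by-squaring loop, but with the intermediates kept in double and only the final result narrowed to float. Whether this actually beats std::pow has to be measured; the name fpow_via_double is just for illustration.
// Exponentiation by squaring with double intermediates, float result.
float fpow_via_double(float base, unsigned exp)
{
    double b = base;        // widen once
    double result = 1.0;
    while (exp)
    {
        if (exp & 1)
            result *= b;
        exp >>= 1;
        b *= b;
    }
    return static_cast<float>(result);  // narrow once at the end
}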
If you're targeting GCC you can try
float __builtin_powif(float, int)
I have no idea about its performance, though.
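A minimal usage sketch, assuming a GCC-compatible compiler that provides this builtin:
#include <iostream>

int main()
{
    float x = 1.5f;
    float y = __builtin_powif(x, 12);   // float base, int exponent, float result (GCC builtin, not standard C++)
    std::cout << y << '\n';
}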
Is there some special function (in standard libraries or any other) with the above signature?
Unfortunately, not that I know of.
But, as many have already mentioned, benchmarking is necessary to understand whether there is even an issue at all.
I've assembled a quick benchmark online. Benchmark code:
#include <iostream>
#include <vector>
#include <cmath>
#include <boost/timer/timer.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>

int main()
{
    boost::random::mt19937 gen;
    boost::random::uniform_real_distribution<> dist(0, 10000000);

    const size_t size = 10000000;
    std::vector<float> bases(size);
    std::vector<float> fexp(size);
    std::vector<int>   iexp(size);
    std::vector<float> res(size);

    for (size_t i = 0; i < size; i++)
    {
        bases[i] = dist(gen);
        iexp[i]  = std::floor(dist(gen));
        fexp[i]  = iexp[i];
    }

    std::cout << "float pow(float, int):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for (size_t i = 0; i < size; i++)
            res[i] = std::pow(bases[i], iexp[i]);
    }

    std::cout << "float pow(float, float):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for (size_t i = 0; i < size; i++)
            res[i] = std::pow(bases[i], fexp[i]);
    }
    return 0;
}
Benchmark results (quick conclusions):
gcc: c++11 is consistently faster than c++03.
clang: indeed the int version under c++03 seems a little faster. I'm not sure if it is within the margin of error, since I only ran the benchmark online.
Both: even with c++11, calling pow with an int exponent seems to be a tad more performant.
It would be great if others could verify if this holds for their configurations as well.
Try using powf() instead. This is a C99 function that should also be available in C++11.

Tail call optimisation seems to slightly worsen performance

In a quicksort implementation, the data on the left is for plain -O2 optimized code, and the data on the right is for -O2 optimized code with the -fno-optimize-sibling-calls flag turned on, i.e. with tail-call optimisation turned off. This is the average of 3 different runs; variation seemed negligible. The values were in the range 1-1000, time in milliseconds. The compiler was MinGW g++, version 6.3.0.
size of array    with TLO (ms)    without TLO (ms)
8M               35,083           34,051
4M               8,952            8,627
1M               613              609
Below is my code:
#include <bits/stdc++.h>
using namespace std;

int N = 4000000;

void qsort(int* arr, int start = 0, int finish = N - 1) {
    if (start >= finish) return;
    int i = start + 1, j = finish, temp;
    auto pivot = arr[start];
    while (i != j) {
        while (arr[j] >= pivot && j > i) --j;
        while (arr[i] < pivot && i < j) ++i;
        if (i == j) break;
        temp = arr[i]; arr[i] = arr[j]; arr[j] = temp; // swap big guy to right side
    }
    if (arr[i] >= arr[start]) --i;
    temp = arr[start]; arr[start] = arr[i]; arr[i] = temp; // swap pivot
    qsort(arr, start, i - 1);
    qsort(arr, i + 1, finish);
}

int main() {
    srand(time(NULL));
    int* arr = new int[N];
    for (int i = 0; i < N; i++) { arr[i] = rand() % 1000 + 1; }
    auto start = clock();
    qsort(arr);
    cout << (clock() - start) << endl;
    return 0;
}
I heard clock() isn't the perfect way to measure time, but this effect seems to be consistent.
EDIT: in response to a comment, I guess my question is: explain how exactly gcc's tail-call optimizer works, how this happened, and what I should do to leverage tail calls to speed up my program.
On speed:
As already pointed out in the comments, the primary goal of tail-call-optimization is to reduce the usage of the stack.
However, there is often a side effect: the program becomes faster because the overhead of a function call disappears. This gain is most prominent if the work done inside the function is small, so that the call overhead carries some weight.
If a lot of work is done during a function call, the overhead can be neglected and there is no noticeable speed-up.
On the other hand, if tail call optimization is done, that means that potentially other optimizations cannot be done, which could otherwise speed up your code.
The case of your quicksort is not that clear cut: there are some calls with a lot of workload and a lot of calls with a very small workload.
So, for 1M elements there are more disadvantages than advantages from tail-call optimization. On my machine the tail-call-optimized function becomes faster than the non-optimized one for arrays smaller than 50000 elements.
I must confess I cannot say why this is the case just from looking at the assembly. All I can tell is that the resulting assemblies are pretty different and that the quicksort is really called only once in the optimized version.
There is a clear cut example, for which tail-call-optimization is much faster (because there is not very much happening in the function itself and the overhead is noticeable):
//fib.cpp
#include <iostream>

unsigned long long int fib(unsigned long long int n) {
    if (n == 0 || n == 1)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

int main() {
    unsigned long long int N;
    std::cin >> N;
    std::cout << fib(N);
}
Running echo "40" | ./fib, I get 1.1 vs. 1.6 seconds for the tail-call-optimized version vs. the non-optimized version. Actually, I'm pretty impressed that the compiler is able to use tail-call optimization here - but it really does, as can be seen at godbolt.org - the second call of fib is optimized.
On tail call optimization:
Usually, tail-call optimization can be done if the recursive call is the last operation (prior to the return) in the function - the variables on the stack can be reused for the next call, i.e. the function should be of the form
ResType f(InputType input) {
    //do work
    InputType new_input = ...;
    return f(new_input);
}
There are some languages which don't do tail call optimization at all (e.g. Python) and some for which you can explicitly ask the compiler to do it, and the compiler will fail if it is not able to (e.g. Clojure). C++ goes a middle way: the compiler tries its best (which is amazingly good!), but you have no guarantee it will succeed, and if it doesn't, it silently falls back to a version without tail-call optimization.
Let's take a look at this simple and standard implementation of tail call recursion:
//should be called as fac(n, 1)
unsigned long long int
fac(unsigned long long int n, unsigned long long int res_so_far) {
    if (n == 0)
        return res_so_far;
    return fac(n - 1, res_so_far * n);
}
This classical form of tail call makes it easy for the compiler to optimize: see the result here - no recursive call to fac!
However, the gcc compiler is able to perform TCO in less obvious cases as well:
unsigned long long int
fac(unsigned long long int n) {
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}
It is easier for us humans to read and write, but harder for the compiler to optimize (fun fact: TCO is not performed if the return type is int instead of unsigned long long int): after all, the result of the recursive call is used for further calculation (the multiplication) before it is returned. But gcc manages to perform TCO here as well!
With this example we can see TCO at work:
//factorial.cpp
#include <iostream>

unsigned long long int
fac(unsigned long long int n) {
    if (n == 0)
        return 1;
    return n * fac(n - 1);
}

int main() {
    unsigned long long int N;
    std::cin >> N;
    std::cout << fac(N);
}
Running echo "40000000" | ./factorial will get you the result (0) in no time if tail-call optimization was performed, or a "Segmentation fault" otherwise - a stack overflow due to the recursion depth.
This is actually a simple test for whether tail-call optimization was performed: the non-optimized version crashes with "Segmentation fault" at large recursion depths.
Corollary:
As already pointed out in the comments: only the second call of the quicksort is optimized via TLO. In your implementation, if you are unlucky and the second half of the array always consists of only one element, you will need O(n) space on the stack.
However, if the first call were always made with the smaller half and the second call (the one that gets the TLO) with the larger half, you would need at most O(log n) recursion depth and thus only O(log n) space on the stack.
That means you should check which part of the array you call the quicksort on first, as it plays a huge role; a sketch of that change is below.
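A hedged sketch of that change, keeping your partitioning code and names as they are: recurse explicitly on the smaller part and leave the larger part to the final (tail) call, so the recursion depth stays O(log n).
// Replace the two recursive calls at the end of qsort with:
if (i - start < finish - i) {
    qsort(arr, start, i - 1);   // smaller left part: genuine recursion
    qsort(arr, i + 1, finish);  // larger right part: candidate for tail-call optimisation
} else {
    qsort(arr, i + 1, finish);  // smaller right part: genuine recursion
    qsort(arr, start, i - 1);   // larger left part: candidate for tail-call optimisation
}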

Can reducing loop times in C++ codes help increase the speed?

I give the following example to illustrate my question:
#include <iostream>

void fun(int i, float *pt)
{
    // do something based on i
    std::cout << *(pt + i) << std::endl;
}

const unsigned int LOOP = 2000000007;

void fun_without_optimization()
{
    float *example;
    example = new float[LOOP];
    for (unsigned int i = 0; i < LOOP; i++)
    {
        fun(i, example);
    }
    delete[] example;
}

void fun_with_optimization()
{
    float *example;
    example = new float[LOOP];
    unsigned int unit_loop = LOOP / 10;
    unsigned int left_loop = LOOP % 10;
    float *pt = example;
    for (unsigned int i = 0; i < unit_loop; i++)
    {
        fun(0, pt);
        fun(1, pt);
        fun(2, pt);
        fun(3, pt);
        fun(4, pt);
        fun(5, pt);
        fun(6, pt);
        fun(7, pt);
        fun(8, pt);
        fun(9, pt);
        pt = pt + 10;
    }
    delete[] example;
}
As far as I understand, fun_without_optimization() and fun_with_optimization() should perform the same. The only argument for the second function being better than the first is that the pointer calculation in fun becomes simpler. Are there any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the tail, left_loop.
For at least a couple of decades, compilers have done this optimization for you.
How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the length of the assembly instructions in bytes? The compiler might.
Messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (a specific pattern in the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling (a sketch of what a correct unroll would have to look like is below). But even then - profile.
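For reference, if you do unroll by hand despite all of the above, a sketch that at least handles the remainder correctly might look like this (it reuses fun from the question; the unroll factor of 4 is an arbitrary choice):
void fun_with_manual_unroll(float* example, unsigned int total)
{
    unsigned int i = 0;
    // main unrolled loop: four calls per iteration
    for (; i + 4 <= total; i += 4)
    {
        fun(i + 0, example);
        fun(i + 1, example);
        fun(i + 2, example);
        fun(i + 3, example);
    }
    // remainder loop: the tail that the original version forgot
    for (; i < total; i++)
        fun(i, example);
}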
The technique you describe is called loop unrolling; it can potentially increase performance, as the time spent evaluating the control structures (updating the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you, and maintainability of the code decreases if it is done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and the level of "parallelization" the code allows, multiple instructions can be executed in one cycle.
So this code:
for (int i = 0; i < n; i++)   // n multiple of 4, for simplicity
    a += temp;                // just a random instruction
Will actually execute faster on a parallel architecture if rewritten like:
for (int i = 0; i < n; i += 4)
{
    temp0 = temp0 + temp1;  // reads and additions can be executed in parallel
    temp1 = temp2 + temp3;
    a = temp0 + temp1 + a;
}
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop there: the code you want to optimize has to be a contiguous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.
Writing your code, like:
for (unsigned int i = 0; i < unit_loop; i++)
{
    fun(0, pt);
    fun(1, pt);
    fun(2, pt);
    fun(3, pt);
    fun(4, pt);
    fun(5, pt);
    fun(6, pt);
    fun(7, pt);
    fun(8, pt);
    fun(9, pt);
    pt = pt + 10;
}
would not do much, unless the compiler inlines the function calls; and it looks like too many instructions anyway...
On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you want to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are likely interested in a particular situation - that's why some compilers have special directives to help with the optimization process.

Cache Optimization Theory

I am thinking about heavy memory-cache optimization and would like some feedback.
Consider this example:
class example
{
    float phase1;
    float phaseInc;
    float factor;
public:
    void process(float* buffer, unsigned int iSamples) //<-high prio audio thread
    {
        for (unsigned int i = 0; i < iSamples; i++) // mostly iSamples is 32
        {
            phase1 += phaseInc;
            float f1 = sinf(phase1); //<-sinf is just an example!
            buffer[i] = f1 * factor;
        }
    }
};
optimization idea:
void example::process(float* buffer, unsigned int iSamples)
{
    float stackMemory[3];                            // should fit in L1
    memcpy(stackMemory, &phase1, sizeof(float) * 3); // get all memory at once

    for (unsigned int i = 0; i < iSamples; i++)
    {
        stackMemory[0] += stackMemory[1];
        float f1 = sinf(stackMemory[0]);
        buffer[i] = f1 * stackMemory[2];
    }

    memcpy(&phase1, stackMemory, sizeof(float) * 1); // write back only changed memory
}
Note that the real sample loop will contain thousands of operations, so the stackMemory can become quite big.
I think it will be no more than 32 KB (are there any smaller L1 caches out there?).
Does the order of the variables in this stackMemory matter?
I hope not, because I'd like to order them so that I can reduce the write-back size.
Or does the L1 cache have the same cache-line behaviour that RAM has?
I have the feeling that I am somehow doing by hand what prefetch is made for, but everything I have read about prefetch is rather vague about how to use it efficiently. Trial and error is not an option with 5000+ lines of code.
Code will run on Windows, Mac and iOS. Any ARM<->Intel issues to expect?
Is it possible that this kind of optimization is useless, since all the memory is accessed and transferred to L1 on the first iteration of the loop anyway?
Thanks for any hints and ideas.
At first I thought there was a good chance that the second one could be slower as a result of additional memory access and instructions required for memcpy, while the first could simply work directly with these three class members already loaded into registers.
Nevertheless, I tried fiddling with the code in GCC 5.2 with both -O2 and -O3 and found that, no matter what I tried, I got identical assembly instructions for both. This is pretty amazing considering all the extra conceptual work that memcpy typically has to do that apparently got squashed away to zilch.
The one case I can think of where your second version might be faster in some scenario, on some compiler, is if the aliasing involved to access this->data_member interfered with an optimization and caused redundant loads and stores to/from registers.
It would have nothing to do with the L1 cache in that case and everything to do with register allocation on the compiler side. Caches are largely irrelevant here when you're loading the same memory (member variables) regardless for a contiguous chunk of data, it has entirely to do with registers. Nevertheless, I couldn't find a single scenario where I could cause that to happen where the compiler did a worse job with one over the other -- every case I tested yielded identical results. In a sufficiently complex real world case, perhaps there might be a difference.
Then again, in such a case, it would be safer to simply do:
void process(float* buffer, unsigned int iSamples)
{
    const float pi   = phaseInc;
    const float fact = factor;
    float p1 = phase1;                // work on a local copy of the member
    for (unsigned int i = 0; i < iSamples; i++)
    {
        p1 += pi;
        float f1 = sinf(p1);
        buffer[i] = f1 * fact;
    }
    phase1 = p1;                      // write the updated phase back once
}
There's no need to jump through hoops with memcpy to store the members into an array and back. That puts additional strain on the optimizer, even if, in my findings, the optimizer managed to eliminate the overhead typically associated with it.
I realize your example is simplified, but there should be no need to reduce the structure down to such a primitive array no matter how many data members you're dealing with (unless such an array actually is the most convenient representation). From a performance standpoint, the compiler will have an "easier" time (even if optimizers today are pretty amazing and can handle this) if you just use local variables instead of an array into which you memcpy aggregate data members in and out.

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? And how many cycles does sqrt take these days, anyway?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative; re-implement it with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? If so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes of the same size, so results can be usefully cached (see the sketch after this list).
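To illustrate the caching idea, a minimal sketch using an unordered_map keyed on the input; whether the map lookup is actually cheaper than recomputing the square root is exactly the kind of thing you have to measure for your workload:
#include <cmath>
#include <unordered_map>

// Only worthwhile if the same few values recur often; not thread-safe.
float cached_sqrt(float x)
{
    static std::unordered_map<float, float> cache;
    auto it = cache.find(x);
    if (it != cache.end())
        return it->second;          // cache hit: reuse the stored result
    float r = std::sqrt(x);
    cache.emplace(x, r);            // cache miss: compute and remember
    return r;
}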
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now, even more than back then, is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this: the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray. It is also very difficult to attribute program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of Intel's optimisation reference manual, which describes approximation methods using vectorised instructions; these would probably translate pretty well to other architectures too.
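For example, a minimal SSE sketch that square-roots four floats per instruction via _mm_sqrt_ps (this is the plain full-precision path; the approximation-plus-Newton-iteration variants in the manual trade accuracy for more speed). The alignment and size assumptions are only there to keep the sketch short:
#include <xmmintrin.h>   // SSE intrinsics; assumes an x86 target
#include <cstddef>

// n is assumed to be a multiple of 4 and both pointers 16-byte aligned.
void sqrt_array(const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(in + i);        // load 4 floats
        _mm_store_ps(out + i, _mm_sqrt_ps(v)); // sqrt all 4 at once
    }
}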
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's sqrtss is about 2x faster than the FPU's fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since this function uses 10% of your execution time, even if you manage to come up with an implementation that takes only 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications users wouldn't even notice, except if they used a stopwatch to measure.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec(naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
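Note that MSVC no longer accepts _asm blocks when targeting x64; a roughly equivalent single-value version using the SSE2 scalar intrinsic (a sketch only, and its precision/rounding behaviour differs from the x87 fsqrt above) would be:
#include <emmintrin.h>   // SSE2 intrinsics; assumes an x86/x64 target

inline double sse_sqrt(double n)
{
    __m128d v = _mm_set_sd(n);                  // put n into the low lane
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));    // sqrtsd on the low lane, back to double
}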
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
// LUT for fast sqrt of floats. The table consists of 2 parts: half for sqrt(X) and half for sqrt(2X).
#include <math.h>                       // sqrtf
typedef unsigned int UINT32;            // assumed to be a 32-bit type on the target
#define TRUE  1
#define FALSE 0

const int nBitsForSQRTprecision = 11;   // Use only the 11 most significant bits of the float's 23-bit mantissa. We could use 15 bits instead: less error, but a bigger table.
const int nUnusedBits = 23 - nBitsForSQRTprecision;        // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision + 1));  // 2^nBits * 2 because we have 2 halves of the table.

static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE;       // Once initialized this will be true

// Table of precalculated sqrt() for future fast calculation. Approximates the exact value with an error of about 0.5%.
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void) {
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize >> 1);
    for (i = 0; i < halfTableSize; i++) {
        *fi = 0;
        *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127

        // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
        f = sqrtf(f);
        sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

        // Repeat the process, this time with an exponent of 1, stored as 128
        *fi = 0;
        *fi = (i << nUnusedBits) | (128 << 23);
        f = sqrtf(f);
        sqrtTab[i + halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Calculation of a square root. Divide the exponent of the float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n) {
    if (n <= 0.f)
        return 0.f;             // On 0 or negative return 0.

    UINT32 *num = (UINT32*)&n;
    short e;                    // Exponent
    e = (*num >> 23) - 127;     // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;           // leave only the mantissa

    // If the exponent is odd we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize >> 1);
    const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01)
        *num |= secondHalphTableIdBit;

    e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign-preserving for signed operands)

    // Do the table lookup, based on the truncated mantissa, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}
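A short usage sketch of the two functions above; the printed relative error just illustrates the ~0.5% claim:
#include <cmath>
#include <iostream>

int main()
{
    build_fsqrt_table();                    // must run once before any lookup
    float x = 1234.567f;
    float approx = fast_float_sqrt(x);
    float exact  = std::sqrt(x);
    std::cout << "approx=" << approx
              << " exact=" << exact
              << " rel.err=" << 100.0f * std::fabs(approx - exact) / exact << "%\n";
}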