pow(NAN) is very slow - c++

What is the reason for the catastrophic performance of pow() for NaN values? As far as I can work out, NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU.
This seems to be true for elementary operations, but not for pow(). I compared multiplication and division of a double to squaring and then taking the square root. If I compile the piece of code below with g++ -lrt, I get the following result:
multTime(3.14159): 20.1328ms
multTime(nan): 244.173ms
powTime(3.14159): 92.0235ms
powTime(nan): 1322.33ms
As expected, calculations involving NaN take considerably longer. Compiling with g++ -lrt -msse2 -mfpmath=sse however results in the following times:
multTime(3.14159): 22.0213ms
multTime(nan): 13.066ms
powTime(3.14159): 97.7823ms
powTime(nan): 1211.27ms
The multiplication / division of NaN is now much faster (actually faster than with a real number), but the squaring and taking the square root still takes a very long time.
Test code (compiled with gcc 4.1.2 on 32bit OpenSuSE 10.2 in VMWare, CPU is a Core i7-2620M)
#include <iostream>
#include <time.h> // clock_gettime, struct timespec
#include <cmath>
void multTime( double d )
{
struct timespec startTime, endTime;
double durationNanoseconds;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);
for(int i=0; i<1000000; i++)
{
d = 2*d;
d = 0.5*d;
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
std::cout << "multTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}
void powTime( double d )
{
struct timespec startTime, endTime;
double durationNanoseconds;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);
for(int i=0; i<1000000; i++)
{
d = pow(d,2);
d = pow(d,0.5);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
std::cout << "powTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}
int main()
{
multTime(3.14159);
multTime(NAN);
powTime(3.14159);
powTime(NAN);
}
Edit:
Unfortunately, my knowledge on this topic is extremely limited, but I guess that the glibc pow() never uses SSE on a 32bit system, but rather some assembly in sysdeps/i386/fpu/e_pow.S. There is a function __ieee754_pow_sse2 in more recent glibc versions, but it's in sysdeps/x86_64/fpu/multiarch/e_pow.c and therefore probably only works on x64. However, all of this might be irrelevant here, because pow() is also a gcc built-in function. For an easy fix, see Z boson's answer.

"NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU."
I'm not sure this follows from the resource you quote. In any case, pow is a C library function. It is not implemented as an instruction, even on x87. So there are 2 separate issues here - how SSE handles NaN values, and how a pow function implementation handles NaN values.
If the pow function implementation uses a different path for special values like +/-Inf, or NaN, you might expect a NaN value for the base, or exponent, to return a value quickly. On the other hand, the implementation might not handle this as a separate case, and simply relies on floating-point operations to propagate intermediate results as NaN values.
Starting with 'Sandy Bridge', many of the performance penalties associated with denormals were reduced or eliminated. Not all though, as the author describes a penalty for mulps. Therefore, it would be reasonable to expect that not all arithmetic operations involving NaNs are 'fast'. Some architectures might even revert to microcode to handle NaNs in different contexts.

Your math library is too old. Either find another math library which implements pow with NAN better or implement a fix like this:
inline double pow_fix(double x, double y)
{
if(x!=x) return x;
if(y!=y) return y;
return pow(x,y);
}
Compile with g++ -O3 -msse2 -mfpmath=sse foo.cpp.
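One caveat with the wrapper above: C99 Annex F defines pow(x, 0) as 1 and pow(1, y) as 1 even when the other argument is NaN, so returning the NaN unconditionally changes those two results. A minimal sketch that keeps them, assuming a C++11 compiler; the name pow_nan_fast is made up for this illustration and is not part of any library:

```cpp
#include <cmath>

// Hypothetical helper (not from any library): short-circuits NaN inputs
// before calling the possibly slow library pow().
inline double pow_nan_fast(double x, double y)
{
    if (std::isnan(x)) return (y == 0.0) ? 1.0 : x;  // pow(NaN, +-0) is 1
    if (std::isnan(y)) return (x == 1.0) ? 1.0 : y;  // pow(1, NaN) is 1
    return std::pow(x, y);                           // ordinary case
}
```

Whether the two extra comparisons matter should be measured; for a loop dominated by NaN inputs they are cheap compared to the slow path reported in the question.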

If you want to do squaring or taking the square root, use d*d or sqrt(d). The pow(d,2) and pow(d,0.5) will be slower and possibly less accurate, unless your compiler optimizes them based on the constant second argument 2 and 0.5; note that such an optimization may not always be possible for pow(d,0.5) since it returns 0.0 if d is a negative zero, while sqrt(d) returns -0.0.
For those doing timings, please make sure that you test the same thing.
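The negative-zero difference mentioned above can be checked directly. A small sketch, assuming an IEEE 754 / C99 Annex F conforming math library and no -ffast-math style flags; same_as_sqrt is a name chosen for this example:

```cpp
#include <cmath>

// True when pow(d, 0.5) and sqrt(d) agree in both value and sign bit.
inline bool same_as_sqrt(double d)
{
    double p = std::pow(d, 0.5);
    double s = std::sqrt(d);
    return p == s && std::signbit(p) == std::signbit(s);
}
```

For d = -0.0 the values compare equal but the sign bits differ, and for d = -infinity pow gives +infinity while sqrt gives NaN, which is why a compiler cannot blindly replace pow(d, 0.5) by sqrt(d).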

With a complex function like pow() there are lots of ways that NaN could trigger slowness. It could be that the operations on NaNs are slow, or it could be that the pow() implementation checks for all sorts of special values that it can handle efficiently, and the NaN values fail all of those tests, leading to a more expensive path being taken. You'd have to step through the code to find out for sure.
A more recent implementation of pow() might include additional checks to handle NaN more efficiently, but this is always a tradeoff -- it would be a shame to have pow() handle 'normal' cases more slowly in order to accelerate NaN handling.
My blog post only applied to individual instructions, not complex functions like pow().

Related

Libc hypot function seems to return incorrect results for double type... why?

#include <cmath>
#include <iostream>
#include <limits>
int main(int argc, char** argv) {
#define NUM1 -0.031679909079365576
#define NUM2 -0.11491794452567111
std::cout << "double precision :"<< std::endl;
typedef std::numeric_limits< double > dbl;
std::cout.precision(dbl::max_digits10);
std::cout << std::hypot((double)NUM1, (double)NUM2);
std::cout << " VS sqrt :" << sqrt((double )NUM1*(double )NUM1
+ (double )NUM2*(double )NUM2) << std::endl;
std::cout << "long double precision :"<< std::endl;
typedef std::numeric_limits<long double > ldbl;
std::cout.precision(ldbl::max_digits10);
std::cout << std::hypot((long double)NUM1, (long double)NUM2);
std::cout << " VS sqrt :" << sqrt((long double )NUM1*(long double )NUM1 + (long double )NUM2*(long double )NUM2);
}
Returns under Linux (Ubuntu 18.04 clang or gcc, whatever optimisation, glibc 2.25):
double precision :
0.1192046585217293 VS sqrt :0.11920465852172932
long double precision :
0.119204658521729311251 VS sqrt :0.119204658521729311251
According to the cppreference :
Implementations usually guarantee precision of less than 1 ulp (units in the last place): GNU, BSD, Open64
std::hypot(x, y) is equivalent to std::abs(std::complex(x,y))
POSIX specifies that underflow may only occur when both arguments are subnormal and the correct result is also subnormal (this forbids naive implementations)
So, hypot((double)NUM1, (double)NUM2) should return 0.11920465852172932, I suppose (like the naive sqrt implementation).
On windows, using MSVC 64 bit, this is the case.
Why do we see this difference using glibc ? How is it possible to solve this inconsistency ?
0.11920465852172932 is represented by 0x1.e84324de1b576p-4 (as a double)
0.11920465852172930 is represented by 0x1.e84324de1b575p-4 (as a double)
0.119204658521729311251 is the long-double result, which we can assume is correct to a couple more decimal places, i.e. the exact result is closer to the rounded-up result.
Those FP bit-patterns differ only in the low bit of the mantissa (aka significand), and the exact result is between them. So they each have less than 1 ulp of rounding error, achieving what typical implementations (including glibc) aim for.
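That the two bit patterns really are neighbouring doubles can be confirmed with std::nextafter. A sketch, assuming a C library whose strtod parses hexadecimal floating-point strings (C99 behaviour); steps_between is a helper name invented here:

```cpp
#include <cmath>
#include <cstdlib>

// Counts how many representable doubles must be stepped through to get from
// a up to b. Assumes a <= b, both finite, and a small gap (illustration only).
inline int steps_between(double a, double b)
{
    int n = 0;
    while (a < b) {
        a = std::nextafter(a, b);  // next representable double towards b
        ++n;
    }
    return n;
}
```

Calling steps_between(std::strtod("0x1.e84324de1b575p-4", nullptr), std::strtod("0x1.e84324de1b576p-4", nullptr)) returns 1, i.e. the two results are exactly 1 ulp apart.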
Unlike IEEE-754 "basic" operations (add/sub/mul/div/sqrt), hypot is not required to be "correctly rounded", which would mean <= 0.5 ulp of error. Achieving that would be much slower for operations the HW doesn't provide directly. (e.g. do extended-precision calculation with at least a couple extra definitely-correct bits, so you can round to the nearest double, like the hardware does for basic operations)
It happens that in this case, the naive calculation method produced the correctly-rounded result while glibc's "safe" implementation of std::hypot (that has to avoid underflow when squaring small numbers before adding) produced a result with >0.5 but <1 ulp of error.
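Why the library cannot simply use the naive formula shows up at the ends of the exponent range. A sketch, assuming double arithmetic is evaluated in plain 64-bit precision (e.g. SSE2, not x87 extended precision); naive_hypot is a name chosen for this example:

```cpp
#include <cmath>

// The "naive" formula: squaring can overflow or underflow even when the
// final result would be perfectly representable.
inline double naive_hypot(double x, double y)
{
    return std::sqrt(x * x + y * y);
}
```

With x = y = 1e200 the squares overflow and naive_hypot returns infinity, while std::hypot returns a finite value near 1.414e200; with x = y = 1e-200 the squares underflow to zero and the naive result is 0. The extra work needed to avoid this is what can cost the last bit of accuracy in ordinary cases.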
You didn't specify whether you were using MSVC in 32-bit mode.
Presumably 32-bit mode would be using x87 for FP math, giving extra temporary precision. Although some MSVC versions' CRT code sets the x87 FPU's internal precision to round to 53-bit mantissa after every operation, so it behaves like SSE2 using actual double, except with a wider exponent range. See Bruce Dawson's blog post.
So I don't know if there's any reason beyond luck that MSVC's std::hypot got the correctly-rounded result for this.
Note that long double in MSVC is the same type as 64-bit double; that C++ implementation doesn't expose x86 / x86-64's 80-bit hardware extended-precision type (which has a 64-bit mantissa).

Why does this same code produce two different fp results on different machines?

Here's the code:
#include <iostream>
#include <math.h>
const double ln2per12 = log(2.0) / 12.0;
int main() {
std::cout.precision(100);
double target = 9.800000000000000710542735760100185871124267578125;
double unnormalizatedValue = 9.79999999999063220457173883914947509765625;
double ln2per12edValue = unnormalizatedValue * ln2per12;
double errorLn2per12 = fabs(target - ln2per12edValue / ln2per12);
std::cout << unnormalizatedValue << std::endl;
std::cout << ln2per12 << std::endl;
std::cout << errorLn2per12 << " <<<<< its different" << std::endl;
}
If I try on my machine (MSVC), or here (GCC):
errorLn2per12 = 9.3702823278363212011754512786865234375e-12
Instead, here (GCC):
errorLn2per12 = 9.368505970996920950710773468017578125e-12
which is different. Is it due to machine epsilon? Or compiler precision flags? Or a different IEEE evaluation?
What's the cause of this drift? The problem seems to be in the fabs() function (since the other values seem the same).
Even without -Ofast, the C++ standard does not require implementations to be exact with log (or sin, or exp, etc.); it leaves their accuracy to the implementation, and in practice libraries aim to be within a few ulp (i.e. there may be some inaccuracies in the last binary places). This allows faster hardware (or software) approximations, which each platform/compiler may do differently.
(The only floating point math function that you will always get perfect results from on all platforms is sqrt.)
More annoyingly, you may even get different results between compilation (the compiler may use some internal library to be as precise as float/double allows for constant expressions) and runtime (e.g. hardware-supported approximations).
If you want log to give the exact same result across platforms and compilers, you will have to implement it yourself using only +, -, *, / and sqrt (or find a library with this guarantee). And avoid a whole host of pitfalls along the way.
If you need floating point determinism in general, I strongly recommend reading this article to understand how big of a problem you have ahead of you: https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/

How to force pow(float, int) to return float

The overloaded function float pow(float base, int iexp ) was removed in C++11 and now pow returns a double. In my program, I am computing lots of these (in single precision) and I am interested in the most efficient way how to do it.
Is there some special function (in standard libraries or any other) with the above signature?
If not, is it better (in terms of performance in single precision) to explicitly cast result of pow into float before any other operations (which would cast everything else into double) or cast iexp into float and use overloaded function float pow(float base, float exp)?
EDIT: Why I need float and do not use double?
The primary reason is RAM -- I need tens or hundreds of GB, so this reduction is a huge advantage. So I need to get a float from a float. And now I need the most efficient way to achieve that (fewer casts, using already optimized algorithms, etc).
You could easily write your own fpow using exponentiation by squaring.
float my_fpow(float base, unsigned exp)
{
float result = 1.f;
while (exp)
{
if (exp & 1)
result *= base;
exp >>= 1;
base *= base;
}
return result;
}
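The function above takes an unsigned exponent, while the removed overload took an int. A sketch of a wrapper with the original signature, assuming that a negative exponent should simply yield the reciprocal; fpow_signed is a hypothetical name:

```cpp
// Hypothetical wrapper: the same square-and-multiply loop, accepting a signed exponent.
inline float fpow_signed(float base, int exp)
{
    // Take the magnitude in unsigned arithmetic so that INT_MIN does not overflow.
    unsigned n = exp < 0 ? 0u - static_cast<unsigned>(exp)
                         : static_cast<unsigned>(exp);
    float result = 1.f;
    while (n)
    {
        if (n & 1u)
            result *= base;   // multiply in the power for this set bit
        n >>= 1;
        base *= base;         // base, base^2, base^4, ...
    }
    return exp < 0 ? 1.f / result : result;
}
```

For negative exponents the intermediate positive power can overflow to infinity even though the true result is a representable tiny number, so this shortcut gives up some range compared to a library pow.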
Boring part:
This algorithm gives the best accuracy that can be achieved with the float type when |base| > 1.
Proof:
Suppose we want to calculate pow(a, n), where a is the base and n is the exponent.
Let's define b_1 = a^1, b_2 = a^2, b_3 = a^4, b_4 = a^8, and so on.
Then a^n is the product of all b_i for which the i-th bit is set in n.
So we have an ordered set B = {b_k1, b_k2, ..., b_km} such that for every j the bit k_j is set in n.
The following obvious algorithm A can be used to minimize the rounding error:
1. If B contains a single element, then it is the result.
2. Pick the two elements p and q from B with the smallest absolute values.
3. Remove them from B.
4. Calculate the product s = p*q and put it into B.
5. Go to the first step.
Now, let's prove that the elements of B can simply be multiplied from left to right without losing accuracy. This follows from the fact that:
|b_j| > |b_1*b_2*...*b_(j-1)|
because b_j = b_(j-1)*b_(j-1) = b_(j-1)*b_(j-2)*b_(j-2) = ... = b_(j-1)*b_(j-2)*...*b_1*b_1.
Since b_1 = a^1 = a and its absolute value is greater than one:
|b_j| > |b_1*b_2*...*b_(j-1)|
Hence we may conclude that during multiplication from left to right the accumulator variable is smaller in absolute value than any remaining element of B.
Then the expression result *= base; (except in the very first iteration, of course) multiplies the two smallest numbers from B, so the rounding error is minimal. So, the code employs algorithm A.
Another question that can only be honestly answered with "wrong question". Or at least: "Are you really willing to go there?". float theoretically needs ca. 80% less die space (for the same number of cycles) and so can be much cheaper for bulk processing. GPUs love float for this reason.
However, let's look at x86 (admittedly, you didn't say what architecture you're on, so I picked the most common). The price in die space has already been paid. You literally gain nothing by using float for calculations. Actually, you may even lose throughput because additional extensions from float to double are required, and additional rounding to intermediate float precision. In other words, you pay extra to have a less accurate result. This is typically something to avoid except maybe when you need maximum compatibility with some other program.
See Jens' comment as well. These options give the compiler permission to disregard some language rules to achieve higher performance. Needless to say this can sometimes backfire.
There are two scenarios where float might be more efficient, on x86:
GPU (including GPGPU), in fact many GPUs don't even support double and if they do, it's usually much slower. Yet, you will only notice when doing very many calculations of this sort.
CPU SIMD aka vectorization
You'd know if you did GPGPU. Explicit vectorization by using compiler intrinsics is also a choice – one you could make, for sure, but this requires quite a cost-benefit analysis. Possibly your compiler is able to auto-vectorize some loops, but this is usually limited to "obvious" applications, such as where you multiply each number in a vector<float> by another float, and this case is not so obvious IMO. Even if you pow each number in such a vector by the same int, the compiler may not be smart enough to vectorize this effectively, especially if pow resides in another translation unit, and without effective link time code generation.
If you are not ready to consider changing the whole structure of your program to allow effective use of SIMD (including GPGPU), and you're not on an architecture where float is indeed much cheaper by default, I suggest you stick with double by all means, and consider float at best a storage format that may be useful to conserve RAM, or to improve cache locality (when you have a lot of them). Even then, measuring is an excellent idea.
That said, you could try ivaigult's algorithm (only with double for the intermediate and for the result), which is related to a classical algorithm called Egyptian multiplication (and a variety of other names), only that the operands are multiplied and not added. I don't know how pow(double, double) works exactly, but it is conceivable that this algorithm could be faster in some cases. Again, you should be OCD about benchmarking.
If you're targeting GCC you can try
float __builtin_powif(float, int)
I have no idea about its performance though.
Is there some special function (in standard libraries or any other) with the above signature?
Unfortunately, not that I know of.
But, as many have already mentioned benchmarking is necessary to understand if there is even an issue at all.
I've assembled a quick benchmark online. Benchmark code:
#include <iostream>
#include <boost/timer/timer.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <cmath>
#include <vector>
int main ()
{
boost::random::mt19937 gen;
boost::random::uniform_real_distribution<> dist(0, 10000000);
const size_t size = 10000000;
std::vector<float> bases(size);
std::vector<float> fexp(size);
std::vector<int> iexp(size);
std::vector<float> res(size);
for(size_t i=0; i<size; i++)
{
bases[i] = dist(gen);
iexp[i] = std::floor(dist(gen));
fexp[i] = iexp[i];
}
std::cout << "float pow(float, int):" << std::endl;
{
boost::timer::auto_cpu_timer timer;
for(size_t i=0; i<size; i++)
res[i] = std::pow(bases[i], iexp[i]);
}
std::cout << "float pow(float, float):" << std::endl;
{
boost::timer::auto_cpu_timer timer;
for(size_t i=0; i<size; i++)
res[i] = std::pow(bases[i], fexp[i]);
}
return 0;
}
Benchmark results (quick conclusions):
gcc: c++11 is consistently faster than c++03.
clang: indeed the int version of c++03 seems a little faster. I'm not sure if it is within the margin of error, since I only ran the benchmark online.
Both: even with c++11 calling pow with int seems to be a tad more performant.
It would be great if others could verify if this holds for their configurations as well.
Try using powf() instead. This is a C99 function that should also be available in C++11.
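The return types discussed in this thread can be verified at compile time. A sketch, assuming a C++11 (or later) standard library; pow_float_int is a hypothetical helper name, and converting the exponent to float is exact only for |iexp| up to 2^24:

```cpp
#include <cmath>
#include <math.h>
#include <type_traits>

// C++11 and later: an integer exponent promotes the whole call to double.
static_assert(std::is_same<decltype(std::pow(1.0f, 2)), double>::value,
              "pow(float, int) yields double");
// With two float arguments the float overload is chosen.
static_assert(std::is_same<decltype(std::pow(1.0f, 2.0f)), float>::value,
              "pow(float, float) yields float");
// The C99 function powf always works in float.
static_assert(std::is_same<decltype(::powf(1.0f, 2.0f)), float>::value,
              "powf yields float");

// Hypothetical helper: keeps the computation in single precision.
inline float pow_float_int(float base, int iexp)
{
    return std::pow(base, static_cast<float>(iexp));
}
```

Whether this is faster than computing in double and converting the result back to float depends on the library and has to be benchmarked.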

Is it possible to roll a significantly faster version of sqrt

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
The MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE's sqrtss is about 2x faster than the x87 FPU's fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, unless they use a watch to measure.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
// LUT for fast sqrt of floats. The table consists of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only the 11 most significant bits of the float's 23 mantissa bits. We can use 15 bits instead. It will produce less error but take more memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Once initialized will be true
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
unsigned short i;
float f;
UINT32 *fi = (UINT32*)&f;
if (is_sqrttab_initialized)
return;
const int halfTableSize = (tableSize>>1);
for (i=0; i < halfTableSize; i++){
*fi = 0;
*fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
// Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
f = sqrtf(f);
sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
// Repeat the process, this time with an exponent of 1, stored as 128
*fi = 0;
*fi = (i << nUnusedBits) | (128 << 23);
f = sqrtf(f);
sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
}
is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
if (n <= 0.f)
return 0.f; // On 0 or negative return 0.
UINT32 *num = (UINT32*)&n;
short e; // Exponent
e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
*num &= 0x7fffff; // leave only the mantissa
// If the exponent is odd so we have to look it up in the second half of the lookup table, so we set the high bit.
const int halfTableSize = (tableSize>>1);
const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
if (e & 0x01)
*num |= secondHalphTableIdBit;
e >>= 1; // Divide the exponent by two (note: right-shifting a negative value is sign preserving on mainstream compilers)
// Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
*num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
return n;
}

The most efficient way of implementing pow() function in floating point

I am trying to implement my own version of pow() and sqrt() function as my custom library doesn't have pow()/sqrt() floating point support.
Can anyone help?
Yes, Sun can (Oracle now, I guess):
fdlibm, the "freely distributable math library", has sqrt and pow, along with many other math functions.
They're fairly high-tech implementations, though, and of course nothing is ever the "most efficient" implementation of something like this. Are you after source code to get it done, or are you really not so much looking for pow and sqrt, but actually looking for an education in floating-point algorithms programming?
Sure - it's easy if you have exponential and natural log functions.
Since y = x^n, you can take the natural log of both sides:
ln(y) = n*ln(x)
Then taking the exponential of both sides gives you what you want:
y = exp(n*ln(x))
If you want something better, the best place I know to look is Abramowitz and Stegun.
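The identity above translates directly into code. A sketch, assuming exp and log are available and that the base is strictly positive (log is undefined for zero and negative bases); pow_via_log is a name chosen for this example:

```cpp
#include <cmath>

// y = x^n computed as exp(n * ln(x)); valid only for x > 0.
inline double pow_via_log(double x, double n)
{
    return std::exp(n * std::log(x));
}
```

The rounding error of log is multiplied by n before it goes through exp, so for large exponents the result is noticeably less accurate than a dedicated pow implementation.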
Note that if your instruction set has an instruction for square root or power, you'll be much better off using that. The x87 floating point instructions, for example, have an instruction fsqrt, and the SSE2 additions include another instruction sqrtsd, which are probably going to be much faster than most solutions written in C. In fact, at least gcc uses these two instructions when compiling for an x86 machine.
For power, however, things get somewhat murky. There's an instruction in the x87 floating point instruction set that can be used to calculate y*log2(x), namely fyl2x. Another instruction, fldl2e, pushes log2(e) onto the floating point stack. You might want to give these a look.
You might also want to take a look at how individual C libraries do this. dietlibc, for example, simply uses fsqrt:
sqrt:
fldl 4(%esp)
fsqrt
ret
glibc uses Sun's implementation for machines where a hardware square root instruction is not available (under sysdeps/ieee754/flt-32/e-sqrtf.c), and uses fsqrt on the x86 instruction set (though gcc can be instructed to instead use the sqrtsd instruction.)
Square root is properly implemented with an iterative Newton's method.
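double ipow(int base, int exp)
{
    bool negative = exp < 0;
    // Work on the magnitude in unsigned arithmetic so that INT_MIN does not overflow.
    unsigned n = negative ? 0u - static_cast<unsigned>(exp) : static_cast<unsigned>(exp);
    // Accumulate in double: an int result overflows quickly (e.g. for 10^10).
    double b = base;
    double result = 1.0;
    while (n)
    {
        if (n & 1)
            result *= b;
        n >>= 1;
        b *= b;
    }
    return negative ? 1.0 / result : result;
}
// A suitable way to implement the power function for an integer base and an integer exponent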
For calculating the square root of a float in C I'd recommend using fsqrt if you target x86.
You can use such ASM instruction with:
asm("fsqrt" : "+t"(myfloat));
For GCC or
__asm {
fld myfloat
fsqrt
fstp myfloat
}
Or something like that for Visual Studio.
For implementing pow, using a big switch statement like the one at upitasoft.com/link/powLUT.h should do.
It can cause some cache problems but if you keep it like that it shouldn't be a problem, just limit the range (note, you can still optimize the code I provided).
If you want to support floating point powers, it is way harder...
You can try using the natural logarithm and exponential functions, such as:
float result = exp(power * log(number)); // number^power, valid for number > 0
But usually it is slow and/or imprecise.
Hope I helped.
The fastest way I can think of doing a pow() would be along these lines (note, this is pretty complicated):
//raise x^y (y >= 0 and x != 0; negative exponents are not handled)
#include <map>
double pow(double x, int y) {
    if (y == 0) return 1.0;
    int power;
    std::map<int, double> powers;
    // record x^1, x^2, x^4, ... until power >= y; afterwards x holds x^power
    for (power = 1; power < y; power *= 2, x *= x)
        powers.insert(std::make_pair(power, x));
    while (power > y) {
        //figure out how to get there
        std::map<int, double>::iterator p = powers.upper_bound(power - y);
        --p;
        //p is an iterator that points to the biggest power we have that doesn't go over power - y
        power -= p->first;
        x /= p->second;
    }
    return x;
}
I have no idea about how to implement a decimal power. My best guess would be to use logarithms.
Edit: I'm attempting a logarithmic solution (based on y), as opposed to a linear solution, which you propose. Let me work this out and edit it, because I know it works.
Edit 2: Hehe, my bad. power *= 2 instead of power++