Why a power function is often computed as a logarithm? - c++

I am studying some x86 ASM code and what the code really does, it's my understanding that a power function (x^y) internally works as a logarithm function. By internally I mean the CPU registers.
Why is this? What is the benefit? Is it a benefit that can be replicated and borrowed by other high level languages like C++?

You should take a look at this MATH,
What the answer says is that the log functions can be achieved through taylor series and a look-up table combined. If there is a look-up table, then it may be hashed and the look-up will be easier than pow which only can be obtained through computation.
so for xy you can write it as
y = log<sub>10</sub> x .
or
y = (ln x)/(ln 10);
now if there is no log implementation for pow then it should be achieved through Addition chain exponentiation Wiki which requires runtime computation which might take longer than finding log.
EDIT : Thanks to #harold exponentiation can be performed more optimally using Exponentiation by squaring.

See the answer to this question: How to: pow(real, real) in x86
Think back to your logarithm rules: the base of 2 cancels with log2(x) leaving just x^y.
Edit:
We have an x86 instruction to calculate each of the components needed.

Related

Fast approximation of square_root(x) with known 0<=x<=1

How to approximate sqrt(float32bit x) if I know that x<=1?
There have to be some tricks to exploit the range x<=1.
The return result doesn't have to be precise, the maximum error may be <0.001.
(0.001 is just a magic number, you can change it.)
I don't mind language, but I prefer C++, in CPU (not GPU).
I think explicit formula is better than table look-up.
It is useful for particles in my 3D game (VS 2015 + Ogre3D + Bullet), and I can't find any clue about it.
I doubt the reason of the downvote storm is that it looks like an assignment / interview?
... or the solution is already well-known?
Square roots are ordinarily computed through FSQRT which getting increasingly faster. It is a single instruction that you practically can't beat. The fast inverse square root for example won't help unless you can use its result directly. If you have to invert it again, that FDIV alone will take roughly as much time as FSQRT would have.

Time complexity of c++ math library pow() function?

I wanted to know what is the worst case time-complexity of the pow() function that's built in c++?
That depends on the underlying architecture. On the most common desktop architecture, x86, this is a constant time operation.
See this question for more details on how it could be implemented on x86: How to: pow(real, real) in x86
Here is one implementation, take a look. To be sure, it is a rather complex piece of code, with some 19 special cases. The time complexity does not appear to be dependent on the values passed in.
Here is a short description of the method used to compute pow(x,y):
Method: Let `x = 2 * (1+f)`
Compute and return log2(x) in two pieces:
log2(x) = w1 + w2,
where w1 has 53-24 = 29 bit trailing zeros.
Perform y*log2(x) = n+y' by simulating muti-precision
arithmetic, where |y'|<=0.5.
Return x**y = 2**n*exp(y'*log2)
You don't mention what system/architecture you're on, so we are left guessing.
However, if you're not looking for specifics and just want to browse the code of a freely available implementation See http://www.netlib.org/fdlibm/, specifically http://www.netlib.org/fdlibm/w_pow.c
See this question's answer for more background: https://stackoverflow.com/a/2285277/25882

C/C++ fastest cmath log operation

I'm trying to calculate logab (and get a floating point back, not an integer). I was planning to do this as log(b)/log(a). Mathematically speaking, I can use any of the cmath log functions (base 2, e, or 10) to do this calculation; however, I will be running this calculation a lot during my program, so I was wondering if one of them is significantly faster than the others (or better yet, if there is a faster, but still simple, way to do this). If it matters, both a and b are integers.
First, precalculate 1.0/log(a) and multiply each log(b) by that expression instead.
Edit: I originally said that the natural logarithm (base e) would be fastest, but others state that base 2 is supported directly by the processor and would be fastest. I have no reason to doubt it.
Edit 2: I originally assumed that a was a constant, but in re-reading the question that is never stated. If so then there would be no benefit to precalculating. If it is however, you can maintain readability with an appropriate choice of variable names:
const double base_a = 1.0 / log(a);
for (int b = 0; b < bazillions; ++b)
double result = log(b) * base_a;
Strangely enough Microsoft doesn't supply a base 2 log function, which explains why I was unfamiliar with it. Also the x86 instruction for calculating logs includes a multiplication automatically, and the constants required for the different bases are also available via an optimized instruction, so I'd expect the 3 different log functions to have identical timing (even base 2 would have to multiply by 1).
Since b and a are integers, you can use all the glory of bit twiddling to find their logs to the base 2. Here are some:
Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)
Find the integer log base 2 of an integer with an 64-bit IEEE float
Find the log base 2 of an integer with a lookup table
Find the log base 2 of an N-bit integer in O(lg(N)) operations
Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup
I'll leave it to you to choose the best "fast-log" function for your needs.
On the platforms for which I have data, log2 is very slightly faster than the others, in line with my expectations. Note however, that the difference is extremely slight (only a couple percent). This is really not worth worrying about.
Write an implementation that is clear. Then measure the performance.
In the 8087 instruction set, there is only an instruction for the logarithm to base 2, so I would guess this one to be the fastest.
Of course this kind of question depends largely on your processor/architecture, so I would suggest to make a simple test and time it.
The answer is:
it depends
profile it
You don't even mention your CPU type, the variable type, the compiler flags, the data layout. If you need to do lot's of these in parallel, I'm sure there will be a SIMD option. Your compiler will optimize that as long as you use alignment and clear simple loops (or valarray if you like archaic approaches).
Chances are, the intel compiler has specific tricks for intel processors in this area.
If you really wanted you could use CUDA and leverage GPU.
I suppose, if you are unfortunate enough to lack these instruction sets you could go down at the bit fiddling level and write an algorithm that does a nice approximation. In this case, I can bet more than one apple-pie that 2-log is going to be faster than any other base-log

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks
First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.
What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.
I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.
You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.
If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.
If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).
If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.
There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.
If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.
Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).
I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.
From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.
Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.
Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.
Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?
(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root
Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.
Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.
The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.
I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.
See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.

Fast implementation/approximation of pow() function in C/C++

I m looking for a faster implementation or good a approximation of functions provided by cmath.
I need to speed up the following functions
pow(x,y)
exp(z*pow(x,y))
where z<0. x is from (-1.0,1.0) and y is from (0.0, 5.0)
Here are some approxmiations:
Optimized pow Approximation for Java and C / C++. This approximation is very inaccurate, you have to try for yourself if it is good enough.
Optimized Exponential Functions for Java. Quite good! I use it for a neural net.
If the above approximation for pow is not good enough, you can still try to replace it with exponential functions, depending on your machine and compiler this might be faster:
x^y = e^(y*ln(x))
And the result: e^(z * x^y) = e^(z * e^(y*ln(x)))
Another trick is when some parameters of the formula do not change often. So if e.g. x and y are mostly constant, you can precalculate x^y and reuse this.
What are the possible values of x and y?
If they are within reasonable bounds, building some lookup tables could help.
I recommend the routines in the book "Math Toolkit for Real-Time Programming" by Jack W. Crenshaw.
You might also want to post some of your code to show how you are calling these functions, as there may be some other higher level optimisation possibilities that are not apparent from the description given so far.