Predefinition of often used values in computations - does it change anything?

Predefinition of often used values in computations - does it change anything? - c++

I'm auto generating C code to compute large expressions and try to figure out with simple examples whether it makes sense to predefine certain subparts in separate variables.
As a simple example, say we compute something of the form:
#include <cmath>
double test(double x, double y) {
const double c[9][9] = { ... }; // constants properly initialized, irrelevant
double expr = c[0][0]*x*y
+ c[1][0]*pow(x,2)*y + ... + c[8][0]*pow(x,9)*y
+ c[1][1]*pow(x,2)*pow(y,2) + ... + c[8][1]*pow(x,9)*pow(y,2)
+ ...
with all c[i][j] properly initialized. In reality those expressions contain tens of millions of multiplications and additions.
A colleague now proposed -- to reduce the number of calls to pow() and to cache often needed values in the expressions -- to define every power of x and y in a separate variable, which is no big deal as the code is auto generated anyway, like this:
double xp2 = pow(x,2);
double xp3 = pow(x,3);
double xp4 = pow(x,4);
// ...
// same for pow(y,n)
I think, however, that this is unnecessary, as the compiler should take care of these optimizations.
Unfortunately, I have no experience with reading and interpreting assembly but I think I see that all the calls to pow() are optimized out, is this right? Also, does the compiler cache the values for pow(x,2), pow(x,3), etc?
Thanks in advance for your input!

Using pow with integer arguments... ouch ! Typical implementations of pow are tuned for the general case of floating point arguments, which is why it is usually way slower to write
pow(x, 2) ( = exp(2 * log(x)) )
than
x * x
What I state here is very compiler dependant though. On one hand, some compilers may not even know that pow(x, 2) will yield the same value for a given x (after all, the extern function pow could have side effects), so you don't have any guarantee that common subexpressions will be eliminated. The pow function, on some (many ?) platforms/toolchains, is provided by a library the compiler has no control onto.
On other implementations though, the compiler may turn those pow calls into multiplications, or at least into intrinsics, which may in turn specialize for integer exponents. Your mileage will vary.
The first thing I'd do is to replace calls to pow by multiplications. For larger exponents, you may also do, eg.
double x2 = x * x;
double x3 = x * x2;
double x4 = x2 * x2;
Note that (credits to #Stephen Canon) doing repeated multiplications (with the above quick exponentiation scheme) will introduce roundoff error whose magnitude is proportional to the number of multiplications (ie. O(log exponent)). This error is typically tolerable, but pow guarantees exactness within one unit of least precision.

The compiler may perform common subexpression elimination- remember that it can't guarantee that all functions are re-entrant, but if pow is inlined, then it may well do this.

A good way to compute polynomials is Horner's rule. (eg here) which doesn't require pow() or any extra memory.
Your expression is x*y times a polynomial in y each of whose coefficients is a polynomial in x.
Each of these coefficients can be calculated using Horner with 8 multiplies and additions, and the polynomial in y with 8 more multiplies and additions for a total of 74 multiplies and 72 additions , whereas your sample code looks to me like more that 200 multiplications and more than a hundred calls to pow().

pow may be optimized away depending on the toolchain. The only way you can tell is to try it and see.
In the general case, unless the implementation of pow is visible to the compiler as a macro or inline, then the compiler can't cache the result as it doesn't know what side-effects the function may have.

Profile, find out where the bottlenecks are.
If the sub-expressions are used frequently, it may make sense to cache or store the intermediate values. However, accessing these values may take more time than letting the values sit in a data pipeline within the processor. Data fetches outside of the processor are much slower than fetching from its internal data cache.
Also try using Algebra to simplify the mathematical expressions. Perhaps even Linear Algebra to find some more efficient matrix expressions.
You may want to isolate the calculations to expressions involving one variable. Compilers can optimize code better when only one variable is used or changing at a time. For example, substitute the y variable with expressions involving x, if possible. This would reduce to an expression only involving x.
Also search the web for "data driven design" or "data oriented design". These sites show how to optimize code for data centric applications.

Related

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (possibly also interested in C++ solutions), which states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
For many compilers and pre-processors they may be clever enough to recognise the "+ 2 - 2" is redundant and optimise this away. Similarly they could recognise that the "+= 2*b" followed by the "+= c" can be written using a single FMA. Even if they don't optimise in an FMA, they may switch the order of these operations etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out of order execution, and decide it can do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproduciblilty, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which exactly requires a reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasibly solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chuck of simple and largely vanilla C code as protected and untouchable to any (realistically most) optimisations, while allowing for the rest of the code to be heavily optimised, covering optimisations from both the CPU and compiler.

This is not a complete answer, but it is informative, partially answers, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
t1 = 2*b;
t0 += t1;
a += c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.) There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtract, multiplication, division, and even square root, this does not happen if the excess precision is sufficient greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format, and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.
So, whether this workaround works depends on the platform.
Other Issues
Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes), variations between math library routines. (sin, log, and similar functions return different results on different platforms. Nobody has fully implemented correctly rounded math library routines with known bounded performance.)
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One the objects has a name (a) and one of them does not, but they are like roses: They smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression1, so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different than the non-volatile object has after assignment, so nothing is accomplished.
Footnote
1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.

Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.

"clever enough to recognise the + 2 - 2 is redundant and optimise this
away"
No ! All decent compilers will apply constant propagation and figure out that a is constant and optimize all your statement away, into something equivalent to a = 1;. Here the example with assembly.
Now if you make a volatile, the compiler has to assume that any change of a could have an impact outside the C++ programme. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here the example with assembly.
If you don't want constant propagation to happen, you need to deactivate optimizations. In this case, the best would be to keep your code separate so to compile the rest with all optilizations on.
However this is not ideal. The optimizer could outperform you and with this approach, you'll loose global optimisation across the function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

Performance of duplicate computations

I wonder if it's worth to make computation one time and store the result or it's faster to do twice the computation?
For example in this case:
float n1 = a - b;
float n2 = a + b;
float result = n1 * n2 / (n1 * n2);
Is it better to do:
float result = (a - b) * (a + b) / ((a - b) * (a + b));
?
I know that normally we store the result but I wonder if it's not faster to do the addition instead of calling the memory to store/retrieve the value.

It really depends: For trivial examples like yours, it does not matter. The compiler will generate the same code, since it finds the common sub-expressions and eliminates the duplicated calculations.
For more complicated examples, for example involving function calls, you are better off to use the first variant, to "store" intermediate results. Do not worry about using simple variables for intermediate storage. These are usually all kept in CPU registers, and the compiler is quite good in keeping values in registers.
The danger is that with more complex calculations the compiler may fail to do the common sub-expression elimination. This is for example the case when your code contains function calls which act like a compiler boundary.
Another topic is that with floating point, even simple operations like addition are not associative, i.e. (a+b)+c is different from a+(b+c), due to artifacts in the lowest bits. This often also prevents common subexpression elimination, since the compiler is not allowed to change the semantics of your code.

Dividing the expression into smaller expressions and giving them sensible names gives You several benefits:
It decreases cognitive load.
The longer expression could be now easier to understand and verified correct.
The line of code could be shorter which makes it easier to read and adhere to coding standards.
In C++ a temporary variable could also be marked const, then this also allows the compiler to better optimize the expressions.
But optimizations should be measured before they are discussed and used as arguments. Fast usually comes from the choice of data structures and used algorithms.
In general code should be written to be understood and be correct, and only then should it be optimized.
const float difference = a - b;
const float sum = a + b;
const float result = difference * sum / (difference * sum);

Which approximation algorithm is used for sin() by compilers? [duplicate]

I've been poring through .NET disassemblies and the GCC source code, but can't seem to find anywhere the actual implementation of sin() and other math functions... they always seem to be referencing something else.
Can anyone help me find them? I feel like it's unlikely that ALL hardware that C will run on supports trig functions in hardware, so there must be a software algorithm somewhere, right?
I'm aware of several ways that functions can be calculated, and have written my own routines to compute functions using taylor series for fun. I'm curious about how real, production languages do it, since all of my implementations are always several orders of magnitude slower, even though I think my algorithms are pretty clever (obviously they're not).

In GNU libm, the implementation of sin is system-dependent. Therefore you can find the implementation, for each platform, somewhere in the appropriate subdirectory of sysdeps.
One directory includes an implementation in C, contributed by IBM. Since October 2011, this is the code that actually runs when you call sin() on a typical x86-64 Linux system. It is apparently faster than the fsin assembly instruction. Source code: sysdeps/ieee754/dbl-64/s_sin.c, look for __sin (double x).
This code is very complex. No one software algorithm is as fast as possible and also accurate over the whole range of x values, so the library implements several different algorithms, and its first job is to look at x and decide which algorithm to use.
When x is very very close to 0, sin(x) == x is the right answer.
A bit further out, sin(x) uses the familiar Taylor series. However, this is only accurate near 0, so...
When the angle is more than about 7°, a different algorithm is used, computing Taylor-series approximations for both sin(x) and cos(x), then using values from a precomputed table to refine the approximation.
When |x| > 2, none of the above algorithms would work, so the code starts by computing some value closer to 0 that can be fed to sin or cos instead.
There's yet another branch to deal with x being a NaN or infinity.
This code uses some numerical hacks I've never seen before, though for all I know they might be well-known among floating-point experts. Sometimes a few lines of code would take several paragraphs to explain. For example, these two lines
double t = (x * hpinv + toint);
double xn = t - toint;
are used (sometimes) in reducing x to a value close to 0 that differs from x by a multiple of π/2, specifically xn × π/2. The way this is done without division or branching is rather clever. But there's no comment at all!
Older 32-bit versions of GCC/glibc used the fsin instruction, which is surprisingly inaccurate for some inputs. There's a fascinating blog post illustrating this with just 2 lines of code.
fdlibm's implementation of sin in pure C is much simpler than glibc's and is nicely commented. Source code: fdlibm/s_sin.c and fdlibm/k_sin.c

Functions like sine and cosine are implemented in microcode inside microprocessors. Intel chips, for example, have assembly instructions for these. A C compiler will generate code that calls these assembly instructions. (By contrast, a Java compiler will not. Java evaluates trig functions in software rather than hardware, and so it runs much slower.)
Chips do not use Taylor series to compute trig functions, at least not entirely. First of all they use CORDIC, but they may also use a short Taylor series to polish up the result of CORDIC or for special cases such as computing sine with high relative accuracy for very small angles. For more explanation, see this StackOverflow answer.

OK kiddies, time for the pros....
This is one of my biggest complaints with inexperienced software engineers. They come in calculating transcendental functions from scratch (using Taylor's series) as if nobody had ever done these calculations before in their lives. Not true. This is a well defined problem and has been approached thousands of times by very clever software and hardware engineers and has a well defined solution.
Basically, most of the transcendental functions use Chebyshev Polynomials to calculate them. As to which polynomials are used depends on the circumstances. First, the bible on this matter is a book called "Computer Approximations" by Hart and Cheney. In that book, you can decide if you have a hardware adder, multiplier, divider, etc, and decide which operations are fastest. e.g. If you had a really fast divider, the fastest way to calculate sine might be P1(x)/P2(x) where P1, P2 are Chebyshev polynomials. Without the fast divider, it might be just P(x), where P has much more terms than P1 or P2....so it'd be slower. So, first step is to determine your hardware and what it can do. Then you choose the appropriate combination of Chebyshev polynomials (is usually of the form cos(ax) = aP(x) for cosine for example, again where P is a Chebyshev polynomial). Then you decide what decimal precision you want. e.g. if you want 7 digits precision, you look that up in the appropriate table in the book I mentioned, and it will give you (for precision = 7.33) a number N = 4 and a polynomial number 3502. N is the order of the polynomial (so it's p4.x^4 + p3.x^3 + p2.x^2 + p1.x + p0), because N=4. Then you look up the actual value of the p4,p3,p2,p1,p0 values in the back of the book under 3502 (they'll be in floating point). Then you implement your algorithm in software in the form:
(((p4.x + p3).x + p2).x + p1).x + p0
....and this is how you'd calculate cosine to 7 decimal places on that hardware.
Note that most hardware implementations of transcendental operations in an FPU usually involve some microcode and operations like this (depends on the hardware).
Chebyshev polynomials are used for most transcendentals but not all. e.g. Square root is faster to use a double iteration of Newton raphson method using a lookup table first.
Again, that book "Computer Approximations" will tell you that.
If you plan on implmementing these functions, I'd recommend to anyone that they get a copy of that book. It really is the bible for these kinds of algorithms.
Note that there are bunches of alternative means for calculating these values like cordics, etc, but these tend to be best for specific algorithms where you only need low precision. To guarantee the precision every time, the chebyshev polynomials are the way to go. Like I said, well defined problem. Has been solved for 50 years now.....and thats how it's done.
Now, that being said, there are techniques whereby the Chebyshev polynomials can be used to get a single precision result with a low degree polynomial (like the example for cosine above). Then, there are other techniques to interpolate between values to increase the accuracy without having to go to a much larger polynomial, such as "Gal's Accurate Tables Method". This latter technique is what the post referring to the ACM literature is referring to. But ultimately, the Chebyshev Polynomials are what are used to get 90% of the way there.
Enjoy.

For sin specifically, using Taylor expansion would give you:
sin(x) := x - x^3/3! + x^5/5! - x^7/7! + ... (1)
you would keep adding terms until either the difference between them is lower than an accepted tolerance level or just for a finite amount of steps (faster, but less precise). An example would be something like:
float sin(float x)
{
float res=0, pow=x, fact=1;
for(int i=0; i<5; ++i)
{
res+=pow/fact;
pow*=-1*x*x;
fact*=(2*(i+1))*(2*(i+1)+1);
}
return res;
}
Note: (1) works because of the aproximation sin(x)=x for small angles. For bigger angles you need to calculate more and more terms to get acceptable results.
You can use a while argument and continue for a certain accuracy:
double sin (double x){
int i = 1;
double cur = x;
double acc = 1;
double fact= 1;
double pow = x;
while (fabs(acc) > .00000001 && i < 100){
fact *= ((2*i)*(2*i+1));
pow *= -1 * x*x;
acc = pow / fact;
cur += acc;
i++;
}
return cur;
}

Concerning trigonometric function like sin(), cos(),tan() there has been no mention, after 5 years, of an important aspect of high quality trig functions: Range reduction.
An early step in any of these functions is to reduce the angle, in radians, to a range of a 2*π interval. But π is irrational so simple reductions like x = remainder(x, 2*M_PI) introduce error as M_PI, or machine pi, is an approximation of π. So, how to do x = remainder(x, 2*π)?
Early libraries used extended precision or crafted programming to give quality results but still over a limited range of double. When a large value was requested like sin(pow(2,30)), the results were meaningless or 0.0 and maybe with an error flag set to something like TLOSS total loss of precision or PLOSS partial loss of precision.
Good range reduction of large values to an interval like -π to π is a challenging problem that rivals the challenges of the basic trig function, like sin(), itself.
A good report is Argument reduction for huge arguments: Good to the last bit (1992). It covers the issue well: discusses the need and how things were on various platforms (SPARC, PC, HP, 30+ other) and provides a solution algorithm the gives quality results for all double from -DBL_MAX to DBL_MAX.
If the original arguments are in degrees, yet may be of a large value, use fmod() first for improved precision. A good fmod() will introduce no error and so provide excellent range reduction.
// sin(degrees2radians(x))
sin(degrees2radians(fmod(x, 360.0))); // -360.0 < fmod(x,360) < +360.0
Various trig identities and remquo() offer even more improvement. Sample: sind()

Yes, there are software algorithms for calculating sin too. Basically, calculating these kind of stuff with a digital computer is usually done using numerical methods like approximating the Taylor series representing the function.
Numerical methods can approximate functions to an arbitrary amount of accuracy and since the amount of accuracy you have in a floating number is finite, they suit these tasks pretty well.

Use Taylor series and try to find relation between terms of the series so you don't calculate things again and again
Here is an example for cosinus:
double cosinus(double x, double prec)
{
double t, s ;
int p;
p = 0;
s = 1.0;
t = 1.0;
while(fabs(t/s) > prec)
{
p++;
t = (-t * x * x) / ((2 * p - 1) * (2 * p));
s += t;
}
return s;
}
using this we can get the new term of the sum using the already used one (we avoid the factorial and x2p)

It is a complex question. Intel-like CPU of the x86 family have a hardware implementation of the sin() function, but it is part of the x87 FPU and not used anymore in 64-bit mode (where SSE2 registers are used instead). In that mode, a software implementation is used.
There are several such implementations out there. One is in fdlibm and is used in Java. As far as I know, the glibc implementation contains parts of fdlibm, and other parts contributed by IBM.
Software implementations of transcendental functions such as sin() typically use approximations by polynomials, often obtained from Taylor series.

Chebyshev polynomials, as mentioned in another answer, are the polynomials where the largest difference between the function and the polynomial is as small as possible. That is an excellent start.
In some cases, the maximum error is not what you are interested in, but the maximum relative error. For example for the sine function, the error near x = 0 should be much smaller than for larger values; you want a small relative error. So you would calculate the Chebyshev polynomial for sin x / x, and multiply that polynomial by x.
Next you have to figure out how to evaluate the polynomial. You want to evaluate it in such a way that the intermediate values are small and therefore rounding errors are small. Otherwise the rounding errors might become a lot larger than errors in the polynomial. And with functions like the sine function, if you are careless then it may be possible that the result that you calculate for sin x is greater than the result for sin y even when x < y. So careful choice of the calculation order and calculation of upper bounds for the rounding error are needed.
For example, sin x = x - x^3/6 + x^5 / 120 - x^7 / 5040... If you calculate naively sin x = x * (1 - x^2/6 + x^4/120 - x^6/5040...), then that function in parentheses is decreasing, and it will happen that if y is the next larger number to x, then sometimes sin y will be smaller than sin x. Instead, calculate sin x = x - x^3 * (1/6 - x^2 / 120 + x^4/5040...) where this cannot happen.
When calculating Chebyshev polynomials, you usually need to round the coefficients to double precision, for example. But while a Chebyshev polynomial is optimal, the Chebyshev polynomial with coefficients rounded to double precision is not the optimal polynomial with double precision coefficients!
For example for sin (x), where you need coefficients for x, x^3, x^5, x^7 etc. you do the following: Calculate the best approximation of sin x with a polynomial (ax + bx^3 + cx^5 + dx^7) with higher than double precision, then round a to double precision, giving A. The difference between a and A would be quite large. Now calculate the best approximation of (sin x - Ax) with a polynomial (b x^3 + cx^5 + dx^7). You get different coefficients, because they adapt to the difference between a and A. Round b to double precision B. Then approximate (sin x - Ax - Bx^3) with a polynomial cx^5 + dx^7 and so on. You will get a polynomial that is almost as good as the original Chebyshev polynomial, but much better than Chebyshev rounded to double precision.
Next you should take into account the rounding errors in the choice of polynomial. You found a polynomial with minimum error in the polynomial ignoring rounding error, but you want to optimise polynomial plus rounding error. Once you have the Chebyshev polynomial, you can calculate bounds for the rounding error. Say f (x) is your function, P (x) is the polynomial, and E (x) is the rounding error. You don't want to optimise | f (x) - P (x) |, you want to optimise | f (x) - P (x) +/- E (x) |. You will get a slightly different polynomial that tries to keep the polynomial errors down where the rounding error is large, and relaxes the polynomial errors a bit where the rounding error is small.
All this will get you easily rounding errors of at most 0.55 times the last bit, where +,-,*,/ have rounding errors of at most 0.50 times the last bit.

The actual implementation of library functions is up to the specific compiler and/or library provider. Whether it's done in hardware or software, whether it's a Taylor expansion or not, etc., will vary.
I realize that's absolutely no help.

There's nothing like hitting the source and seeing how someone has actually done it in a library in common use; let's look at one C library implementation in particular. I chose uLibC.
Here's the sin function:
http://git.uclibc.org/uClibc/tree/libm/s_sin.c
which looks like it handles a few special cases, and then carries out some argument reduction to map the input to the range [-pi/4,pi/4], (splitting the argument into two parts, a big part and a tail) before calling
http://git.uclibc.org/uClibc/tree/libm/k_sin.c
which then operates on those two parts.
If there is no tail, an approximate answer is generated using a polynomial of degree 13.
If there is a tail, you get a small corrective addition based on the principle that sin(x+y) = sin(x) + sin'(x')y

They are typically implemented in software and will not use the corresponding hardware (that is, aseembly) calls in most cases. However, as Jason pointed out, these are implementation specific.
Note that these software routines are not part of the compiler sources, but will rather be found in the correspoding library such as the clib, or glibc for the GNU compiler. See http://www.gnu.org/software/libc/manual/html_mono/libc.html#Trig-Functions
If you want greater control, you should carefully evaluate what you need exactly. Some of the typical methods are interpolation of look-up tables, the assembly call (which is often slow), or other approximation schemes such as Newton-Raphson for square roots.

If you want an implementation in software, not hardware, the place to look for a definitive answer to this question is Chapter 5 of Numerical Recipes. My copy is in a box, so I can't give details, but the short version (if I remember this right) is that you take tan(theta/2) as your primitive operation and compute the others from there. The computation is done with a series approximation, but it's something that converges much more quickly than a Taylor series.
Sorry I can't rembember more without getting my hand on the book.

Whenever such a function is evaluated, then at some level there is most likely either:
A table of values which is interpolated (for fast, inaccurate applications - e.g. computer graphics)
The evaluation of a series that converges to the desired value --- probably not a taylor series, more likely something based on a fancy quadrature like Clenshaw-Curtis.
If there is no hardware support then the compiler probably uses the latter method, emitting only assembler code (with no debug symbols), rather than using a c library --- making it tricky for you to track the actual code down in your debugger.

If you want to look at the actual GNU implementation of those functions in C, check out the latest trunk of glibc. See the GNU C Library.

As many people pointed out, it is implementation dependent. But as far as I understand your question, you were interested in a real software implemetnation of math functions, but just didn't manage to find one. If this is the case then here you are:
Download glibc source code from http://ftp.gnu.org/gnu/glibc/
Look at file dosincos.c located in unpacked glibc root\sysdeps\ieee754\dbl-64 folder
Similarly you can find implementations of the rest of the math library, just look for the file with appropriate name
You may also have a look at the files with the .tbl extension, their contents is nothing more than huge tables of precomputed values of different functions in a binary form. That is why the implementation is so fast: instead of computing all the coefficients of whatever series they use they just do a quick lookup, which is much faster. BTW, they do use Tailor series to calculate sine and cosine.
I hope this helps.

I'll try to answer for the case of sin() in a C program, compiled with GCC's C compiler on a current x86 processor (let's say a Intel Core 2 Duo).
In the C language the Standard C Library includes common math functions, not included in the language itself (e.g. pow, sin and cos for power, sine, and cosine respectively). The headers of which are included in math.h.
Now on a GNU/Linux system, these libraries functions are provided by glibc (GNU libc or GNU C Library). But the GCC compiler wants you to link to the math library (libm.so) using the -lm compiler flag to enable usage of these math functions. I'm not sure why it isn't part of the standard C library. These would be a software version of the floating point functions, or "soft-float".
Aside: The reason for having the math functions separate is historic, and was merely intended to reduce the size of executable programs in very old Unix systems, possibly before shared libraries were available, as far as I know.
Now the compiler may optimize the standard C library function sin() (provided by libm.so) to be replaced with an call to a native instruction to your CPU/FPU's built-in sin() function, which exists as an FPU instruction (FSIN for x86/x87) on newer processors like the Core 2 series (this is correct pretty much as far back as the i486DX). This would depend on optimization flags passed to the gcc compiler. If the compiler was told to write code that would execute on any i386 or newer processor, it would not make such an optimization. The -mcpu=486 flag would inform the compiler that it was safe to make such an optimization.
Now if the program executed the software version of the sin() function, it would do so based on a CORDIC (COordinate Rotation DIgital Computer) or BKM algorithm, or more likely a table or power-series calculation which is commonly used now to calculate such transcendental functions. [Src: http://en.wikipedia.org/wiki/Cordic#Application]
Any recent (since 2.9x approx.) version of gcc also offers a built-in version of sin, __builtin_sin() that it will used to replace the standard call to the C library version, as an optimization.
I'm sure that is as clear as mud, but hopefully gives you more information than you were expecting, and lots of jumping off points to learn more yourself.

Don't use Taylor series. Chebyshev polynomials are both faster and more accurate, as pointed out by a couple of people above. Here is an implementation (originally from the ZX Spectrum ROM): https://albertveli.wordpress.com/2015/01/10/zx-sine/

Computing sine/cosine/tangent is actually very easy to do through code using the Taylor series. Writing one yourself takes like 5 seconds.
The whole process can be summed up with this equation here:
Here are some routines I wrote for C:
double _pow(double a, double b) {
double c = 1;
for (int i=0; i<b; i++)
c *= a;
return c;
}
double _fact(double x) {
double ret = 1;
for (int i=1; i<=x; i++)
ret *= i;
return ret;
}
double _sin(double x) {
double y = x;
double s = -1;
for (int i=3; i<=100; i+=2) {
y+=s*(_pow(x,i)/_fact(i));
s *= -1;
}
return y;
}
double _cos(double x) {
double y = 1;
double s = -1;
for (int i=2; i<=100; i+=2) {
y+=s*(_pow(x,i)/_fact(i));
s *= -1;
}
return y;
}
double _tan(double x) {
return (_sin(x)/_cos(x));
}

Improved version of code from Blindy's answer
#define EPSILON .0000000000001
// this is smallest effective threshold, at least on my OS (WSL ubuntu 18)
// possibly because factorial part turns 0 at some point
// and it happens faster then series element turns 0;
// validation was made against sin() from <math.h>
double ft_sin(double x)
{
int k = 2;
double r = x;
double acc = 1;
double den = 1;
double num = x;
// precision drops rapidly when x is not close to 0
// so move x to 0 as close as possible
while (x > PI)
x -= PI;
while (x < -PI)
x += PI;
if (x > PI / 2)
return (ft_sin(PI - x));
if (x < -PI / 2)
return (ft_sin(-PI - x));
// not using fabs for performance reasons
while (acc > EPSILON || acc < -EPSILON)
{
num *= -x * x;
den *= k * (k + 1);
acc = num / den;
r += acc;
k += 2;
}
return (r);
}

The essence of how it does this lies in this excerpt from Applied Numerical Analysis by Gerald Wheatley:
When your software program asks the computer to get a value of
or , have you wondered how it can get the
values if the most powerful functions it can compute are polynomials?
It doesnt look these up in tables and interpolate! Rather, the
computer approximates every function other than polynomials from some
polynomial that is tailored to give the values very accurately.
A few points to mention on the above is that some algorithms do infact interpolate from a table, albeit only for the first few iterations. Also note how it mentions that computers utilise approximating polynomials without specifying which type of approximating polynomial. As others in the thread have pointed out, Chebyshev polynomials are more efficient than Taylor polynomials in this case.

if you want sin then
__asm__ __volatile__("fsin" : "=t"(vsin) : "0"(xrads));
if you want cos then
__asm__ __volatile__("fcos" : "=t"(vcos) : "0"(xrads));
if you want sqrt then
__asm__ __volatile__("fsqrt" : "=t"(vsqrt) : "0"(value));
so why use inaccurate code when the machine instructions will do?

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.

I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.

You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.

If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.

If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).

If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.

There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.

If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.

Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).

I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.

From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.

Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.

Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.

Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?

(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root

Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.

Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.

The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.

I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.

See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.

Why isn't `int pow(int base, int exponent)` in the standard C++ libraries?

I feel like I must just be unable to find it. Is there any reason that the C++ pow function does not implement the "power" function for anything except floats and doubles?
I know the implementation is trivial, I just feel like I'm doing work that should be in a standard library. A robust power function (i.e. handles overflow in some consistent, explicit way) is not fun to write.

As of C++11, special cases were added to the suite of power functions (and others). C++11 [c.math] /11 states, after listing all the float/double/long double overloads (my emphasis, and paraphrased):
Moreover, there shall be additional overloads sufficient to ensure that, if any argument corresponding to a double parameter has type double or an integer type, then all arguments corresponding to double parameters are effectively cast to double.
So, basically, integer parameters will be upgraded to doubles to perform the operation.
Prior to C++11 (which was when your question was asked), no integer overloads existed.
Since I was neither closely associated with the creators of C nor C++ in the days of their creation (though I am rather old), nor part of the ANSI/ISO committees that created the standards, this is necessarily opinion on my part. I'd like to think it's informed opinion but, as my wife will tell you (frequently and without much encouragement needed), I've been wrong before :-)
Supposition, for what it's worth, follows.
I suspect that the reason the original pre-ANSI C didn't have this feature is because it was totally unnecessary. First, there was already a perfectly good way of doing integer powers (with doubles and then simply converting back to an integer, checking for integer overflow and underflow before converting).
Second, another thing you have to remember is that the original intent of C was as a systems programming language, and it's questionable whether floating point is desirable in that arena at all.
Since one of its initial use cases was to code up UNIX, the floating point would have been next to useless. BCPL, on which C was based, also had no use for powers (it didn't have floating point at all, from memory).
As an aside, an integral power operator would probably have been a binary operator rather than a library call. You don't add two integers with x = add (y, z) but with x = y + z - part of the language proper rather than the library.
Third, since the implementation of integral power is relatively trivial, it's almost certain that the developers of the language would better use their time providing more useful stuff (see below comments on opportunity cost).
That's also relevant for the original C++. Since the original implementation was effectively just a translator which produced C code, it carried over many of the attributes of C. Its original intent was C-with-classes, not C-with-classes-plus-a-little-bit-of-extra-math-stuff.
As to why it was never added to the standards before C++11, you have to remember that the standards-setting bodies have specific guidelines to follow. For example, ANSI C was specifically tasked to codify existing practice, not to create a new language. Otherwise, they could have gone crazy and given us Ada :-)
Later iterations of that standard also have specific guidelines and can be found in the rationale documents (rationale as to why the committee made certain decisions, not rationale for the language itself).
For example the C99 rationale document specifically carries forward two of the C89 guiding principles which limit what can be added:
Keep the language small and simple.
Provide only one way to do an operation.
Guidelines (not necessarily those specific ones) are laid down for the individual working groups and hence limit the C++ committees (and all other ISO groups) as well.
In addition, the standards-setting bodies realise that there is an opportunity cost (an economic term meaning what you have to forego for a decision made) to every decision they make. For example, the opportunity cost of buying that $10,000 uber-gaming machine is cordial relations (or probably all relations) with your other half for about six months.
Eric Gunnerson explains this well with his -100 points explanation as to why things aren't always added to Microsoft products- basically a feature starts 100 points in the hole so it has to add quite a bit of value to be even considered.
In other words, would you rather have a integral power operator (which, honestly, any half-decent coder could whip up in ten minutes) or multi-threading added to the standard? For myself, I'd prefer to have the latter and not have to muck about with the differing implementations under UNIX and Windows.
I would like to also see thousands and thousands of collection the standard library (hashes, btrees, red-black trees, dictionary, arbitrary maps and so forth) as well but, as the rationale states:
A standard is a treaty between implementer and programmer.
And the number of implementers on the standards bodies far outweigh the number of programmers (or at least those programmers that don't understand opportunity cost). If all that stuff was added, the next standard C++ would be C++215x and would probably be fully implemented by compiler developers three hundred years after that.
Anyway, that's my (rather voluminous) thoughts on the matter. If only votes were handed out based on quantity rather than quality, I'd soon blow everyone else out of the water. Thanks for listening :-)

For any fixed-width integral type, nearly all of the possible input pairs overflow the type, anyway. What's the use of standardizing a function that doesn't give a useful result for vast majority of its possible inputs?
You pretty much need to have an big integer type in order to make the function useful, and most big integer libraries provide the function.
Edit: In a comment on the question, static_rtti writes "Most inputs cause it to overflow? The same is true for exp and double pow, I don't see anyone complaining." This is incorrect.
Let's leave aside exp, because that's beside the point (though it would actually make my case stronger), and focus on double pow(double x, double y). For what portion of (x,y) pairs does this function do something useful (i.e., not simply overflow or underflow)?
I'm actually going to focus only on a small portion of the input pairs for which pow makes sense, because that will be sufficient to prove my point: if x is positive and |y| <= 1, then pow does not overflow or underflow. This comprises nearly one-quarter of all floating-point pairs (exactly half of non-NaN floating-point numbers are positive, and just less than half of non-NaN floating-point numbers have magnitude less than 1). Obviously, there are a lot of other input pairs for which pow produces useful results, but we've ascertained that it's at least one-quarter of all inputs.
Now let's look at a fixed-width (i.e. non-bignum) integer power function. For what portion inputs does it not simply overflow? To maximize the number of meaningful input pairs, the base should be signed and the exponent unsigned. Suppose that the base and exponent are both n bits wide. We can easily get a bound on the portion of inputs that are meaningful:
If the exponent 0 or 1, then any base is meaningful.
If the exponent is 2 or greater, then no base larger than 2^(n/2) produces a meaningful result.
Thus, of the 2^(2n) input pairs, less than 2^(n+1) + 2^(3n/2) produce meaningful results. If we look at what is likely the most common usage, 32-bit integers, this means that something on the order of 1/1000th of one percent of input pairs do not simply overflow.

Because there's no way to represent all integer powers in an int anyways:
>>> print 2**-4
0.0625

That's actually an interesting question. One argument I haven't found in the discussion is the simple lack of obvious return values for the arguments. Let's count the ways the hypthetical int pow_int(int, int) function could fail.
Overflow
Result undefined pow_int(0,0)
Result can't be represented pow_int(2,-1)
The function has at least 2 failure modes. Integers can't represent these values, the behaviour of the function in these cases would need to be defined by the standard - and programmers would need to be aware of how exactly the function handles these cases.
Overall leaving the function out seems like the only sensible option. The programmer can use the floating point version with all the error reporting available instead.

Short answer:
A specialisation of pow(x, n) to where n is a natural number is often useful for time performance. But the standard library's generic pow() still works pretty (surprisingly!) well for this purpose and it is absolutely critical to include as little as possible in the standard C library so it can be made as portable and as easy to implement as possible. On the other hand, that doesn't stop it at all from being in the C++ standard library or the STL, which I'm pretty sure nobody is planning on using in some kind of embedded platform.
Now, for the long answer.
pow(x, n) can be made much faster in many cases by specialising n to a natural number. I have had to use my own implementation of this function for almost every program I write (but I write a lot of mathematical programs in C). The specialised operation can be done in O(log(n)) time, but when n is small, a simpler linear version can be faster. Here are implementations of both:
// Computes x^n, where n is a natural number.
double pown(double x, unsigned n)
{
double y = 1;
// n = 2*d + r. x^n = (x^2)^d * x^r.
unsigned d = n >> 1;
unsigned r = n & 1;
double x_2_d = d == 0? 1 : pown(x*x, d);
double x_r = r == 0? 1 : x;
return x_2_d*x_r;
}
// The linear implementation.
double pown_l(double x, unsigned n)
{
double y = 1;
for (unsigned i = 0; i < n; i++)
y *= x;
return y;
}
(I left x and the return value as doubles because the result of pow(double x, unsigned n) will fit in a double about as often as pow(double, double) will.)
(Yes, pown is recursive, but breaking the stack is absolutely impossible since the maximum stack size will roughly equal log_2(n) and n is an integer. If n is a 64-bit integer, that gives you a maximum stack size of about 64. No hardware has such extreme memory limitations, except for some dodgy PICs with hardware stacks that only go 3 to 8 function calls deep.)
As for performance, you'll be surprised by what a garden variety pow(double, double) is capable of. I tested a hundred million iterations on my 5-year-old IBM Thinkpad with x equal to the iteration number and n equal to 10. In this scenario, pown_l won. glibc pow() took 12.0 user seconds, pown took 7.4 user seconds, and pown_l took only 6.5 user seconds. So that's not too surprising. We were more or less expecting this.
Then, I let x be constant (I set it to 2.5), and I looped n from 0 to 19 a hundred million times. This time, quite unexpectedly, glibc pow won, and by a landslide! It took only 2.0 user seconds. My pown took 9.6 seconds, and pown_l took 12.2 seconds. What happened here? I did another test to find out.
I did the same thing as above only with x equal to a million. This time, pown won at 9.6s. pown_l took 12.2s and glibc pow took 16.3s. Now, it's clear! glibc pow performs better than the three when x is low, but worst when x is high. When x is high, pown_l performs best when n is low, and pown performs best when x is high.
So here are three different algorithms, each capable of performing better than the others under the right circumstances. So, ultimately, which to use most likely depends on how you're planning on using pow, but using the right version is worth it, and having all of the versions is nice. In fact, you could even automate the choice of algorithm with a function like this:
double pown_auto(double x, unsigned n, double x_expected, unsigned n_expected) {
if (x_expected < x_threshold)
return pow(x, n);
if (n_expected < n_threshold)
return pown_l(x, n);
return pown(x, n);
}
As long as x_expected and n_expected are constants decided at compile time, along with possibly some other caveats, an optimising compiler worth its salt will automatically remove the entire pown_auto function call and replace it with the appropriate choice of the three algorithms. (Now, if you are actually going to attempt to use this, you'll probably have to toy with it a little, because I didn't exactly try compiling what I'd written above. ;))
On the other hand, glibc pow does work and glibc is big enough already. The C standard is supposed to be portable, including to various embedded devices (in fact embedded developers everywhere generally agree that glibc is already too big for them), and it can't be portable if for every simple math function it needs to include every alternative algorithm that might be of use. So, that's why it isn't in the C standard.
footnote: In the time performance testing, I gave my functions relatively generous optimisation flags (-s -O2) that are likely to be comparable to, if not worse than, what was likely used to compile glibc on my system (archlinux), so the results are probably fair. For a more rigorous test, I'd have to compile glibc myself and I reeeally don't feel like doing that. I used to use Gentoo, so I remember how long it takes, even when the task is automated. The results are conclusive (or rather inconclusive) enough for me. You're of course welcome to do this yourself.
Bonus round: A specialisation of pow(x, n) to all integers is instrumental if an exact integer output is required, which does happen. Consider allocating memory for an N-dimensional array with p^N elements. Getting p^N off even by one will result in a possibly randomly occurring segfault.

One reason for C++ to not have additional overloads is to be compatible with C.
C++98 has functions like double pow(double, int), but these have been removed in C++11 with the argument that C99 didn't include them.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3286.html#550
Getting a slightly more accurate result also means getting a slightly different result.

The World is constantly evolving and so are the programming languages. The fourth part of the C decimal TR¹ adds some more functions to <math.h>. Two families of these functions may be of interest for this question:
The pown functions, that takes a floating point number and an intmax_t exponent.
The powr functions, that takes two floating points numbers (x and y) and compute x to the power y with the formula exp(y*log(x)).
It seems that the standard guys eventually deemed these features useful enough to be integrated in the standard library. However, the rational is that these functions are recommended by the ISO/IEC/IEEE 60559:2011 standard for binary and decimal floating point numbers. I can't say for sure what "standard" was followed at the time of C89, but the future evolutions of <math.h> will probably be heavily influenced by the future evolutions of the ISO/IEC/IEEE 60559 standard.
Note that the fourth part of the decimal TR won't be included in C2x (the next major C revision), and will probably be included later as an optional feature. There hasn't been any intent I know of to include this part of the TR in a future C++ revision.
¹ You can find some work-in-progress documentation here.

Here's a really simple O(log(n)) implementation of pow() that works for any numeric types, including integers:
template<typename T>
static constexpr inline T pown(T x, unsigned p) {
T result = 1;
while (p) {
if (p & 0x1) {
result *= x;
}
x *= x;
p >>= 1;
}
return result;
}
It's better than enigmaticPhysicist's O(log(n)) implementation because it doesn't use recursion.
It's also almost always faster than his linear implementation (as long as p > ~3) because:
it doesn't require any extra memory
it only does ~1.5x more operations per loop
it only does ~1.25x more memory updates per loop

Perhaps because the processor's ALU didn't implement such a function for integers, but there is such an FPU instruction (as Stephen points out, it's actually a pair). So it was actually faster to cast to double, call pow with doubles, then test for overflow and cast back, than to implement it using integer arithmetic.
(for one thing, logarithms reduce powers to multiplication, but logarithms of integers lose a lot of accuracy for most inputs)
Stephen is right that on modern processors this is no longer true, but the C standard when the math functions were selected (C++ just used the C functions) is now what, 20 years old?

As a matter of fact, it does.
Since C++11 there is a templated implementation of pow(int, int) --- and even more general cases, see (7) in
http://en.cppreference.com/w/cpp/numeric/math/pow
EDIT: purists may argue this is not correct, as there is actually "promoted" typing used. One way or another, one gets a correct int result, or an error, on int parameters.

A very simple reason:
5^-2 = 1/25
Everything in the STL library is based on the most accurate, robust stuff imaginable. Sure, the int would return to a zero (from 1/25) but this would be an inaccurate answer.
I agree, it's weird in some cases.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js