To infinity and back - C++

There are mathematical operations that yield real numbers from +/- infinity; for example, exp(-infinity) = 0. Is there a standard for which mathematical functions in the standard C library accept IEEE-754 infinities (without throwing or returning NaN)? I am on a Linux system and would be interested in such a list for glibc, but I could not find one in the online manual. For instance, the documentation for exp does not mention how it handles the -infinity case. Any help will be much appreciated.

The See Also section of POSIX's math.h definition links to the POSIX definitions of acceptable domains.
E.g. fabs():
If x is ±0, +0 shall be returned.
If x is ±Inf, +Inf shall be returned.
I converted the mentioned See Also section to Stack Overflow Markdown:
acos(),
acosh(),
asin(),
atan(),
atan2(),
cbrt(),
ceil(),
cos(),
cosh(),
erf(),
exp(),
expm1(),
fabs(),
floor(),
fmod(),
frexp(),
hypot(),
ilogb(),
isnan(),
j0(),
ldexp(),
lgamma(),
log(),
log10(),
log1p(),
logb(),
modf(),
nextafter(),
pow(),
remainder(),
rint(),
scalb(),
sin(),
sinh(),
sqrt(),
tan(),
tanh(),
y0(),
I contributed search/replace/regex-fu. We now just need someone with cURL-fu.

In C99 it's in Annex F:
F.9.3.1 The exp functions
-- exp(±0) returns 1.
-- exp(-∞) returns +0.
-- exp(+∞) returns +∞.
Annex F is normative, and:
An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex.
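As a quick check, here is a minimal C++ sketch of the behaviour that Annex F conformance implies (assuming your implementation's double is IEC 60559, which std::numeric_limits<double>::is_iec559 reports):

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // Only an IEC 60559 (IEEE-754) conforming implementation guarantees the
    // special-case results shown in the comments below.
    std::printf("is_iec559: %d\n", std::numeric_limits<double>::is_iec559);

    const double inf = std::numeric_limits<double>::infinity();
    std::printf("exp(-inf) = %g\n", std::exp(-inf)); // expected: 0
    std::printf("exp(+inf) = %g\n", std::exp(inf));  // expected: inf
    std::printf("exp(0)    = %g\n", std::exp(0.0));  // expected: 1
}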

Is floating point expression contraction allowed in C++?

Floating point expressions can sometimes be contracted on the processing hardware, e.g. using fused multiply-and-add as a single hardware operation.
Apparently, using these isn't merely an implementation detail but is governed by the programming language specification. Specifically, the C89 standard does not allow such contractions, while C99 allows them unless they are disabled with the FP_CONTRACT pragma. See details in this SO answer.
But what about C++? Are floating-point contractions not allowed? Allowed in some standards? Allowed universally?
Summary
Contractions are permitted, but a facility is provided for the user to disable them. Unclear language in the standard clouds the issue of whether disabling them will provide desired results.
I investigated this in the official C++ 2003 standard and the 2017 n4659 draft. C++ citations are from 2003 unless otherwise indicated.
Extra Precision and Range
The text “contract” does not appear in either document. However, clause 5 Expressions [expr] paragraph 10 (same text in 2017’s 8 [expr] 13) says:
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
I would prefer this statement explicitly stated whether this extra precision and range could be used freely (the implementation may use it in some expressions, including subexpressions, while not using it in others) or had to be used uniformly (if the implementation uses extra precision, it must use it in every floating-point expression) or according to some other rules (such as it may use one precision for float, another for double).
If we interpret it permissively, it means that, in a*b+c, a*b could be evaluated with infinite precision and range, and then the addition could be evaluated with whatever precision and range is normal for the implementation. This is mathematically equivalent to contraction, as it has the same result as evaluating a*b+c with a fused multiply-add instruction.
Hence, with this interpretation, implementations may contract expressions.
Contractions Inherited From C
17.4.1.2 [lib.headers] 3 (similar text in 2017’s 20.5.1.2 [headers] 3) says:
The facilities of the Standard C Library are provided in 18 additional headers, as shown in Table 12…
Table 12 includes <cmath>, and paragraph 4 indicates this corresponds to math.h. Technically, the C++ 2003 standard refers to the C 1990 standard, but I do not have it in electronic form and do not know where my paper copy is, so I will use the C 2011 standard (but unofficial draft N1570), which the C++ 2017 draft refers to.
The C standard defines, in <math.h>, a pragma FP_CONTRACT:
#pragma STDC FP_CONTRACT on-off-switch
where on-off-switch is on to allow contraction of expressions or off to disallow them. It also says the default state for the pragma is implementation-defined.
The C++ standard does not define “facility” or “facilities.” A dictionary definition of “facility” is “a place, amenity, or piece of equipment provided for a particular purpose” (New Oxford American Dictionary, Apple Dictionary application version 2.2.2 (203)). An amenity is “a desirable or useful feature or facility of a building or place.” A pragma is a useful feature provided for a particular purpose, so it seems to be a facility, so it is included in <cmath>.
Hence, using this pragma should permit or disallow contractions.
Conclusions
Contractions are permitted when FP_CONTRACT is on, and it may be on by default.
The text of 8 [expr] 13 can be interpreted to effectively allow contractions even if FP_CONTRACT is off but is insufficiently clear for definitive interpretation.
Yes, it is allowed.
For example, in the Visual Studio compiler, fp_contract is on by default. This tells the compiler to use floating-point contraction instructions where possible. Set fp_contract to off to preserve individual floating-point instructions.
// pragma_directive_fp_contract.cpp
// on x86 and x64 compile with: /O2 /fp:fast /arch:AVX2
// other platforms compile with: /O2
#include <stdio.h>

// remove the following line to enable FP contractions
#pragma fp_contract (off)

int main() {
    double z, b, t;

    for (int i = 0; i < 10; i++) {
        b = i * 5.5;
        t = i * 56.025;

        z = t * i + b;
        printf("out = %.15e\n", z);
    }
}
Detailed information is available in the Microsoft documentation under Specify Floating-Point Behavior.
Using the GNU Compiler Collection (GCC):
The default state for the FP_CONTRACT pragma (C99 and C11 7.12.2).
This pragma is not implemented. Expressions are currently only contracted if -ffp-contract=fast, -funsafe-math-optimizations or -ffast-math are used.
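To make the numerical effect concrete, here is a small hedged C++ sketch: std::fma performs the multiply-add with a single rounding, which is what a contracted a*b + c amounts to, while the separately rounded expression can differ in the last bit. Whether the plain expression is actually contracted depends on your compiler and flags (e.g. -ffp-contract, /fp).

#include <cmath>
#include <cstdio>

int main() {
    // a*b = 1 - 2^-54 is not exactly representable; rounded separately it
    // becomes 1.0, so a*b + c gives 0, while a fused multiply-add keeps the
    // -2^-54 term.
    double a = 1.0 + std::ldexp(1.0, -27);
    double b = 1.0 - std::ldexp(1.0, -27);
    double c = -1.0;

    double separate   = a * b + c;         // may or may not be contracted
    double contracted = std::fma(a, b, c); // always a single rounding

    std::printf("a*b + c      = %.17e\n", separate);   // 0 if not contracted
    std::printf("fma(a, b, c) = %.17e\n", contracted); // about -5.55e-17
}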

Error: Old-style type declaration REAL*16 not supported

I was given some legacy code to compile. Unfortunately I only have access to an f95 compiler and have zero knowledge of Fortran. Some modules compiled, but with others I was getting this error:
Error: Old-style type declaration REAL*16 not supported at (1)
My plan is to at least try to fix this error and see what else happens. So here are my 2 questions.
How likely will it be that my code written for Fortran 75 is compatible in the Fortran 95 compiler? (In /usr/bin my compiler is f95 - so I assume it is Fortran 95)
How do I fix this error that I am getting? I tried googling it but cannot seem to find a clear, crisp answer.
The error you are seeing is due to an old declaration style that was frequent before Fortran 90, but never became standard. Thus, the compiler does not accept the (formally incorrect) code.
In the olden days before Fortran 90, you had only two types of real numbers: REAL and DOUBLE PRECISION. These types were platform dependent, but most compilers nowadays map them to the IEEE754 formats binary32 and binary64.
However, some machines offered different formats, often with extra precision. In order to make them accessible to Fortran code, the type REAL*n was invented, with n an integer literal from a set of compiler-dependent values. This syntax was never standard, so you cannot be sure of what it will mean to a given compiler without reading its documentation.
In the real world, most compilers that have not been asked to be strictly standards-compliant (with some option like -std=f95) will recognize at least REAL*4 and REAL*8, mapping them to the binary32/64 formats mentioned before, but everything else is completely platform dependent. Your compiler may have a REAL*10 type for the 80-bit arithmetic used by the x86 387 FPU, or a REAL*16 type for some 128-bit floating point math. However, it is important to stress that since the syntax is not standard, the meaning of that type could change from compiler to compiler.
Finally, in Fortran 90 a way to refer to different kinds of real and integer types was made standard. The new syntax is REAL(n) or REAL(kind=n) and is supported by all standard-compliant compilers. The values of n are still compiler-dependent, but the standard provides three ways to obtain specific, repeatable results:
The SELECTED_REAL_KIND function, which allows you to query the system for the value of n to specify if you want a real type with certain precision and range requirements. Usually, what you do is ask for it once and store the result in an INTEGER, PARAMETER variable that you use when declaring the real variables in question. For example, you would declare a type with at least 15 digits of precision (decimal) and exponent range of at least 100 like this:
INTEGER, PARAMETER :: rk = SELECTED_REAL_KIND(15, 100)
REAL(rk) :: v
In Fortran 2003 and onwards, the ISO_C_BINDING module contains a series of constants that are meant to give you types guaranteed to be equivalent to the C types of the same compiler family (e.g. gcc for gfortran, icc for ifort, etc.). They are called C_FLOAT, C_DOUBLE and C_LONG_DOUBLE. Thus, you could declare a variable equivalent to a C double as REAL(C_DOUBLE) :: d.
In Fortran 2008 and onwards, the ISO_FORTRAN_ENV module contains a different series of constants REAL32, REAL64 and REAL128 that will give you a floating point type of the appropriate width - if some platform does not support one of those types, the constant will be a negative number. Thus, you can declare a 128-bit float as REAL(real128) :: q.
Javier has given an excellent answer to your immediate problem. However I'd just briefly like to address "How likely will it be that my code written for Fortran 77 is compatible in the Fortran 95 compiler?", with the obvious typo corrected.
Fortran, if the programmer adheres to the standard, is amazingly backward compatible. With very, very few exceptions, standard-conforming Fortran 77 is Fortran 2008 conforming. The problem is that it appears in your case the original programmer did not adhere to the international standard: real*8 and similar is not, and has never been, part of any such standard, and the problems you are seeing are precisely why such forms should never be, and should never have been, used. That said, if the original programmer only made this one mistake, it may well be that the rest of the code will be OK; however, without the details it is impossible to tell.
TL;DR: International standards are important, stick to them!
Since we are guessing instead of requesting proper code and full details, I will venture to say that the other two answers are not correct.
The error message
Error: Old-style type declaration REAL*16 not supported at (1)
DOES NOT mean that the REAL*n syntax is not supported.
The error message is misleading. It actually means that 16-byte reals are not supported. Had the OP requested the same real by the kind notation (in any of the many ways which return gfortran's kind 16), the error message would be:
Error: Kind 16 not supported for type REAL at (1)
That can happen in some gfortran versions. Especially in MS Windows.
This explanation can be found with just a very quick google search for the error message: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56850#c1
It does not complain that the old syntax is the problem, it just mentions it, because mentioning kinds might be even more confusing (especially for COMPLEX*32, which is kind 16).
The main message is: we really should be closing these kinds of questions and waiting for proper details instead of guessing and upvoting a question where the OP can't even copy the full message the compiler has emitted.
If you are using GNU gfortran, try real(kind=10), which will give you 10-byte (80-bit) extended precision. I'm converting some older F77 code to run floating point math tests - it has a quad precision (the real*16 you mention) defined but not used, and as the other answers correctly point out, extended precision formats tend to be machine/compiler specific. The gfortran 5.2 I am using does not support real(kind=16), surprisingly. But gfortran (available for 64-bit macOS and 64-bit Linux machines) does have real(kind=10), which will give you precision beyond the typical real*8, or "double precision" as it was called in some Fortran compilers.
Be careful if your old Fortran code is calling C programs and/or making assumptions about how precision is handled and how floating point numbers are represented. You may have to dig deep into exactly what is happening and check the code carefully to ensure things are working as expected, especially if Fortran and C routines call each other. Here is the URL for the GNU gfortran info on quad precision: https://people.sc.fsu.edu/~jburkardt/f77_src/gfortran_quadmath/gfortran_quadmath.html
Hope this helps.

Is there a way to prevent developers from using std::min, std::max?

We have an algorithm library doing lots of std::min/std::max operations on numbers that could be NaN. Considering this post: Why does Release/Debug have a different result for std::min?, we realised it's clearly unsafe.
Is there a way to prevent developers from using std::min/std::max?
Our code is compiled both with VS2015 and g++. We have a common header file included by all our source files (through /FI option for VS2015 and -include for g++). Is there any piece of code/pragma that could be put here to make any cpp file using std::min or std::max fail to compile?
By the way, legacy code like STL headers using this function should not be impacted. Only the code we write should be impacted.
I don't think making standard library functions unavailable is the correct approach. First off, NaNs are a fundamental aspect of how floating-point values work. You'd need to disable all kinds of other things, e.g., sort(), lower_bound(), etc. Also, programmers are paid to be creative, and I doubt that any programmer reaching for std::max() would hesitate to use a < b ? b : a if std::max(a, b) doesn't work.
Also, you clearly don't want to disable std::max() or std::min() for types which don't have NaN, e.g., integers or strings. So, you'd need a somewhat controlled approach.
There is no portable way to disable any of the standard library algorithms in namespace std. You could hack it by providing suitable deleted overloads to locate uses of these algorithms, e.g.:
namespace std {
    float max(float, float) = delete;                   // **NOT** portable
    double max(double, double) = delete;                // **NOT** portable
    long double max(long double, long double) = delete; // **NOT** portable
    // likewise and also not portable for min
}
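With these declarations included everywhere (e.g. via the forced-include header mentioned in the question), a call such as std::max(a, b) on doubles in your own code selects the deleted non-template overload and fails to compile, while std::max on integers or strings still resolves to the template and keeps working.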
I'm going to get a bit philosophical here, with less code. I think the best approach would be to educate those developers and explain why they shouldn't code in a specific way. If you can give them a good explanation, then not only will they stop using the functions you don't want them to use, they will also be able to spread the message to other developers in the team.
I believe that forcing them will just make them come up with workarounds.
As modifying std is disallowed, the following is UB, but it may work in your case.
Marking the functions as deprecated:
Since C++14, with the [[deprecated]] attribute:
namespace std
{
    template <typename T>
    [[deprecated("avoid std::min: unsafe with NaN")]] constexpr const T& min(const T&, const T&);

    template <typename T>
    [[deprecated("avoid std::max: unsafe with NaN")]] constexpr const T& max(const T&, const T&);
}
And before C++14:
#ifdef __GNUC__
#  define DEPRECATED(func) func __attribute__ ((deprecated))
#elif defined(_MSC_VER)
#  define DEPRECATED(func) __declspec(deprecated) func
#else
#  pragma message("WARNING: You need to implement DEPRECATED for this compiler")
#  define DEPRECATED(func) func
#endif

namespace std
{
    template <typename T> constexpr const T& DEPRECATED(min(const T&, const T&));
    template <typename T> constexpr const T& DEPRECATED(max(const T&, const T&));
}
There's no portable way of doing this since, aside from a couple of exceptions, you are not allowed to change anything in std.
However, one solution is to
#define max foo
before including any of your code. Then both std::max and max will issue compile-time failures.
But really, if I were you, I'd just get used to the behaviour of std::max and std::min on your platform. If they don't do what the standard says they ought to do, then submit a bug report to the compiler vendor.
If you get different results in debug and release, then the problem isn't getting different results. The problem is that one version, or probably both, are wrong. And that isn't fixed by disallowing std::min or std::max or replacing them with different functions that have defined results. You have to figure out which outcome you would actually want for each function call to get the correct result.
I'm not going to answer your question exactly, but instead of disallowing std::min and std::max altogether, you could educate your coworkers and make sure that you are consistently using a total order comparator instead of a raw operator< (implicitly used by many standard library algorithms) whenever you use a function that relies on a given order.
Such a comparator is proposed for standardization in P0100 — Comparison in C++ (as well as partial and weak order comparators), probably targeting C++20. Meanwhile, the C standards committee has been working for quite a while on TS 18661 — Floating-point extensions for C, part 1: Binary floating-point arithmetic, apparently targeting the future C2x (should be ~C23), which updates the <math.h> header with many new functions required to implement the recent ISO/IEC/IEEE 60559:2011 standard. Among the new functions, there is totalorder (section 14.8), which compares floating point numbers according to the IEEE totalOrder:
totalOrder(x, y) imposes a total ordering on canonical members of the format of x and y:
If x < y, totalOrder(x, y) is true.
If x > y, totalOrder(x, y) is false.
If x = y
totalOrder(-0, +0) is true.
totalOrder(+0, -0) is false.
If x and y represent the same floating point datum:
If x and y have negative sign, totalOrder(x, y) is true if and only if the exponent of x ≥ the exponent of y.
Otherwise totalOrder(x, y) is true if and only if the exponent of x ≤ the exponent of y.
If x and y are unordered numerically because x or y is NaN:
totalOrder(−NaN, y) is true where −NaN represents a NaN with negative sign bit and y is a floating-point number.
totalOrder(x, +NaN) is true where +NaN represents a NaN with positive sign bit and x is a floating-point number.
If x and y are both NaNs, then totalOrder reflects a total ordering based on:
negative sign orders below positive sign
signaling orders below quiet for +NaN, reverse for −NaN
lesser payload, when regarded as an integer, orders below greater payload for +NaN, reverse for −NaN.
That's quite a wall of text, so here is a list that helps to see what's greater than what (from greater to lesser):
positive quiet NaNs (ordered by payload regarded as integer)
positive signaling NaNs (ordered by payload regarded as integer)
positive infinity
positive reals
positive zero
negative zero
negative reals
negative infinity
negative signaling NaNs (ordered by payload regarded as integer)
negative quiet NaNs (ordered by payload regarded as integer)
Unfortunately, this total order currently lacks library support, but it is probably possible to hack together a custom total order comparator for floating point numbers and use it whenever you know there will be floating point numbers to compare. Once you get your hands on such a total order comparator, you can safely use it everywhere it is needed instead of simply disallowing std::min and std::max.
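For illustration, here is a hedged sketch of such a hand-rolled comparator (the name total_less is made up here, and it assumes double is IEEE-754 binary64); it maps the bit pattern to an unsigned key whose natural ordering matches the list above:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iterator>
#include <limits>

// totalOrder-style "less than" for double, assuming IEEE-754 binary64.
// Resulting order: -NaN < -inf < negative reals < -0 < +0 < positive reals < +inf < +NaN.
bool total_less(double a, double b) {
    std::uint64_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    auto key = [](std::uint64_t u) {
        // Sign bit set: flip all bits so larger magnitudes sort lower.
        // Sign bit clear: set the sign bit so positives sort above all negatives.
        return (u & 0x8000000000000000ULL) ? ~u : (u | 0x8000000000000000ULL);
    };
    return key(ua) < key(ub);
}

int main() {
    double values[] = {1.5, -2.5, 0.0, -0.0,
                       std::numeric_limits<double>::quiet_NaN(),
                       -std::numeric_limits<double>::infinity()};
    // Unlike operator<, this comparator is a strict total order even with NaN present.
    double largest = *std::max_element(std::begin(values), std::end(values), total_less);
    std::printf("largest in total order: %g\n", largest); // prints the NaN
}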
If you compile using GCC or Clang, you can poison these identifiers.
#pragma GCC poison min max atoi /* etc ... */
Using them will issue a compiler error:
error: attempt to use poisoned "min"
The only problem with this in C++ is that you can only poison "identifier tokens", not std::min and std::max specifically, so it actually also poisons all functions and local variables named min and max... maybe not quite what you want, but maybe not a problem if you choose Good Descriptive Variable Names™.
If a poisoned identifier appears as part of the expansion of a macro which was defined before the identifier was poisoned, it will not cause an error. This lets you poison an identifier without worrying about system headers defining macros that use it.
For example,
#define strrchr rindex
#pragma GCC poison rindex
strrchr(some_string, 'h');
will not produce an error.
Read the link for more info, of course.
https://gcc.gnu.org/onlinedocs/gcc-3.3/cpp/Pragmas.html
You've deprecated std::min and std::max. You can find instances by doing a search with grep. Or you can fiddle about with the headers themselves to break std::min and std::max. Or you can try defining min/max or std::min/std::max as preprocessor macros. The latter is a bit dodgy because of C++ namespaces: if you target std::max/min you won't catch code that uses them via using namespace std, and if you define min/max you also pick up other uses of those identifiers.
Or if the project has a standard header like "mylibrary.lib" that everyone includes, break std::min / max in that.
The functions should return NaN when passed NaN, of course. But written in the natural way, the comparison with NaN always evaluates to false, so the NaN can be silently dropped.
IMO the failure of the C++ language standard to require min(NaN, x) and min(x, NaN) to return NaN and similarly for max is a serious flaw in the C++ language standards, because it hides the fact that a NaN has been generated and results in surprising behaviour. Very few software developers do sufficient static analysis to ensure that NaNs can never be generated for all possible input values. So we declare our own templates for min and max, with specialisations for float and double to give correct behaviour with NaN arguments. This works for us, but might not work for those who use larger parts of the STL than we do. Our field is high integrity software, so we don't use much of the STL because dynamic memory allocation is usually banned after the startup phase.
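As a rough, hypothetical sketch of that idea (not the poster's actual code), the replacement might look like this, with a double overload that checks for NaN before falling back to the usual comparison:

#include <cmath>
#include <cstdio>

// Generic fallback: behaves like std::min for types without NaN.
template <typename T>
T safe_min(T a, T b) { return b < a ? b : a; }

// double overload: propagate NaN instead of silently dropping it.
inline double safe_min(double a, double b) {
    if (std::isnan(a)) return a;
    if (std::isnan(b)) return b;
    return b < a ? b : a;
}

int main() {
    std::printf("%g\n", safe_min(1.0, 2.0));          // 1
    std::printf("%g\n", safe_min(std::nan(""), 2.0)); // nan
    std::printf("%g\n", safe_min(2.0, std::nan(""))); // nan
}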

C++ handling of excess precision

I'm currently looking at code which does multi-precision floating-point arithmetic. To work correctly, that code requires values to be reduced to their final precision at well-defined points. So even if an intermediate result was computed to an 80 bit extended precision floating point register, at some point it has to be rounded to 64 bit double for subsequent operations.
The code uses a macro INEXACT to describe this requirement, but doesn't have a perfect definition. The gcc manual mentions -fexcess-precision=standard as a way to force well-defined precision for cast and assignment operations. However, it also writes:
‘-fexcess-precision=standard’ is not implemented for languages other than C
Now I'm thinking about porting those ideas to C++ (comments welcome if anyone knows an existing implementation). So it seems I can't use that switch for C++. But what is the g++ default behavior in absence of any switch? Are there more C++-like ways to control the handling of excess precision?
I guess that for my current use case, I'll probably use -mfpmath=sse in any case, which should not incur any excess precision as far as I know. But I'm still curious.
Are there more C++-like ways to control the handling of excess precision?
The C99 standard defines FLT_EVAL_METHOD, a compiler-set macro that defines how excess precision should happen in a C program (many C compilers still behave in a way that does not exactly conform to the most reasonable interpretation of the value of FLT_EVAL_METHOD that they define: older GCC versions generating 387 code, Clang when generating 387 code, …). Subtle points in relation with the effects of FLT_EVAL_METHOD were clarified in the C11 standard.
Since the 2011 standard, C++ defers to C99 for the definition of FLT_EVAL_METHOD (header cfloat).
So GCC should simply allow -fexcess-precision=standard for C++, and hopefully it eventually will. The same semantics as that of C are already in the C++ standard, they only need to be implemented in C++ compilers.
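A quick way to see what an implementation claims (assuming it provides the macro as C99/C++11 describe) is to print FLT_EVAL_METHOD from <cfloat>:

#include <cfloat>
#include <cstdio>

int main() {
    // Per C99/C++11: 0 = evaluate each operation in the range and precision of
    // its type, 1 = evaluate float and double as double, 2 = evaluate everything
    // as long double, -1 = indeterminate.
    std::printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    std::printf("mantissa bits: float %d, double %d, long double %d\n",
                FLT_MANT_DIG, DBL_MANT_DIG, LDBL_MANT_DIG);
}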
I guess that for my current use case, I'll probably use -mfpmath=sse in any case, which should not incur any excess precision as far as I know.
That is the usual solution.
Be aware that C99 also defines FP_CONTRACT in math.h that you may want to look at: it relates to the same problem of some expressions being computed at a higher precision, striking from a completely different side (the modern fused-multiply-add instruction instead of the old 387 instruction set). This is a pragma to decide whether the compiler is allowed to replace source-level additions and multiplications with FMA instructions (this has the effect that the multiplication is virtually computed at infinite precision, because this is how this instruction works, instead of being rounded to the precision of the type as it would be with separate multiplication and addition instructions). This pragma has apparently not been incorporated in the C++ standard (as far as I can see).
The default value for this option is implementation-defined and some people argue for the default to be to allow FMA instructions to be generated (for C compilers that otherwise define FLT_EVAL_METHOD as 0).
You should, in C, future-proof your code with:
#include <math.h>
#pragma STDC FP_CONTRACT off
And the equivalent incantation in C++ if your compiler documents one.
what is the g++ default behavior in absence of any switch?
I am afraid that the answer to this question is that GCC's behavior, say, when generating 387 code, is nonsensical. See the description of the situation that motivated Joseph Myers to fix the situation for C. If g++ does not implement -fexcess-precision=standard, it probably means that 80-bit computations are randomly rounded to the precision of the type when the compiler happened to have to spill some floating-point registers to memory, leading the program below to print "foo" in some circumstances outside the programmer's control:
if (x == 0.0) return;
... // code that does not modify x
if (x == 0.0) printf("foo\n");
… because the code in the ellipsis caused x, that was held in an 80-bit floating-point register, to be spilt to a 64-bit slot on the stack.
But what is the g++ default behavior in absence of any switch?
I found one answer myself via an experiment, using the following code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    double a = atof("1.2345678");
    double b = a*a;
    printf("%.20e\n", b - 1.52415765279683990130);
    return 0;
}
If b is rounded (-fexcess-precision=standard), then the result is zero. Otherwise (-fexcess-precision=fast) it is something like 8e-17. Compiling with -mfpmath=387 -O3, I could reproduce both cases for gcc-4.8.2. For g++-4.8.2 I get an error for -fexcess-precision=standard if I try that, and without a flag I get the same behavior as -fexcess-precision=fast gives for C. Adding -std=c++11 does not help. So now the suspicion already voiced by Pascal is official: g++ does not necessarily round everywhere it should.

Definitions of sqrt, sin, cos, pow etc. in cmath

Are there any definitions of functions like sqrt(), sin(), cos(), tan(), log(), exp() (the ones from math.h/cmath) available?
I just wanted to know how they work.
This is an interesting question, but reading sources of efficient libraries won't get you very far unless you happen to know the method used.
Here are some pointers to help you understand the classical methods. My information is by no means accurate. The following methods are only the classical ones, particular implementations can use other methods.
Lookup tables are frequently used
Trigonometric functions are often implemented via the CORDIC algorithm (either on the CPU or with a library). Note that usually sine and cosine are computed together, I always wondered why the standard C library doesn't provide a sincos function.
Square roots use Newton's method with some clever implementation tricks: you may find somewhere on the web an extract from the Quake source code with a mind boggling 1 / sqrt(x) implementation.
Exponential and logarithm functions use exp(2^n x) = exp(x)^(2^n) and log2(2^n x) = n + log2(x) to get an argument close to zero (close to one for log), then use rational function approximation (usually Padé approximants). Note that this exact same trick can get you matrix exponentials and logarithms. According to Stephen Canon, modern implementations favor Taylor expansion over rational function approximation, since the latter requires division, which is much slower than multiplication.
The other functions can be deduced from these ones. Implementations may provide specialized routines.
pow(x, y) = exp(y * log(x)), so pow is not to be used when y is an integer
hypot(x, y) = abs(x) sqrt(1 + (y/x)^2) if abs(x) > abs(y) (hypot(y, x) otherwise) to avoid overflow. atan2 is computed with a call to sincos and a little logic. These functions are the building blocks for complex arithmetic.
For other transcendental functions (gamma, erf, Bessel, ...), please consult the excellent book Numerical Recipes, 3rd edition, for some ideas. The good old Abramowitz & Stegun is also useful. There is a new version at http://dlmf.nist.gov/.
Techniques like Chebyshev approximation, continued fraction expansion (actually related to Padé approximants) or power series economization are used in more complex functions (if you happen to read source code for erf, bessel or gamma for instance). I doubt they have a real use in bare-metal simple math functions, but who knows. Consult Numerical Recipes for an overview.
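To make the square-root remark concrete, here is a minimal, unoptimized sketch of the Newton iteration (real libm implementations add argument reduction, a much better initial guess, and a correctly rounded final step; the name newton_sqrt is made up here):

#include <cstdio>

// Newton's method for x*x = a: x_{n+1} = (x + a/x) / 2. Assumes a > 0.
double newton_sqrt(double a) {
    double x = a < 1.0 ? 1.0 : a;   // crude initial guess
    for (int i = 0; i < 64; ++i) {
        double next = 0.5 * (x + a / x);
        if (next == x) break;       // converged to machine precision
        x = next;
    }
    return x;
}

int main() {
    std::printf("newton_sqrt(2)  = %.17g\n", newton_sqrt(2.0));
    std::printf("newton_sqrt(16) = %.17g\n", newton_sqrt(16.0));
}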
Every implementation may be different, but you can check out one implementation from glibc's (the GNU C library) source code.
edit: Google Code Search has been taken offline, so the old link I had goes nowhere.
The sources for glibc's math library are located here:
http://sourceware.org/git/?p=glibc.git;a=tree;f=math;h=3d5233a292f12cd9e9b9c67c3a114c64564d72ab;hb=HEAD
Have a look at how glibc implements various math functions, full of magic, approximation and assembly.
Definitely take a look at the fdlibm sources. They're nice because the fdlibm library is self-contained, each function is well-documented with detailed explanations of the mathematics involved, and the code is immensely clear to read.
Having looked a lot at math code, I would advise against looking at glibc - the code is often quite difficult to follow, and depends a lot on glibc magic. The math lib in FreeBSD is much easier to read, even if sometimes somewhat slower (but not by much).
For complex functions, the main difficulty is border cases - correct NaN/Inf/0 handling is already difficult for real functions, but it is a nightmare for complex functions. The C99 standard defines many corner cases; some functions easily have 10-20 of them. You can look at Annex G of the up-to-date C99 standard document to get an idea. There is also a difficulty with long double, because its format is not standardized - in my experience, you should expect quite a few bugs with long double. Hopefully, the upcoming revised version of IEEE754 with extended precision will improve the situation.
Most modern hardware include floating point units that implement these functions very efficiently.
Usage: root(number,root,depth)
Example: root(16,2) == sqrt(16) == 4
Example: root(16,2,2) == sqrt(sqrt(16)) == 2
Example: root(64,3) == 4
Implementation in C#:
double root(double number, double root, double depth = 1.0)
{
    return Math.Pow(number, Math.Pow(root, -depth));
}
Usage: Sqrt(Number,depth)
Example: Sqrt(16) == 4
Example: Sqrt(8,2) == sqrt(sqrt(8))
double Sqrt(double number, double depth = 1) { return root(number, 2, depth); }
By: Imk0tter