Floating-point optimizations - guideline - c++

The majority of scientific computing problems that we need to solve by implementing a particular algorithm in C/C++ demand accuracy that is much lower than double precision. For example, 1e-6 or 1e-7 accuracy covers 99% of the cases for ODE solvers or numerical integration. Even in the rare cases when we do need higher accuracy, the numerical method itself usually fails before we can dream of reaching an accuracy near double precision. Example: we can't expect 1e-16 accuracy from a simple Runge–Kutta method, even when solving a standard nonstiff ordinary differential equation, because of roundoff errors. In this case, the double precision requirement is analogous to asking for a better approximation of the wrong answer.
Aggressive floating-point optimizations therefore seem to be a win-win situation in most cases, because they make your code faster (a lot faster!) and do not affect the target accuracy of your particular problem. That said, it seems remarkably difficult to make sure that a particular implementation is stable against fp optimizations. Classical (and somewhat disturbing) example: GSL, the GNU Scientific Library, is not only the standard numerical library in the market but also a very well written library (I can't imagine myself doing a better job). However, GSL is not stable against fp optimizations. In fact, if you compile GSL with the Intel compiler, for example, its internal tests will fail unless you turn on the -fp-model strict flag, which turns off fp optimizations.
Thus, my question is: are there general guidelines for writing code that is stable against aggressive floating-point optimizations? Are these guidelines language (compiler) specific? If so, what are the C/C++ (gcc/icc) best practices?
Note 1: This question is not asking what are the fp optimizations flags in gcc/icc.
Note 2: This question is not asking about general guidelines for C/C++ optimization (like don't use virtual functions for small functions that are called a lot).
Note 3: This question is not asking for the list of most standard fp optimizations (like x/x -> 1).
Note 4: I strongly believe this is NOT a subjective/off-topic question similar to the classical "The Coolest Server Names". If you disagree (because I am not providing a concrete example/code/problem), please flag it as community wiki. I am much more interested in the answer than in gaining a few status points (not that they are not important - you get the point!).

Compiler makers justify the -ffast-math kind of optimizations with the assertion that these optimizations' influence over numerically stable algorithms is minimal.
Therefore, if you want to write code that is robust against these optimizations, a sufficient condition is to write only numerically stable code.
Now your question may be, “How do I write numerically stable code?”. This is where your question may be a bit broad: there are entire books dedicated to the subject. The Wikipedia page I already linked to has a good example, and here is another good one. I could not recommend a book in particular, this is not my area of expertise.
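As a small illustration of what "numerically stable" means in practice, here is a sketch that computes exp(x) - 1 for small x in two ways. The function names are made up for this example; std::exp and std::expm1 are the standard <cmath> functions. The naive form subtracts two nearly equal numbers and loses most of its significant digits, while the library routine designed for this case does not.

```cpp
#include <cmath>

// Naive form: for small x, exp(x) is very close to 1, so the subtraction
// cancels most of the significant digits.
double naive_expm1(double x) {
    return std::exp(x) - 1.0;
}

// Stable form: std::expm1 is designed to stay accurate near x = 0.
double stable_expm1(double x) {
    return std::expm1(x);
}

// Relative error of an approximation with respect to a reference value.
double relative_error(double approx, double exact) {
    return std::fabs(approx - exact) / std::fabs(exact);
}
```

For x around 1e-10, a two-term Taylor series x + x*x/2 is a good enough reference to see the difference between the two versions.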
Note 1: Numerical stability's desirability goes beyond compiler optimization. If you have choice, write numerically stable code even if you do not plan to use -ffast-math-style optimizations. Numerically unstable code may provide wrong results even when compiled with strict IEEE 754 floating-point semantics.
Note 2: You cannot expect external libraries to work when compiled with -ffast-math-style flags. These libraries, written by floating-point experts, may need to play subtle tricks with the properties of IEEE 754 computations. This kind of trick may be broken by -ffast-math optimizations, yet it improves performance more than you could expect the compiler to, even if you let it. For floating-point computations, an expert with domain knowledge beats the compiler every time. One example among many is the triple-double implementation found in CRlibm. This code breaks if it is not compiled with strict IEEE 754 semantics. Another, more elementary algorithm that compiler optimizations break is Kahan summation: when compiled with unsafe optimizations, c = (t - sum) - y is optimized to c = 0. This, of course, defeats the purpose of the algorithm completely.
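For reference, a minimal sketch of Kahan summation, using the same variable names as the expression quoted above. The function names kahan_sum and naive_sum are chosen for this example only. The key line is algebraically zero, which is exactly why an optimizer that treats floating-point arithmetic as real arithmetic may remove it.

```cpp
#include <vector>

// Compensated (Kahan) summation. Relies on strict IEEE 754 evaluation:
// with -ffast-math-style flags the compensation may be optimized away.
double kahan_sum(const std::vector<double>& values) {
    double sum = 0.0;
    double c = 0.0;  // running compensation for lost low-order bits
    for (double v : values) {
        double y = v - c;    // apply the correction from the previous step
        double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;   // recover what was lost (zero in real arithmetic)
        sum = t;
    }
    return sum;
}

// Plain left-to-right summation, for comparison.
double naive_sum(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) {
        sum += v;
    }
    return sum;
}
```

Summing 1.0 followed by several values of 1e-16 shows the difference: each small term is lost individually by the naive loop, while the compensated loop accumulates them until they become visible.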

Related

Can -ffast-math be safely used on a typical project?

While answering a question where I suggested -ffast-math, a comment pointed out that it is dangerous.
My personal feeling is that outside scientific calculations, it is OK. I also assume that serious financial applications use fixed point instead of floating point.
Of course if you want to use it in your project the ultimate answer is to test it on your project and see how much it affects it. But I think a general answer can be given by people who tried and have experience with such optimizations:
Can -ffast-math be used safely on a normal project?
Given that IEEE 754 floating point has rounding errors, the assumption is that you are already living with inexact calculations.
This answer was particularly illuminating on the fact that -ffast-math does much more than reorder operations in a way that gives a slightly different result (it does not check for NaN or zero and disables signed zero, just to name a few), but I fail to see what the effects of these would ultimately be in real code.
I tried to think of typical uses of floating points, and this is what I came up with:
GUI (2D, 3D, physics engine, animations)
automation (e.g. car electronics)
robotics
industrial measurements (e.g. voltage)
and school projects, but those don't really matter here.
One of the especially dangerous things it does is imply -ffinite-math-only, which allows explicit NaN tests to pretend that no NaNs ever exist. That's bad news for any code that explicitly handles NaNs. It would try to test for NaN, but the test will lie through its teeth and claim that nothing is ever NaN, even when it is.
This can have really obvious results, such as letting NaN bubble up to the user when previously they would have been filtered out at some point. That's bad of course, but probably you'll notice and fix it.
A more insidious problem arises when NaN checks were there for error checking, for something that really isn't supposed to ever be NaN. But perhaps through some bug, bad data, or through other effects of -ffast-math, it becomes NaN anyway. And now you're not checking for it, because by assumption nothing is ever NaN, so isnan is a synonym of false. Things will go wrong, spuriously and long after you've already shipped your software, and you will get an "impossible" error report - you did check for NaN, it's right there in the code, it cannot be failing! But it is, because someone someday added -ffast-math to the flags, maybe you even did it yourself, not knowing fully what it would do or having forgotten that you used a NaN check.
So then we might ask, is that normal? That's getting quite subjective, but I would not say that checking for NaN is especially abnormal. Going fully circular and asserting that it isn't normal because -ffast-math breaks it is probably a bad idea.
It does a lot of other scary things as well, as detailed in other answers.
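If a NaN check has to keep working in code that might be built with -ffinite-math-only, one approach is to inspect the bit pattern with integer operations instead of relying on a floating-point comparison. The sketch below assumes that double is an IEEE 754 binary64 value; is_nan_bits is a name made up for this example, and whether a given compiler leaves such a check alone under its fast-math flags is something to verify rather than assume.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Bit-level NaN test: exponent field all ones and a nonzero fraction.
// Assumes IEEE 754 binary64 layout for double.
bool is_nan_bits(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bits
    const std::uint64_t exponent_mask = 0x7FF0000000000000ULL;
    const std::uint64_t fraction_mask = 0x000FFFFFFFFFFFFFULL;
    return (bits & exponent_mask) == exponent_mask &&
           (bits & fraction_mask) != 0;
}
```

Infinities have the same all-ones exponent but a zero fraction, so they are correctly reported as not NaN.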
I wouldn't recommend avoiding this option outright, but I recall one instance where unexpected floating-point behavior struck back.
The code contained an innocent-looking construct like this:
float X, XMin, Y;
if (X < XMin)
{
Y= 1 / (XMin - X);
}
This sometimes raised a division by zero error, because the comparison was carried out with the full 80-bit representation (Intel FPU), while the later subtraction was performed on values truncated to the 32-bit representation, which could then be equal.
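One way to make that construct robust is to compute the difference once, force it to its declared 32-bit width, and test the very value that will be used as the divisor. This is only a sketch: reciprocal_gap is a hypothetical helper, std::optional requires C++17, and the volatile store is an assumption about how to force rounding to float rather than something taken from the original code.

```cpp
#include <optional>

// Returns 1 / (x_min - x) when the difference, rounded to float, is
// strictly positive; otherwise returns an empty optional.
std::optional<float> reciprocal_gap(float x, float x_min) {
    // Storing through a volatile float forces the difference to be rounded
    // to 32 bits, so the test and the division see the same value.
    volatile float stored_gap = x_min - x;
    const float gap = stored_gap;
    if (gap > 0.0f) {
        return 1.0f / gap;
    }
    return std::nullopt;
}
```

The point of the design is that the guard and the division use one and the same rounded quantity, instead of comparing at one precision and subtracting at another.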
The short answer: No, you cannot safely use -ffast-math except on code designed to be used with it. There are all sorts of important constructs for which it generates completely wrong results. In particular, for arbitrarily large x, there are expressions with correct value x but which will evaluate to 0 with -ffast-math, or vice versa.
As a more relaxed rule, if you're certain the code you're compiling was written by someone who doesn't actually understand floating point math, using -ffast-math probably won't make the results any more wrong (vs. the programmer's intent) than they already were. Such a programmer will not be performing intentional rounding or other operations that badly break, probably won't be using NaNs and infinities, etc. The most likely negative consequence is having computations that already had precision problems blow up and get worse. I would argue that this kind of code is already bad enough that you should not be using it in production to begin with, with or without -ffast-math.
From personal experience, I've had enough spurious bug reports from users trying to use -ffast-math (or even who have it buried in their default CFLAGS, ugh!) that I'm strongly leaning towards putting the following fragment in any code with floating point math:
#ifdef __FAST_MATH__
#error "-ffast-math is broken, don't use it"
#endif
If you still want to use -ffast-math in production, you need to actually spend the effort (lots of code review hours) to determine if it's safe. Before doing that, you probably want to first measure whether there's any benefit that would be worth spending those hours, and the answer is likely no.
Update several years later: As it turns out, -ffast-math gives GCC license to make transformations that effectively introduce undefined behavior into your program, causing miscompilation with arbitrarily large fallout. See for example PR 93806 and related bugs. So really, no, it's never safe to use.
Given that IEEE 754 floating point has rounding errors, the assumption is that you are already living with inexact calculations.
The question you should answer is not whether the program expects inexact computations (it had better expect them, or it will break with or without -ffast-math), but whether the program expects approximations to be exactly those predicted by IEEE 754, and special values that behave exactly as predicted by IEEE 754 as well; or whether the program is designed to work fine with the weaker hypothesis that each operation introduces a small unpredictable relative error.
Many algorithms do not make use of special values (infinities, NaN) and are designed to work well in a computation model in which each operation introduces a small nondeterministic relative error. These algorithms work well with -ffast-math, because they do not use the hypothesis that the error of each operation is exactly the error predicted by IEEE 754. The algorithms also work fine when the rounding mode is other than the default round-to-nearest: the error in the end may be larger (or smaller), but a FPU in round-upwards mode also implements the computation model that these algorithms expect, so they work more or less identically well in these conditions.
Other algorithms (for instance Kahan summation, “double-double” libraries in which numbers are represented as the sum of two doubles) expect the rules to be respected to the letter, because they contain smart shortcuts based on subtle behaviors of IEEE 754 arithmetic. You can recognize these algorithms by the fact that they do not work when the rounding mode is other than expected either. I once asked a question about designing double-double operations that would work in all rounding modes (for library functions that may be pre-empted without a chance to restore the rounding mode): it is extra work, and these adapted implementations still do not work with -ffast-math.
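The basic building block of such double-double libraries is an error-free transformation. The sketch below shows the classic TwoSum step (usually attributed to Knuth): it returns the rounded sum together with the exact rounding error. The struct and function names are chosen for this example, and the result is only valid with strict IEEE 754 double arithmetic in round-to-nearest; in real arithmetic the error term is identically zero, which is why unsafe optimizations may delete it.

```cpp
// Result of an error-free addition: a + b == sum + err exactly
// (barring overflow), with sum being the rounded floating-point sum.
struct TwoSumResult {
    double sum;
    double err;
};

// TwoSum: six floating-point operations, no branches.
TwoSumResult two_sum(double a, double b) {
    const double s = a + b;
    const double b_virtual = s - a;          // the part of b that made it into s
    const double a_virtual = s - b_virtual;  // the part of a that made it into s
    const double err = (a - a_virtual) + (b - b_virtual);
    return {s, err};
}
```

Adding 1e-17 to 1.0 gives a rounded sum of exactly 1.0, and the lost 1e-17 is returned in err.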
Yes, you can use -ffast-math on normal projects, for an appropriate definition of "normal projects." That includes probably 95% of all programs written.
But then again, 95% of all programs written would not benefit much from -ffast-math either, because they don't do enough floating point math for it to be important.
Yes, they can be used safely, provided that you know what you are doing. This implies that you understand that they represent magnitudes, not exact values. This means:
You always do a sanity check on any external fp input.
You never divide by 0.
You never check for equality, unless you know it is an integer with an absolute value below the max value of the mantissa.
etc.
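The "never check for equality" rule is usually applied with a tolerance-based comparison. Here is a minimal sketch; nearly_equal and its default tolerances are assumptions made for this example and have to be chosen per application. It also assumes finite inputs, in line with the sanity-check rule above.

```cpp
#include <algorithm>
#include <cmath>

// Compares two finite doubles using a combined relative and absolute
// tolerance. The default tolerances are illustrative only.
bool nearly_equal(double a, double b,
                  double rel_tol = 1e-9, double abs_tol = 1e-12) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    // The absolute tolerance handles values near zero, where a purely
    // relative test would be too strict.
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

With this helper, 0.1 + 0.2 compares as nearly equal to 0.3 even though the two doubles are not identical.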
In fact, I would argue the converse. Unless you are working in very specific applications where NaNs and denormals have meaning, or you really need that tiny incremental bit of reproducibility, -ffast-math should be on by default. That way, your unit tests have a better chance of flushing out errors. Basically, whenever you think fp calculations have either reproducibility or precision, even under IEEE, you are wrong.

gsl_complex vs. std::complex performance

I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then checked how long addition and multiplication took using clock() from <ctime>. Using this method (without compiler optimisation) I got to know that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the way you check which is faster?
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
Turn on optimisations.
Any library or set of functions that you link with will be compiled WITH optimisation (unless the names of the developers are Kermit, Swedish Chef, Miss Piggy (project manager) and Cookie Monster (tester) - in other words, the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled by the compiler settings you give, so the code will be unoptimized. So your question is really "Why is function X faster than function Y that does the same thing, when function X is compiled with optimisation and Y is compiled without optimisation?" - which should really be obvious to answer: "Optimisation works nearly all of the time!" (If optimisation wasn't working most of the time, compiler developers would have a MUCH easier time)
Edit: So my above point has just been proven. Note that since templates can inline the code, it is often more efficient than an external library (because the compiler can just insert the instructions straight into the flow, rather than calling out to another function).
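For the std::complex side of such a comparison, a timing harness could look roughly like the sketch below. The function name is made up for this example, and std::chrono::steady_clock is used here instead of the clock() call mentioned in the question. Returning the accumulated value matters: if the result is never used, an optimizing compiler may drop the loop and the measurement becomes meaningless.

```cpp
#include <chrono>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Multiplies two arrays element-wise, accumulates the products, and
// returns the accumulated value together with the elapsed seconds.
// Both vectors are expected to have the same length.
template <typename T>
std::pair<std::complex<T>, double> timed_multiply_sum(
        const std::vector<std::complex<T>>& a,
        const std::vector<std::complex<T>>& b) {
    const auto start = std::chrono::steady_clock::now();
    std::complex<T> acc(0, 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {acc, elapsed.count()};
}
```

To get numbers worth comparing, build with the optimization level you intend to ship, run each measurement several times, and use arrays large enough that the timer resolution does not dominate.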
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look at ONLY performance, you can sometimes run simple benchmarks to determine that one solution is better or worse than another, but for complex libraries [not "real&imaginary" type complex numbers, but rather "complicated"], there are sometimes optimisations to deal with large amounts of data, where if you use a less sophisticated solution, the "large data" will not achieve the same performance, because less effort has been spent on solving the "big data" type problems. So, if you have a "simple" benchmark that does some basic calculations on a small set of data, and you are, in reality, going to run some much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give you the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performing (that is, pretty much have your code), and have experience with running that with both "packages".
And going on to the rest of the criteria ("ease of use", etc), those are much more "personal opinion" based, so wouldn't be a good fit for an SO question in the first place.
This answer depends not only on the optimization flags, but also on the compiler used to compile the GSL library and your particular code. Example: if you compile GSL with gcc and your program with icc, then you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive floating-point optimizations (example: it does not include the fast-math flag in gcc) because some GSL routines (the differential equation solver, for example) fail their stringent accuracy tests when these optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, then you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive floating-point optimizations (however, you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive floating-point optimizations and test whether accuracy is an issue for your particular problem (that is my favorite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex
typedef struct
{
float dat[2];
}
gsl_complex_float;

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask the SO community and its experts to help me find out which STL algorithms or mathematical calculations can be implemented faster with programming hacks.
It would be great if you can give examples or links.
Thanks in advance.
System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that will be useless or even break when the next generation CPU comes around the corner are not a welcome thing.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.
Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.
Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/
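To illustrate the "many bits at a time" idea, here is a sketch of counting set bits over a packed bit sequence stored in 64-bit words. It is not libc++'s implementation; count_set_bits and the storage layout (bit i lives in word i / 64 at position i % 64) are assumptions made for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the set bits among the first nbits bits of a packed sequence.
// Precondition: words.size() * 64 >= nbits.
std::size_t count_set_bits(const std::vector<std::uint64_t>& words,
                           std::size_t nbits) {
    const std::size_t full_words = nbits / 64;
    const std::size_t remaining = nbits % 64;
    std::size_t total = 0;
    // Whole words: one population count handles 64 bits at once.
    for (std::size_t i = 0; i < full_words; ++i) {
        total += std::bitset<64>(words[i]).count();
    }
    // Partial last word: mask off the bits beyond nbits.
    if (remaining != 0) {
        const std::uint64_t mask = (std::uint64_t{1} << remaining) - 1;
        total += std::bitset<64>(words[full_words] & mask).count();
    }
    return total;
}
```

A naive version would visit each of the nbits positions individually; the word-wise loop does roughly one iteration per 64 bits.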
This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and or standard C library. There is an associated cost in terms of time to implement and maintenance burden of doing so, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is this calculation you're doing slowing you down (a.k.a. you are bound by the CPU)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU Multiple Precision Arithmetic Library (GMP) implements many algorithms from Modern Computer Arithmetic and Seminumerical Algorithms that are all about doing maths on arbitrary precision integers and floats insanely fast. The guys who write it optimise in assembly per build platform - it is about as fast as you can get in single core mode. This is the most general case I can think of for optimised maths, i.e. one that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said, consider that GMP/MPIR have optimised assembly versions for each CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on a specific, small subset of programming problems.
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing finite field arithmetic in Rijndael's finite field you can, based on the knowledge that the field has characteristic 2 and that its elements are polynomials with 8 terms, assume that your elements fit in a uint8_t and that addition and subtraction are both equivalent to xor. How does this work? When you add or subtract two elements, each pair of corresponding coefficients is either zero or one. If they're both zero or both one, the result is zero. If they are different, the result is one. Term by term, that is equivalent to xor across an 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient. You can bet that Rijndael was designed to take advantage of this kind of result.
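The xor argument above can be written down directly. In this sketch the function names are made up; the reduction constant 0x1B corresponds to the AES polynomial x^8 + x^4 + x^3 + x + 1, and the values used to check it ({57} + {83} = {d4}, {57} * {83} = {c1}) are the worked examples from the AES specification (FIPS 197).

```cpp
#include <cstdint>

// Addition (and subtraction) in GF(2^8): coefficient-wise xor.
std::uint8_t gf256_add(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a ^ b);
}

// Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1,
// using shift-and-xor (no lookup tables).
std::uint8_t gf256_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t product = 0;
    for (int bit = 0; bit < 8; ++bit) {
        if (b & 1) {
            product ^= a;  // add the current multiple of a
        }
        const bool overflow = (a & 0x80) != 0;
        a = static_cast<std::uint8_t>(a << 1);  // multiply a by x
        if (overflow) {
            a ^= 0x1B;  // reduce modulo the field polynomial
        }
        b >>= 1;
    }
    return product;
}
```

Production AES code typically replaces the loop with tables or dedicated CPU instructions, but the loop shows why the arithmetic is cheap: it only needs shifts and xors on single bytes.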
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions are purely optimised for cpu speed, because amongst other things STL provides: collections via templates, which are about memory, file access which is about storage, exception handling etc. In short, being really fast is a narrow subset of what STL does and what it aims to achieve. Also, you should note that optimisation has different views. For example, if your app is heavy on IO, you are IO bound. Having a massively efficient square root calculation isn't really helpful since "slowness" really means waiting on the disk/OS/your file parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.
Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the 'net, describing more efficient ways to implement square roots. But this article isn't among them (it doesn't even measure whether the described algorithms are faster!) Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPUs.
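For readers who have not seen it, here is a sketch of the trick discussed above, assuming it is the well-known bit-level inverse square root approximation with the constant 0x5f3759df. The function names are chosen for this example, and std::memcpy replaces the pointer cast of the original to avoid undefined behavior. The sketch makes the accuracy trade-off easy to check; it says nothing about speed, which you would have to measure on your own hardware.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Approximates 1 / sqrt(x) for positive, finite x: an integer trick on
// the bit pattern followed by one Newton-Raphson refinement step.
float fast_rsqrt(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759df - (bits >> 1);    // initial guess from the exponent
    float y;
    std::memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);  // one Newton-Raphson iteration
    return y;
}

// Reference value computed with the standard library.
float exact_rsqrt(float x) {
    return 1.0f / std::sqrt(x);
}
```

After a single refinement step the result is only good to a few parts in a thousand, which is the "less accurate" half of the comparison made above.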
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.

What claims, if any, can be made about the accuracy/precision of floating-point calculations?

I'm working on an application that does a lot of floating-point calculations. We use VC++ on Intel x86 with double precision floating-point values. We make claims that our calculations are accurate to n decimal digits (right now 7, but trying to claim 15).
We go to a lot of effort validating our results against other sources when our results change slightly (due to code refactoring, cleanup, etc.). I know that many, many factors play into the overall precision, such as the FPU control state, the compiler/optimizer, the floating-point model, and the overall order of operations (i.e., the algorithm itself), but given the inherent uncertainty in FP calculations (e.g., 0.1 cannot be represented), it seems invalid to claim any specific degree of precision for all calculations.
My question is this: is it valid to make any claims about the accuracy of FP calculations in general without doing any sort of analysis (such as interval analysis)? If so, what claims can be made and why?
EDIT:
So given that the input data is accurate to, say, n decimal places, can any guarantee be made about the result of any arbitrary calculations, given that double precision is being used? E.g., if the input data has 8 significant decimal digits, the output will have at least 5 significant decimal digits... ?
We are using math libraries and are unaware of any guarantees they may or may not make. The algorithms we use are not necessarily analyzed for precision in any way. But even given a specific algorithm, the implementation will affect the results (just changing the order of two addition operations, for example). Is there any inherent guarantee whatsoever when using, say, double precision?
ANOTHER EDIT:
We do empirically validate our results against other sources. So are we just getting lucky when we achieve, say, 10-digit accuracy?
As with all such questions, I have to just simply answer with the article What Every Computer Scientist Should Know About Floating-Point Arithmetic. It's absolutely indispensable for the type of work you are talking about.
Short answer: No.
Reason: Have you proved (yes proved) that you aren't losing any precision as you go along? Are you sure? Do you understand the intrinsic precision of any library functions you're using for transcendental functions? Have you computed the limits of additive errors? If you are using an iterative algorithm, do you know how well it has converged when you quit? This stuff is hard.
Unless your code uses only the basic operations specified in IEEE 754 (+, -, *, / and square root), you do not even know how much precision loss each call to library functions outside your control (trigonometric functions, exp/log, ...) introduces. Functions outside the basic 5 are not guaranteed to be, and usually are not, precise to 1 ULP.
You can do empirical checks, but that's what they remain... empirical. Don't forget the part about there being no warranty in the EULA of your software!
If your software was safety-critical, and did not call library-implemented mathematical functions, you could consider http://www-list.cea.fr/labos/gb/LSL/fluctuat/index.html . But only critical software is worth the effort and has a chance to fit in the analysis constraints of this tool.
You seem, after your edit, mostly concerned about your compiler doing things behind your back. It is a natural fear to have (because, as with the mathematical functions, you are not in control). But it's rather unlikely to be the problem. Your compiler may compute with a higher precision than you asked for (80-bit extendeds when you asked for 64-bit doubles, or 64-bit doubles when you asked for 32-bit floats). This is allowed by the C99 standard. In round-to-nearest, this may introduce double-rounding errors. But it's only 1 ULP you are losing, and so infrequently that you needn't worry. This can cause puzzling behaviors, as in:
float x=1.0;
float y=7.0;
float z=x/y;
if (z == x/y)
...
else
... /* the else branch is taken */
but you were looking for trouble when you used == between floating-point numbers.
When you have code that does cancelations on purpose, such as in Kahan's summation algorithm:
d = (a+b)-a-b;
and the compiler optimizes that into d=0;, you have a problem. And yes, this optimization "as if floating-point operations were associative" has been seen in general-purpose compilers. It is not allowed by C99. But the situation has gotten better, I think. Compiler authors have become more aware of the dangers of floating-point and no longer try to optimize so aggressively. Plus, if you were doing this in your code you would not be asking this question.
Given that your vendors of machines, compilers, run-time libraries, and operating systems don't make any such claim about floating point accuracy, you should take that to be a warning that your group should be leery of making claims that could come under harsh scrutiny if clients ever took you to court.
Without doing formal verification of the entire system, I would avoid such claims. I work on scientific software that has indirect human safety implications, so we have considered such things in the past, and we do not make this sort of claim.
You could make useless claims about precision of double (length) floating point calculations, but it would be basically worthless.
Ref: The pitfalls of verifying floating-point computations from ACM Transactions on Programming Languages and Systems 30, 3 (2008) 12
No, you cannot make any such claim. If you wanted to do so, you would need to do the following:
Hire an expert in numerical computing to analyze your algorithms.
Either get your library and compiler vendors to open their sources to said expert for analysis, or get them to sign off on hard semantics and error bounds.
Double-precision floating-point typically carries about 15 digits of decimal accuracy, but there are far too many ways for some or all of that accuracy to be lost, and they are far too subtle for a non-expert to diagnose, for you to make the kind of claim you would like to make.
There are relatively simple ways to keep running error bounds that would let you make accuracy claims about any specific computation, but making claims about the accuracy of all computations performed with your software is a much taller order.
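As one illustration of a running error bound (my choice of technique, for the simplest possible case), here is a sketch for a plain left-to-right sum. It assumes round-to-nearest and no overflow or underflow, and it ignores the rounding in the bound's own accumulation, so treat the bound as first-order. The names BoundedSum and sum_with_bound are hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical result type: the computed sum plus a running bound on
// its absolute error.
struct BoundedSum {
    double value;  // computed sum
    double bound;  // approximate bound on |value - exact sum|
};

// Each addition s = fl(s + x) commits an error of at most u * |s|,
// where u is the unit roundoff. Accumulating |s| alongside the sum
// therefore bounds the total error by u * (sum of |partial sums|).
BoundedSum sum_with_bound(const double* xs, std::size_t n) {
    assert(xs != nullptr || n == 0);
    const double u = std::numeric_limits<double>::epsilon() / 2.0;
    double s = 0.0;
    double mu = 0.0;  // running total of |partial sum|
    for (std::size_t i = 0; i < n; ++i) {
        s += xs[i];
        mu += std::fabs(s);
    }
    return {s, u * mu};
}
```

The bound applies to one specific run with one specific input, which is the point: it lets you say something about that computation without claiming anything about all computations your software might perform.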
A double precision number on an Intel CPU has slightly better than 15 significant digits (decimal).
The potential error for a simple computation is in the ballpark of n/1.0e15, where n is the order of magnitude of the number(s) you are working with. I suspect that Intel has specs for the accuracy of CPU-based FP computations.
The potential error for library functions (like cos and log) is usually documented. If not, you can look at the source code (e.g. the GNU source) and calculate it.
You would calculate error bars for your calculations just as you would for manual calculations.
Once you do that, you may be able to reduce the error by judicious ordering of the computations.
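To illustrate how much ordering can matter, here is a small sketch (my own example, not from any library) that sums 1/k^2 in single precision in two different orders. Adding the small terms first lets them accumulate before they meet the large ones. Going largest-first, the terms eventually fall below half an ULP of the running sum and are simply dropped:

```cpp
#include <cassert>

// Sum 1/k^2 for k = 1..n, largest terms first.
float inverse_squares_forward(int n) {
    assert(n > 0);
    float s = 0.0f;
    for (int k = 1; k <= n; ++k) {
        const float kf = static_cast<float>(k);
        s += 1.0f / (kf * kf);
    }
    return s;
}

// Same sum, smallest terms first.
float inverse_squares_backward(int n) {
    assert(n > 0);
    float s = 0.0f;
    for (int k = n; k >= 1; --k) {
        const float kf = static_cast<float>(k);
        s += 1.0f / (kf * kf);
    }
    return s;
}
```

For n around a million, assuming float arithmetic is really carried out in single precision, the backward sum should land much closer to a double-precision reference than the forward one does.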
Since you seem to be concerned about accuracy of arbitrary calculations, here is an approach you can try: run your code with different rounding modes for floating-point calculations. If the results are pretty close to each other, you are probably okay. If the results are not close, you need to start worrying.
The maximum difference between the results gives you a lower bound on the error of the calculations: if two runs disagree by some amount, at least one of them is off by at least half of it.
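A minimal sketch of the rounding-mode experiment using the standard <cfenv> interface follows. The computation under test (accumulate_tenths) and the helper rounding_mode_spread are hypothetical stand-ins for your own code. I am assuming a platform where all four rounding-mode macros are defined, and I use volatile so the compiler cannot fold the sum at compile time; with GCC you would probably also want -frounding-math for real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cfenv>

// Hypothetical computation under test: add 0.1 to itself n times.
// volatile forces every addition to happen at run time, under
// whatever rounding mode is currently set.
double accumulate_tenths(int n) {
    assert(n > 0);
    volatile double term = 0.1;
    volatile double sum = 0.0;
    for (int i = 0; i < n; ++i) sum = sum + term;
    return sum;
}

// Run the computation under each rounding mode and return the largest
// difference between any two results.
double rounding_mode_spread(int n) {
    const int modes[] = {FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO};
    const int saved = std::fegetround();
    double lo = 0.0, hi = 0.0;
    bool first = true;
    for (int mode : modes) {
        if (std::fesetround(mode) != 0) continue;  // mode not supported
        const double r = accumulate_tenths(n);
        lo = first ? r : std::min(lo, r);
        hi = first ? r : std::max(hi, r);
        first = false;
    }
    std::fesetround(saved);  // restore the caller's rounding mode
    return hi - lo;
}
```

A spread that is tiny compared with the accuracy you need is reassuring. A large one tells you the computation is sensitive to rounding and deserves a closer look.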

Platform independent math library

Is there a publicly available library that will produce the exact same results for sin, cos, floor, ceil, exp and log on 32-bit and 64-bit Linux, Solaris and possibly other platforms?
I am considering the following alternatives:
a) cephes compiled with gcc -mfpmath=sse and the same optimization levels on each platform ... but it's not clear that this would work.
b) MPFR, but I am worried that this would be too slow.
Regarding precision (edited): For this particular application I don't really need something that produces the value that is numerically closest to the exact value. I just need the answers to be exactly the same on all platforms, operating systems and "bitness". That being said, the values need to be reasonable (5 digits would probably be enough). I apologize for not having made this clear in my initial question.
I guess MAPM or MPFR with a low enough precision setting might do the trick, but I was hoping to find something that did not have the "multiple precision" machinery/flavor to it. In any case, I will try this out.
Would something like http://lipforge.ens-lyon.fr/www/crlibm/index.html be what you are searching for? This is a library whose aim is to be able to replace the standard math library of C99 (so it keeps good enough performance in the normal cases) while ensuring correctly rounded results according to the IEEE 754 rounding modes.
crlibm is the correct tool for this. An earlier poster linked to it. Because it is correctly rounded, it will deliver bit-identical results on all platforms with IEEE-754 compliant hardware if compiled properly. It is much, much faster than MPFR.
You shouldn't need one. floor and ceil will be exact since their computation is straightforward.
What you are concerned with is rounding of the last bit for the transcendentals like sin, cos, and exp. But these are native to the CPU microcode and can be done in high quality consistently regardless of library. However, the rounding does vary from one chip architecture to another.
So, if exact answers for the transcendentals are indeed your goal, you do need a portable library, and you will also be giving up huge efficiencies by using one.
You could use a portable library like MAPM, which gives you not only consistent ULP results but, as a side benefit, lets you define arbitrary precision.
You can check your math precision with tools like this one and this one.
You mention using SSE. If you're planning on only running on x86 chips, then what exactly are the inconsistencies you're expecting?
As for MPFR, don't worry - test it! By the way, if it's good enough to be included in GCC, it's probably good enough for you.
You want to use MPFR. That library has been around for years and has been ported to every platform under the sun and optimized by tons of people.
If MPFR isn't sufficient for your needs we're talking about full custom ASM implementations in which case it might be more efficient to consider implementing it in dedicated hardware.