Mathematical Guarantee in Trigonometric Functions of CUDA - c++

CUDA supports mathematical functions. But do they provide any guarantee, for example that if I compute sin(x) the result will be the closest representable value to the mathematical value of sin(x)? If the answer is no, is there any alternative if we want to stay on the GPU? Something like a function that always returns an upper bound or always a lower bound (but possibly not the closest possible one).

The appendix on mathematical functions in the CUDA programming guide suggests that the values returned by the API are not guaranteed to be the closest representable ones. Results can be off by 1 or 2 ulps (for example, when host and device results are compared).
However, for most practical purposes these values are accurate enough.
In any case, if you need to perform operations with very high precision, float precision will not be sufficient.

Related

Is cos(x) required to return identical values in different C++ implementations that use IEEE-754?

Is there any sort of guarantee - either in the C++ standard or in some other document - that C++ code computing cos(x) will produce identical values when compiled with g++, clang, MSVC, etc., assuming those implementations are using IEEE-754 64-bit doubles and the input value x is exactly equal? My assumption is "yes," but I'd like to confirm that before relying on this behavior.
Context: I'm teaching a course in which students may need to compute trigonometric functions of inputs. I can ensure that those inputs are identical when fed into the functions. I'm aware that equality-testing doubles is not a good idea, but in this specific case I was wondering if it was safe to do so.
cos is a transcendental function. Transcendental functions are subject to the table-maker's dilemma. Informally, what this means is: let's say you come up with some iterative algorithm for approximating the cosine of an input value: for example, a Taylor series. When you run this iterative algorithm, you have to decide how much extra precision to keep at the intermediate stages (rounding too early may reduce the accuracy of the final result). But because the function is transcendental, it's very difficult to determine how many extra bits must be carried during the calculation in order to yield a correctly rounded final result, and for some input values, the number of extra bits required might be very large.
For this reason, it is generally not practical to design hardware that guarantees correctly rounded results for transcendental functions such as cos (where "correctly rounded" means that the resulting floating point value is the one that's closest to the true real value of the function). Instead, the hardware designers will implement a calculation technique that performs reasonably well and that, for most practical input values, will yield a result that is within 1 bit of the exact real result. (If you absolutely need a cosine function that always yields a correctly rounded result, then apparently it's possible to implement one: GNU MPFR claims to have done it. But this will perform much worse than hardware.)
IEEE 754 (2008) lists cos as one of the "recommended correctly rounded functions", which means that if you implement IEEE 754's version of cos, then you have to yield a correctly rounded result. But these functions are only "recommended" to be provided, and not required. Therefore, a conforming implementation of IEEE 754 might not provide a correctly rounded cos function, and might instead provide a "practical" cos function as described in the previous paragraph. Therefore, in practice, two implementations of C++ which both claim to be IEEE 754 compliant may not yield the exact same value for a transcendental function such as cos when applied to the same argument.
(Note that IEEE 754 requires implementations to provide a square root function that is correctly rounded. This is not a transcendental function, so correctly rounding it is not nearly as difficult.)
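To see how far apart two implementations' results really are, it can help to measure the gap in units in the last place instead of comparing decimal digits. The following is a minimal sketch, assuming IEEE 754 64-bit doubles; ordered_bits and ulp_distance are hypothetical helper names, not part of any standard library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a double's bit pattern onto an integer scale that increases
// monotonically with the value, so adjacent doubles differ by exactly 1.
static std::int64_t ordered_bits(double x) {
    static_assert(sizeof(std::int64_t) == sizeof(double), "expects 64-bit double");
    std::int64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // Negative doubles are stored as sign + magnitude; mirror them onto
    // the negative side of the integer scale (-0.0 maps to 0).
    return bits < 0 ? std::numeric_limits<std::int64_t>::min() - bits : bits;
}

// Number of representable doubles between a and b (0 means bit-identical,
// except that +0.0 and -0.0 also count as distance 0).
std::uint64_t ulp_distance(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    const std::int64_t ia = ordered_bits(a);
    const std::int64_t ib = ordered_bits(b);
    // Subtract as unsigned so values of opposite sign cannot overflow.
    const std::uint64_t ua = static_cast<std::uint64_t>(ia);
    const std::uint64_t ub = static_cast<std::uint64_t>(ib);
    return ia > ib ? ua - ub : ub - ua;
}
```

For example, calling ulp_distance on two cos results for the same input, captured from two different toolchains, shows whether they are identical (0), adjacent doubles (1), or further apart.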

Precision in computations c++ Cplex

I implemented my mathematical model using ILOG CPLEX with C++. Most of my decision variables have fractional values in the optimal solutions. Some of them are so small that CPLEX outputs them as 0. Is there a way to increase the precision so that I can still see the values of such variables?
Also, when I use cplex.getBestObjValue(), it gives me "-Inf". (This is a maximization problem.)
Having values for integer variables that are close to (but not exactly) integer values is quite normal. CPLEX has an integrality tolerance so that these values are accepted as close enough to the correct integer values. Just use standard C++ output functions to output these values to whatever precision you want.
Mostly this is not a problem, but you can set the integrality tolerance to a smaller value if necessary. I normally round these values to the nearest integer value and use that as my solution. You can also try re-solving your model with those decision variables fixed to their rounded integer values to be sure the solution really is valid. If you are not sure that is sufficient, try Alex's suggestion for numerical precision emphasis too.
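Printing with standard C++ output functions at a chosen precision could look like the sketch below. format_value is a hypothetical helper name; scientific notation is used so that very small values do not collapse to 0 in the output.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format a value in scientific notation with the requested number of
// digits after the decimal point.
std::string format_value(double value, int digits) {
    assert(digits >= 0);
    std::ostringstream out;
    out << std::scientific << std::setprecision(digits) << value;
    return out.str();
}
```

The value passed in would be whatever you read back from the solver for a variable; the helper itself does not depend on CPLEX.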
You could try the setting IloCplex::Param::Emphasis::Numerical:
Emphasizes precision in numerically unstable or difficult problems. This parameter lets you specify to CPLEX that it should emphasize precision in numerically difficult or unstable problems, with consequent performance trade-offs in time and memory.
About your second question: is your model an LP?
regards

If two languages follow IEEE 754, will calculations in both languages result in the same answers?

I'm in the process of converting a program from Scilab code to C++. One loop in particular is producing a slightly different result than the original Scilab code (it's a long piece of code so I'm not going to include it in the question but I'll try my best to summarise the issue below).
The problem is, each step of the loop uses calculations from the previous step. Additionally, the difference between calculations only becomes apparent around the 100,000th iteration (out of approximately 300,000).
Note: I'm comparing the output of my C++ program with the outputs of Scilab 5.5.2 using the "format(25);" command. Meaning I'm comparing 25 significant digits. I'd also like to point out I understand how precision cannot be guaranteed after a certain number of bits but read the sections below before commenting. So far, all calculations have been identical up to 25 digits between the two languages.
In attempts to get to the bottom of this issue, so far I've tried:
Examining the data type being used:
I've managed to confirm that Scilab is using IEEE 754 doubles (according to the language documentation). Also, according to Wikipedia, C++ isn't required to use IEEE 754 for doubles, but from what I can tell, everywhere I use a double in C++ it has perfectly matched Scilab's results.
Examining the use of transcendental functions:
I've also read from What Every Computer Scientist Should Know About Floating-Point Arithmetic that IEEE does not require transcendental functions to be exactly rounded. With that in mind, I've compared the results of these functions (sin(), cos(), exp()) in both languages and again, the results appear to be the same (up to 25 digits).
The use of other functions and predefined values:
I repeated the above steps for sqrt() and pow(), as well as for the value of Pi (I'm using M_PI in C++ and %pi in Scilab). Again, the results were the same.
Lastly, I've rewritten the loop (very carefully) in order to ensure that the code is identical between the two languages.
Note: Interestingly, I noticed that for all the above calculations the results between the two languages match farther than the actual result of the calculations (outside of floating point arithmetic). For example:
Value of sin(x) using Wolfram Alpha = 0.123456789.....
Value of sin(x) using Scilab & C++ = 0.12345yyyyy.....
Even once the value computed using Scilab or C++ started to differ from the actual result (from Wolfram), the two languages' results still matched each other. This leads me to believe that most of the values are being calculated in the same way in the two languages, even though they're not required to be by IEEE 754.
My original thinking was that one of the first three points above is implemented differently between the two languages. But from what I can tell everything seems to produce identical results.
Is it possible that even though all the inputs to these loops are identical, the results can be different? Possibly because a very small error (past what I can see with 25 digits) is occurring that accumulates over time? If so, how can I go about fixing this issue?
No, the format of the numbering system does not guarantee equivalent answers from functions in different languages.
Functions, such as sin(x), can be implemented in different ways, using the same language (as well as different languages). The sin(x) function is an excellent example. Many implementations will use a look-up table or look-up table with interpolation. This has speed advantages. However, some implementations may use a Taylor Series to evaluate the function. Some implementations may use polynomials to come up with a close approximation.
Having the same numeric format is one hurdle to solve between languages. Function implementation is another.
Remember, you need to consider the platform as well. A program that uses an 80-bit floating point processor will have different results than a program that uses a 64-bit floating point software implementation.
Some architectures provide the capability of using extended precision floating point registers (e.g. 80 bits internally, versus 64-bit values in RAM). So, it's possible to get slightly different results for the same calculation, depending on how the computations are structured, and the optimization level used to compile the code.
Yes, it's possible to get different results. It's possible even if you are using exactly the same source code in the same programming language on the same platform. Sometimes it's enough to have a different compiler switch; for example, -ffast-math would lead the compiler to optimize your code for speed rather than accuracy, and, if your computational problem is not well-conditioned to begin with, the result may be significantly different.
For example, suppose you have this code:
x_8th = x*x*x*x*x*x*x*x;
One way to compute this is to perform 7 multiplications. This would be the default behavior for most compilers. However, you may want to speed this up by specifying the compiler option -ffast-math, and the resulting code would have only 3 multiplications:
temp1 = x*x; temp2 = temp1*temp1; x_8th = temp2*temp2;
The result would be slightly different because finite precision arithmetic is not associative, but sufficiently close for most applications and much faster. However, if your computation is not well-conditioned that small error can quickly get amplified into a large one.
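The two evaluation orders above can be written out by hand to compare them without relying on any compiler flag. This is only a sketch; pow8_sequential, pow8_squaring and relative_difference are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cmath>

// x^8 with 7 multiplications, evaluated strictly left to right.
double pow8_sequential(double x) {
    double result = x;
    for (int i = 0; i < 7; ++i) {
        result *= x;
    }
    return result;
}

// x^8 with 3 multiplications by repeated squaring.
double pow8_squaring(double x) {
    const double x2 = x * x;
    const double x4 = x2 * x2;
    return x4 * x4;
}

// Relative difference between two results, measured against b.
double relative_difference(double a, double b) {
    assert(b != 0.0);
    return std::fabs(a - b) / std::fabs(b);
}
```

For inputs such as 2.0, where every intermediate product is exactly representable, both functions return the same value. For other inputs the results may differ in the last few bits, which is the small error that an ill-conditioned computation can then amplify.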
Note that it is possible that Scilab and C++ are not using the exact same instruction sequence, or that one uses the FPU and the other uses SSE, so there may not be a way to get them to be exactly the same.
As commented by IInspectable, if your compiler has _control87() or something similar, you can use it to change the precision and/or rounding settings. You could try combinations of these to see if they have any effect, but again, even if you manage to get the settings identical for Scilab and C++, differences in the actual instruction sequences may be the issue.
http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx
If SSE is used, I'm not sure what can be adjusted, as I don't think SSE has an 80-bit precision mode.
In the case of using FPU in 32 bit mode, and if your compiler doesn't have something like _control87, you could use assembly code. If inline assembly is not allowed, you would need to call an assembly function. This example is from an old test program:
static short fcw; /* 16 bit floating point control word */
/* ... */
/* set precision control to extended precision */
__asm{
fnstcw fcw
or fcw,0300h
fldcw fcw
}

How can I get consistent program behavior when using floats?

I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomenon I'm describing:
float f1 = 0.000001f, f2 = 0.000002f;
f1 += 0.000004f; // This part happens first here
f1 += (f2 * 0.000003f);
printf("%.16f\n", f1);
f1 = 0.000001f, f2 = 0.000002f;
f1 += (f2 * 0.000003f);
f1 += 0.000004f; // This time this happens second
printf("%.16f\n", f1);
The output of this program is
0.0000050000057854
0.0000050000062402
even though, mathematically, the order of the additions should not matter and both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit: the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem since the variations would still be there, they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range and you don't need divisions, you may be able to scale the problem and use exact operations on integers. In general, these conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be used to compute an interval rather than an individual value (possibly with 3 values resulting from round up, round down, and round to nearest). Again, it won't work for everything but you may get an error estimate out of this.
Keeping track of the value and the number of operations (possibly with multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
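The rounding-mode suggestion above can be sketched with the standard <cfenv> facilities. This is a minimal illustration for a single addition, assuming the platform defines FE_DOWNWARD and FE_UPWARD and that the compiler honours run-time rounding-mode changes (some need a flag such as -frounding-math for that); Interval and add_enclosed are hypothetical names.

```cpp
#include <cassert>
#include <cfenv>

struct Interval {
    double lo;  // sum rounded toward negative infinity
    double hi;  // sum rounded toward positive infinity
};

// Add a and b twice, once rounding down and once rounding up, so that the
// exact sum lies somewhere in [lo, hi].
Interval add_enclosed(double a, double b) {
    // volatile keeps the compiler from folding the additions at compile
    // time or moving them across the rounding-mode changes.
    volatile double x = a;
    volatile double y = b;
    const int saved_mode = std::fegetround();

    int rc = std::fesetround(FE_DOWNWARD);
    assert(rc == 0);
    volatile double lo = x + y;

    rc = std::fesetround(FE_UPWARD);
    assert(rc == 0);
    volatile double hi = x + y;

    std::fesetround(saved_mode);  // restore the caller's rounding mode
    (void)rc;
    return Interval{lo, hi};
}
```

The width hi - lo gives a rough error estimate for that one operation; a full simulation would have to propagate such intervals through every step, which is where the performance cost comes from.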
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
In most cases I'm aware of, people experiment briefly with some of the methods mentioned above, conclude that the model is imprecise anyway, and don't bother with the effort. Also, using something other than float may yield better results but is often just too slow; even double can be, due to the doubled memory footprint and the reduced opportunity for SIMD operations.
I recommend that you single-step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which ones work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As for the epsilon approach, you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single-precision floating point values, as the exponent is 8 bits wide. Some exponents will be the result of exceptions, but for simplicity it is better to have a 256-member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon for the case where the exponent is 0, i.e. the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to suit base 2, i.e. a value that can be exactly represented in the single-precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4, and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
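The per-exponent scaling can also be computed on the fly rather than looked up. The sketch below does that with std::ilogb and std::ldexp instead of the 256-entry table described above, so it is a variation on the idea rather than the exact scheme; scaled_epsilon is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Scale base_epsilon (chosen for values in [1.0, 2.0)) by the binary
// exponent of value: halved for [0.5, 1.0), doubled for [2.0, 4.0), etc.
float scaled_epsilon(float value, float base_epsilon) {
    // Zero, infinities and NaN have no usable exponent here.
    assert(std::isfinite(value) && value != 0.0f);
    const int exponent = std::ilogb(value);  // floor(log2(|value|))
    return std::ldexp(base_epsilon, exponent);
}
```

If base_epsilon is a power of two, every scaled epsilon is exactly representable until it underflows, which matches the advice to pick an epsilon that needs no rounding itself.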
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles. There are many techniques out there, and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (saw a great guide on this, if I find it I'll link it).
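One well-known building block for the two-float idea is the error-free "two-sum" transformation, which returns the rounded sum together with the rounding error that was lost. Whether this is the technique from the guide mentioned above is an assumption; the sketch also assumes float arithmetic is evaluated in float precision and that options like -ffast-math are off. FloatPair and two_sum are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct FloatPair {
    float hi;  // the rounded sum
    float lo;  // the rounding error, so hi + lo equals a + b exactly
};

// Knuth's two-sum: recover what a + b lost to rounding, using only
// float additions and subtractions.
FloatPair two_sum(float a, float b) {
    assert(std::isfinite(a) && std::isfinite(b));
    const float sum = a + b;
    const float b_virtual = sum - a;          // the part of b that made it into sum
    const float a_virtual = sum - b_virtual;  // the part of a that made it into sum
    const float error = (a - a_virtual) + (b - b_virtual);
    return FloatPair{sum, error};
}
```

Carrying the lo part along and folding it back in at later steps is what lets a pair of floats behave like a value with roughly twice the precision.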
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point numbers for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it won't make that set smaller. If your model itself is chaotic, then it's chaotic.

Floating point versus fixed point: what are the pros/cons?

Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in the exponent sense, but not if one wants to store numbers with more precision within a certain range, for example only using integer parts from -16 to 16 and thus using more bits to hold digits past the radix.
In terms of performance, which one is faster, or are there cases where one is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed type ?
Why isn't there any fixed type in C/C++ ?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed-point.
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than the FPU. When vectorization is used, the efficiency of vectorization matters a lot more, such that the performance difference between fixed-point and floating-point is moot.
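The 16-bit multiplication described above might look like the following sketch, assuming the common Q15 layout (1 sign bit, 15 fractional bits) and a compiler that shifts negative values arithmetically, as mainstream compilers do; q15_mul is a hypothetical name.

```cpp
#include <cstdint>

// Multiply two Q15 fixed-point values (range [-1.0, 1.0), scale 2^-15).
std::int16_t q15_mul(std::int16_t a, std::int16_t b) {
    // The full product needs up to 31 bits, so widen before multiplying.
    std::int32_t wide = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    wide += 1 << 14;  // add half of the final unit to round to nearest
    wide >>= 15;      // renormalize from Q30 back to Q15
    // Saturate: -1.0 * -1.0 would otherwise overflow the 16-bit result.
    if (wide > INT16_MAX) wide = INT16_MAX;
    if (wide < INT16_MIN) wide = INT16_MIN;
    return static_cast<std::int16_t>(wide);
}
```

With 16384 representing 0.5, q15_mul(16384, 16384) gives 8192, i.e. 0.25.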
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions by using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example, answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages and the need for fixed point is somewhat domain specific, and also perhaps because on a general purpose processor there is typically no direct hardware support for fixed point.
In C++ defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Int math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address calculation instructions to optimize an add/multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed point math. You just define a struct with one int and overload the appropriate operators, making them do what they normally do plus a shift to put the point back in the right position.
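A minimal sketch of such a type, assuming a 16.16 layout (16 integer bits, 16 fractional bits) and ignoring overflow and rounding; Fixed16, kFractionBits, from_double and to_double are hypothetical names, and only + and * are shown.

```cpp
#include <cstdint>

constexpr int kFractionBits = 16;  // number of bits after the binary point

struct Fixed16 {
    std::int32_t raw;  // the value multiplied by 2^16

    static Fixed16 from_double(double v) {
        return Fixed16{static_cast<std::int32_t>(v * (1 << kFractionBits))};
    }
    double to_double() const {
        return static_cast<double>(raw) / (1 << kFractionBits);
    }
};

// Addition is plain integer addition: both operands share the same scale.
inline Fixed16 operator+(Fixed16 a, Fixed16 b) {
    return Fixed16{a.raw + b.raw};
}

// Multiplication doubles the scale, so widen to 64 bits and shift the
// point back to its place afterwards.
inline Fixed16 operator*(Fixed16 a, Fixed16 b) {
    const std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
    return Fixed16{static_cast<std::int32_t>(wide >> kFractionBits)};
}
```

For example, Fixed16::from_double(1.5) * Fixed16::from_double(2.0) converts back to 3.0. A production type would also need division, comparison, and some policy for overflow.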
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed and that is not the reason; it has to do with ease of programming.
For example you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400. Project the ray from the eye to the point on the object in the world and compute where it pierces the screen, using 0 to 1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy number: what is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating point math at its core uses fixed point logic with extra logic and extra steps required. By definition, compared apples to apples, fixed point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. With both float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it depends heavily on the specific hardware architecture, the FPU usually consumes much more energy than the ALU in a CPU, so if you target mobile applications where power consumption is important it's worth considering a fixed point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the decimal point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand, and a standard format is used to represent the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired to be optimised for floating point operations. Furthermore, most modern CPUs come with SIMD instruction sets, which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8bit and 16bit) you may not have an FPU nor extended instruction sets. In which case you may be forced to use fixed point methods or the limited floating point instruction sets that are not very fast. So in these circumstances fixed point will be a better - or even your only - choice.