Precision differences in MATLAB and C++

I am trying to run equivalence tests on an algorithm written in both C++ and MATLAB.
The algorithm contains a loop in time that runs more than 1000 times. It uses arithmetic operations and some math functions.
I feed the initial inputs to both platforms by hand (like a=1.767, b=6.65, ...), and when I check the hexadecimal representations of those inputs they are the same, so the inputs are not the problem. I pass the C++ outputs to MATLAB through a text file with 16 decimal digits (I use the "setprecision(32)" statement).
But here comes the problem: the results of both codes are exactly the same up to and including step 614, but at step 615 I get a difference. After this step the error grows larger and larger, and at the end of the runs it is about 5.xx..xxe-14.
0x3ff1 3e42 a211 6cca--->[C++ function]--->0x3ff4 7619 7005 5a42
0x3ff1 3e42 a211 6cca--->[MATLAB function]--->ans
ans - 0x3ff4 7619 7005 5a42
I searched for how MATLAB handles numbers and found really interesting things like "denormalized mantissa". While realmin is about e-308, by denormalizing the mantissa MATLAB can represent a smallest real number of about e-324. Further, MATLAB holds many more digits for "pi" or "exp(1)" than C++ does.
On the other hand, the MATLAB help says that whatever format it displays, MATLAB uses double precision internally.
So, I'd really appreciate it if someone could explain the exact reason for these differences. How can we make equivalence tests between MATLAB and C++?

There is one thing to know about floating-point numbers on x86 CPUs. Internally, the floating point unit uses registers that are 10 bytes, i.e. 80 bits, wide. Furthermore, the CPU has a setting that tells whether floating point calculations should be made with 32 bits (float), 64 bits (double) or 80 bits of precision. Less precision means faster floating point operations. (The 32-bit mode used to be popular for video games, where speed takes priority over precision.)
Because of this, I remember tracking a bug in a calculation library (DLL) that, given the same input, did not give the same result depending on whether it was started from a test C++ executable or from MATLAB. Furthermore, this did not happen in Debug mode, only in Release!
The final conclusion was that MATLAB set the CPU floating point precision to 80 bits, whereas our test executable did not (and left the default 64-bit precision). The calculation mismatch did not happen in Debug mode because all the variables were written to memory as 64-bit double variables and reloaded from there afterward, discarding the additional 16 bits. In Release mode, some variables were optimized out (not written to memory), and all calculations were done in floating point registers only, on 80 bits, keeping the additional 16 bits non-zero.
Don't know if this helps, but it may be worth knowing.

A similar discussion occurred before; the conclusion was that IEEE 754 tolerates an error in the last bit for transcendental functions (cos, sin, exp, etc.). So you can't expect exactly the same results between MATLAB and C (not even from the same C code compiled with different compilers).

I may be way off track here and you may already have investigated this possibility but it could be possible that there are differences between C++ and Matlab in the way that the mathematical library functions (sin() cos() and exp() that you mention) are implemented internally. Ultimately, some kind of functional approximation must be being used to generate function values and if there is some difference between these methods then presumably it is possible that this manifests itself in the form of numerical rounding error over a large number of iterations.

If two languages follow IEEE 754, will calculations in both languages result in the same answers?

I'm in the process of converting a program from Scilab code to C++. One loop in particular is producing a slightly different result than the original Scilab code (it's a long piece of code so I'm not going to include it in the question but I'll try my best to summarise the issue below).
The problem is, each step of the loop uses calculations from the previous step. Additionally, the difference between calculations only becomes apparent around the 100,000th iteration (out of approximately 300,000).
Note: I'm comparing the output of my C++ program with the outputs of Scilab 5.5.2 using the "format(25);" command. Meaning I'm comparing 25 significant digits. I'd also like to point out I understand how precision cannot be guaranteed after a certain number of bits but read the sections below before commenting. So far, all calculations have been identical up to 25 digits between the two languages.
In attempts to get to the bottom of this issue, so far I've tried:
Examining the data type being used:
I've managed to confirm that Scilab is using IEEE 754 doubles (according to the language documentation). Also, according to Wikipedia, C++ isn't required to use IEEE 754 for doubles, but from what I can tell, everywhere I use a double in C++ it has perfectly matched Scilab's results.
Examining the use of transcendental functions:
I've also read from What Every Computer Scientist Should Know About Floating-Point Arithmetic that IEEE does not require transcendental functions to be exactly rounded. With that in mind, I've compared the results of these functions (sin(), cos(), exp()) in both languages and again, the results appear to be the same (up to 25 digits).
The use of other functions and predefined values:
I repeated the above steps for the use of sqrt() and pow(). As well as the value of Pi (I'm using M_PI in C++ and %pi in Scilab). Again, the results were the same.
Lastly, I've rewritten the loop (very carefully) in order to ensure that the code is identical between the two languages.
Note: Interestingly, I noticed that for all the above calculations the results of the two languages agree with each other to more digits than they agree with the exact result (computed outside of floating point arithmetic). For example:
Value of sin(x) using Wolfram Alpha = 0.123456789.....
Value of sin(x) using Scilab & C++ = 0.12345yyyyy.....
Even once the value computed using Scilab or C++ started to differ from the exact result (from Wolfram), the two languages' results still matched each other. This leads me to believe that most of the values are being calculated in the same way in the two languages, even though IEEE 754 does not require that.
My original thinking was that one of the first three points above is implemented differently in the two languages. But from what I can tell everything seems to produce identical results.
Is it possible that even though all the inputs to these loops are identical, the results can be different? Possibly because a very small error (past what I can see with 25 digits) is occurring that accumulates over time? If so, how can I go about fixing this issue?
No, the format of the numbering system does not guarantee equivalent answers from functions in different languages.
Functions, such as sin(x), can be implemented in different ways, using the same language (as well as different languages). The sin(x) function is an excellent example. Many implementations will use a look-up table or look-up table with interpolation. This has speed advantages. However, some implementations may use a Taylor Series to evaluate the function. Some implementations may use polynomials to come up with a close approximation.
Having the same numeric format is one hurdle to solve between languages. Function implementation is another.
Remember, you need to consider the platform as well. A program that uses an 80-bit floating point processor will have different results than a program that uses a 64-bit floating point software implementation.
Some architectures provide the capability of using extended precision floating point registers (e.g. 80 bits internally, versus 64-bit values in RAM). So, it's possible to get slightly different results for the same calculation, depending on how the computations are structured, and the optimization level used to compile the code.
Yes, it's possible to have different results. It's possible even if you are using exactly the same source code in the same programming language for the same platform. Sometimes it's enough to have a different compiler switch; for example, -ffast-math would lead the compiler to optimize your code for speed rather than accuracy, and, if your computational problem is not well-conditioned to begin with, the result may be significantly different.
For example, suppose you have this code:
x_8th = x*x*x*x*x*x*x*x;
One way to compute this is to perform 7 multiplications. This would be the default behavior for most compilers. However, you may want to speed this up by specifying the compiler option -ffast-math, and the resulting code would have only 3 multiplications:
temp1 = x*x; temp2 = temp1*temp1; x_8th = temp2*temp2;
The result would be slightly different because finite precision arithmetic is not associative, but sufficiently close for most applications and much faster. However, if your computation is not well-conditioned that small error can quickly get amplified into a large one.
Note that it is possible that the Scilab and C++ are not using the exact same instruction sequence, or that one uses FPU and the other uses SSE, so there may not be a way to get them to be exactly the same.
As commented by IInspectable, if your compiler has _control87() or something similar, you can use it to change the precision and/or rounding settings. You could try combinations of these to see if they have any effect, but again, even if you manage to get the settings identical for Scilab and C++, differences in the actual instruction sequences may be the issue.
If SSE is used, I'm not sure what can be adjusted, as I don't think SSE has an 80-bit precision mode.
In the case of using FPU in 32 bit mode, and if your compiler doesn't have something like _control87, you could use assembly code. If inline assembly is not allowed, you would need to call an assembly function. This example is from an old test program:
static short fcw; /* 16 bit floating point control word */
/* ... */
/* set precision control to extended precision */
fnstcw fcw
or fcw,0300h
fldcw fcw

`std::sin` is wrong in the last bit

I am porting some program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**).
I am facing different results for this operation:
std::sin(0.497418836818383950) = 0.477158760259608410 (C++)
sin(0.497418836818383950) = 0.47715876025960846000 (Matlab)
N[Sin[0.497418836818383950], 20] = 0.477158760259608433 (Mathematica)
So, as far as I know, both C++ and Matlab are using IEEE754 defined double arithmetic. I think I have read somewhere that IEEE754 allows different results in the last bit. Using Mathematica to decide, it seems like C++ is closer to the result. How can I force Matlab to compute the sin with precision to the last bit included, so that the results are the same?
In my program this behaviour leads to big errors because the numerical differential equation solver keeps increasing this error in the last bit. However I am not sure that C++ ported version is correct. I am guessing that even if the IEEE754 allows the last bit to be different, somehow guarantees that this error does not get bigger when using the result in more IEEE754 defined double operations (because otherwise, two different programs correct according to the IEEE754 standard could produce completely different outputs). So the other question is Am I right about this?
I would like get an answer to both bolded questions. Edit: The first question is being quite controversial, but is the less important, can someone comment about the second one?
Note: This is not an error in the printing, just in case you want to check, this is how I obtained these results:
Note (**): What I mean by this is that the final output, which is the result of some calculations showing some real numbers with 4 decimal places, needs to be exactly the same. The error I talk about in the question gets bigger (because of more operations, each of which is different in Matlab and in C++), so the final differences are huge. (If you are curious enough to see how the difference starts getting bigger, here is the full output [link soon], but this has nothing to do with the question.)
Firstly, if your numerical method depends on the accuracy of sin to the last bit, then you probably need to use an arbitrary precision library, such as MPFR.
The IEEE754 2008 standard doesn't require that the functions be correctly rounded (it does "recommend" it though). Some C libms do provide correctly rounded trigonometric functions: I believe that the glibc libm does (typically used on most linux distributions), as does CRlibm. Most other modern libms will provide trig functions that are within 1 ulp (i.e. one of the two floating point values either side of the true value), often termed faithfully rounded, which is much quicker to compute.
None of those values you printed could actually arise as IEEE 64bit floating point values (even if rounded): the 3 nearest (printed to full precision) are:
0.477158760259608 405451814405751065351068973541259765625
0.477158760259608 46096296563700889237225055694580078125
0.477158760259608 516474116868266719393432140350341796875
The possible values you could want are:
1. The exact sin of the decimal .497418836818383950, which is
0.477158760259608 433132061388630377105954125778369485736356219...
(this appears to be what Mathematica gives).
2. The exact sin of the 64-bit float nearest .497418836818383950:
0.477158760259608 430531153841011107415427334794384396325832953...
In both cases, the first of the above list is the nearest (though only barely in the case of 1).
The sine of the double constant you wrote is about 0x1.e89c4e59427b173a8753edbcb95p-2, whose nearest double is 0x1.e89c4e59427b1p-2. To 20 decimal places, the two closest doubles are 0.47715876025960840545 and 0.47715876025960846096.
Perhaps Matlab is displaying a truncated value? (EDIT: I now see that the fourth-last digit is a 6, not a 0. Matlab is giving you a result that's still faithfully rounded, but it's the farther of the two closest doubles to the desired result. And it's still printing out the wrong number.)
I should also point out that Mathematica is probably trying to solve a different problem---compute the sine of the decimal number 0.497418836818383950 to 20 decimal places. You should not expect this to match either the C++ code's result or Matlab's result.

Different values for sin on different CPUs

I'm programming an application to calculate some geometrical transformations, and while I was testing the program, I found something strange: I launched a test on two different machines, a Z400 workstation with an Intel® Xeon® Processor W3550 and a Z800 workstation with an Intel® Xeon® Processor X5560, and I got different results for one operation:
double x = 24.169408798217777 * sin(0.59420877837561048) / sin(0.97658754841928608)
With the Z400 I got x=16.330508228047432
While the Z800 throws this value x=16.330508228047435
The value differs in the last digit, and I make a lot of calculations with that value, so this is inconvenient.
I tried using sinl in order to get more precision, but on each workstation I got the same value as before. What's wrong with it? How can I fix it?
The results of the 2 calculations differ by 1 decimal digit, as you noted, and by 1 binary digit, as shown below. The Z400 is closer to the correct answer. A sin() calculation is not obliged to be accurate to the last binary bit, but to within 1 bit. Good sin() implementations are correct to the last bit**. Your Z800 is not as good.
printf("%a\n", 16.330508228047432);
printf("%a\n", 16.330508228047435);
printf("%a\n", 24.169408798217777 * sin(0.59420877837561048) / sin(0.97658754841928608)); // my PC eclipse
** Accuracy requirements are not a C++ requirement so much as an IEEE floating point one. Trig functions should be accurate within 1 ulp (unit in the last place). Good trig libraries are accurate within 0.5 ulp, the best possible answer.
As a guess, I'm going to suggest that this is related to certain processors calculating floating point values in 80 bit registers (instead of 64 bit) and only reducing the precision as late as possible.
On GCC, you can disable this (which will cause all your math to be done in 64 bits, and also potentially cause it to be slightly slower) with -ffloat-store.
There are some additional suggestion on this answer that may be able to help as well if the 80 bit register is the actual problem.

Signed zero linux vs windows

I am running a program in C++ on Windows and on Linux.
The output is meant to be identical.
I am trying to make sure that the only differences are real differences, as opposed to working-environment differences.
So far I have taken care of all the differences that can be caused by \r\n differences,
but there is one thing that I can't seem to figure out.
In the Windows output there is a 0.000, and in Linux it is -0.000.
Does anyone know what could be causing the difference?
Probably it comes from differences in how the optimizer optimizes some FP calculations (that can be configurable - see e.g. here); in one case you get a value slightly less than 0, in the other slightly more. Both in output are rounded to a 0.000, but they keep their "real" sign.
Since in the IEEE floating point format the sign bit is separate from the value, you have two different values of 0, a positive and a negative one. In most cases it doesn't make a difference; both zeros will compare equal, and they indeed describe the same mathematical value (mathematically, 0 and -0 are the same). Where the difference can be significant is when you have underflow and need to know whether the underflow occurred from a positive or from a negative value. Also, if you divide by 0, the sign of the infinity you get depends on the sign of the 0 (i.e. 1/+0.0 gives +Inf, but 1/-0.0 gives -Inf). In other words, most probably it won't make a difference for you.
Note however that the different output does not necessarily mean that the number itself is different. It could well be that the value in Windows is also -0.0, but the output routine on Windows doesn't distinguish between +0.0 and -0.0 (they compare equal, after all).
Unless using (unsafe) flags like -ffast-math, the compiler is limited in the assumptions it can make when 'optimizing' IEEE-754 arithmetic. First check that both platforms are using the same rounding.
Also, if possible, check they are using the same floating-point unit. i.e., SSE vs FPU on x86. The latter might be an issue with math library function implementations - especially trigonometric / transcendental functions.

Floating point versus fixed point: what are the pros/cons?

Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in the exponent sense, but not if one wants to store numbers with more precision within a certain range, for example using only integers from -16 to 16 and thus leaving more bits to hold digits past the radix.
In terms of performances, which one has the best performance, or are there cases where some is faster than the other ?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed type ?
Why isn't there any fixed type in C/C++ ?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64 bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64 bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32 bit mantissa, scaled by 2^4 (now there is no fractional part)
32 bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient so these APIs allow specifying coordinates (in texture space, color space, etc) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32-bit before renormalizing (or saturating) back to 16-bit fixed-point.
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than FPU arithmetic. When vectorization is used, the efficiency of vectorization matters a lot more, such that the performance differences between fixed-point and floating-point are moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions by using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example, answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed point type in C or C++, I imagine because they (or at least C) were envisaged as systems level languages and the need for fixed point is somewhat domain specific, and also perhaps because on a general purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However there are good and bad solutions to this problem. A good example can be found here: The link to the code in that article is broken, but I tracked it down to
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Int math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address calculation instructions to combine an add and a multiply in a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed point math. You just define a struct with one int and overload the appropriate operators, making them do what they normally do plus a shift to put the radix point back in the right position.
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen as 0.0 to 1.0, the depth of the world as 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400. Project the ray from the eye to the point on the object in the world and compute where it pierces the screen, using 0 to 1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa separately, and the mantissa is a goofy number, what is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating point math at its core uses fixed point logic with extra logic and extra steps required. By definition, compared apples to apples, fixed point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its garbage, that adds even more logic, more clock cycles, etc. (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. Both with float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it highly depends on the specific hardware architecture, an FPU usually consumes much more energy than the ALU in a CPU, so if you target mobile applications where power consumption is important, it's worth considering a fixed point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the decimal place (which may not always be good enough). In floating point you don't need to worry about this as the precision offered is nearly always good enough for the task in hand - uses a standard form implementation to represent the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired to be optimised for floating point operations. Furthermore, most modern CPUs come with instruction sets such as the SIMD sets, which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8bit and 16bit) you may not have an FPU nor extended instruction sets. In which case you may be forced to use fixed point methods or the limited floating point instruction sets that are not very fast. So in these circumstances fixed point will be a better - or even your only - choice.