Float operation difference in C vs C++ [closed] - c++

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 3 years ago.
Solution
Thanks to @Michael Veksler's answer, I was put on the right track to search for a solution. @Christoph, in this post, suggests trying different compiler flags to set the precision of the floating-point operations.
For me, the -mpc32 flag solved the problem.
I have to translate C++ code to C code as the new target won't have a C++ compiler.
I am running into a strange thing, where a mathematical equation gives different results when run in a C program compared to when run in a C++ program.
Equation:
float result = (float)(a+b-c+d+e);
The elements of the equation are all floats. I check the contents of the memory of each element by using
printf("a : 0x%02X%02X%02X%02X\n",((unsigned char*)&a)[0],((unsigned char*)&a)[1],((unsigned char*)&a)[2],((unsigned char*)&a)[3]);
Both in C and in C++, a b c d and e are equal, but the results are different.
Sample of a calculation in C:
a : 0x1D9969BB
b : 0x6CEDC83E
c : 0xAC89452F
d : 0xD2DC92B3
e : 0x4FE9F23C
result : 0xCD48D63E
And a sample in C++:
a : 0x1D9969BB
b : 0x6CEDC83E
c : 0xAC89452F
d : 0xD2DC92B3
e : 0x4FE9F23C
result : 0xCC48D63E
When I separate the equation in smaller parts, as in r = a + b then r = r - c and so on, the results are equal.
I have a 64-bit Windows machine.
Can someone explain why this happens?
I am sorry for this noob question, I am just starting out.
EDIT
I use the latest version of MinGW with options
-O0 -g3 -Wall -c -fmessage-length=0
EDIT 2
Sorry for the long time...
Here are the values corresponding to the above hex ones in C:
a : -0.003564424114301801
b : 0.392436385154724120
c : 0.000000000179659565
d : -0.000000068388217755
e : 0.029652265831828117
r : 0.418524175882339480
And here are for C++:
a : -0.003564424114301801
b : 0.392436385154724120
c : 0.000000000179659565
d : -0.000000068388217755
e : 0.029652265831828117
r : 0.418524146080017090
They are printed like printf("a : %.18f\n",a);
The values are not known at compile time, the equation is in a function called multiple times throughout the execution. The elements of the equation are computed inside the function.
Also, I observed a strange thing: I ran the exact equation in a new "pure" project (for both C and C++), i.e. only the main itself. The values of the elements are the same as the ones above (in float). The result is r : 0xD148D63E for both. The same as in @geza's comment.

Introduction: Given that the question is not detailed enough, I am left to speculate that this is GCC's infamous bug 323. As the low bug ID suggests, this bug has been there forever. The bug report has existed since June 2000, currently has 94 (!) duplicates, and the last one was reported only half a year ago (on 2018-08-28). The bug affects only 32-bit executables on Intel computers (like cygwin). I assume that OP's code uses x87 floating-point instructions, which are the default for 32-bit executables, while SSE instructions are only optional. Since 64-bit executables are more prevalent than 32-bit ones, and no longer depend on x87 instructions, this bug has zero chance of ever being fixed.
Bug description: The x87 architecture has 80 bit floating point registers. The float requires only 32 bits. The bug is that x87 floating point operations are always done with 80 bits accuracy (subject to hardware configuration flag). This extra accuracy makes precision very flaky, because it depends on when the registers are being spilled (written) to memory.
If an 80-bit register is spilled into a 32-bit variable in memory, then the extra precision is lost. This would be the correct behavior if it happened after each floating-point operation (since float is supposed to be 32 bits). However, spilling to memory slows things down and no compiler writer wants the executable to run slow. So by default the values are not spilled to memory.
Now, sometimes the value is spilled to memory and sometimes it is not. It depends on optimization level, on compiler heuristics, and on other seemingly random factors. Even with -O0 there could be slightly different strategies for dealing with spilling the x87 registers to memory, resulting in slightly different results. The strategy of spilling is probably the difference between your C and C++ compilers that you experience.
Work around:
For ways to handle this, please read c handling of excess precision. Try running your compiler with -fexcess-precision=standard and compare it with -fexcess-precision=fast. You can also try playing with -mfpmath=sse.
NOTE: According to the C++ standard this is not really a bug. However, it is a bug according to the documentation of GCC, which claims to follow the IEEE-754 FP standard on Intel architectures (like it does on many other architectures). Obviously bug 323 violates the IEEE-754 standard.
NOTE 2: At some optimization levels (for example -Ofast) -ffast-math is enabled, and all bets are off regarding extra precision and evaluation order.
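To see which evaluation mode a particular build actually uses, the standard macro FLT_EVAL_METHOD from <cfloat> can be inspected. The following is only a minimal sketch; the function name describe_eval_method is made up for illustration, and the value reported depends on the compiler and flags (for instance -mfpmath=sse versus x87 code generation).

```cpp
#include <cfloat>   // FLT_EVAL_METHOD
#include <string>

// Describes how floating-point expressions are evaluated in this build.
// FLT_EVAL_METHOD is a standard macro (C99 / C++11):
//    0 -> operations are evaluated in the precision of their own type
//    1 -> float and double are evaluated as double
//    2 -> everything is evaluated as long double (typical for x87 code)
//   -1 -> indeterminable
// describe_eval_method is a hypothetical helper name, not part of any library.
std::string describe_eval_method()
{
    switch (FLT_EVAL_METHOD) {
    case 0:  return "each type evaluated in its own precision";
    case 1:  return "float and double evaluated as double";
    case 2:  return "all types evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}
```

Printing the returned string once in both the C++ build and the C build (the macro is available in C through <float.h>) is one way to check whether the two compilers are set up to evaluate float expressions differently.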
EDIT I have simulated the described behavior on an intel 64-bit system, and got the same results as the OP. Here is the code:
int main()
{
float a = hex2float(0x1D9969BB);
float b = hex2float(0x6CEDC83E);
float c = hex2float(0xAC89452F);
float d = hex2float(0xD2DC92B3);
float e = hex2float(0x4FE9F23C);
float result = (float)((double)a+b-c+d+e);
print("result", result);
result = flush(flush(flush(flush(a+b)-c)+d)+e);
print("result2", result);
}
The implementations of the support functions (they need <stdint.h>, <stdio.h> and <string.h>, and must be declared before main):
float hex2float(uint32_t num)
{
uint32_t rev = (num >> 24) | ((num >> 8) & 0xff00) | ((num << 8) & 0xff0000) | (num << 24);
float f;
memcpy(&f, &rev, 4);
return f;
}
void print(const char* label, float val)
{
printf("%10s (%13.10f) : 0x%02X%02X%02X%02X\n", label, val, ((unsigned char*)&val)[0],((unsigned char*)&val)[1],((unsigned char*)&val)[2],((unsigned char*)&val)[3]);
}
float flush(float x)
{
volatile float buf = x;
return buf;
}
After running this I have got exactly the same difference between the results:
result ( 0.4185241461) : 0xCC48D63E
result2 ( 0.4185241759) : 0xCD48D63E
For some reason this is different than the "pure" version described at the question. At one point I was also getting the same results as the "pure" version, but since then the question has changed. The original values in the original question were different. They were:
float a = hex2float(0x1D9969BB);
float b = hex2float(0x6CEDC83E);
float c = hex2float(0xD2DC92B3);
float d = hex2float(0xA61FD930);
float e = hex2float(0x4FE9F23C);
and with these values the resulting output is:
result ( 0.4185242951) : 0xD148D63E
result2 ( 0.4185242951) : 0xD148D63E

The C and C++ standards both permit floating-point expressions to be evaluated with more precision than the nominal type. Thus, a+b-c+d+e may be evaluated using double even though the types are float, and the compiler may optimize the expression in other ways. In particular, using exact mathematics is essentially using an infinite amount of precision, so the compiler is free to optimize or otherwise rearrange the expression based on mathematical properties rather than floating-point arithmetic properties.
It appears, for whatever reason, your compiler is choosing to use this liberty to evaluate the expression differently in different circumstances (which may be related to the language being compiled or due to other variations between your C and C++ code). One may be evaluating (((a+b)-c)+d)+e while the other does (((a+b)+d)+e)-c, or other variations.
In both languages, the compiler is required to “discard” the excess precision when a cast or assignment is performed. So you can compel a certain evaluation by inserting casts or assignments. Casts would make a mess of the expression, so assignments may be easier to read:
float t0 = a+b;
float t1 = t0-c;
float t2 = t1+d;
float result = t2+e;

Related

c++ (double)0.700 * int(1000) => 699 (Not the double precision issue)

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have tried different typecasts of scaledvalue2, but not until I stored the multiplication in a double variable and then in an int could I get the desired result, and I can't explain why.
I know double precision (0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not.
I would expect both to fail if precision is a problem.
I DON'T NEED a solution to fix it, just a WHY.
(the problem IS fixed)
#include <cstdio>
#include <sstream>
int main()
{
double value = 0.7;
int scaleFactor = 1000;
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
std::ostringstream oss;
oss << scaledvalue2;
printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
value,scaleFactor,doubleScaled,scaledvalue1,scaledvalue2,scaledvalue3,oss.str().c_str());
}
or in short:
value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
I can't figure out what is going wrong here.
Output :
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz
This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.
Since x86 floating-point unit performs its computations in extended precision floating point type (80 bits wide), the result might easily depend on whether the intermediate values were forcefully converted to double (64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.
In your example, the first part involves saving the intermediate result in a double variable
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.
The second part
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.
The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.
I converted mindriot's example assembly code to Intel syntax to test with Visual Studio. I could only reproduce the error by setting the floating point control word to use extended precision.
The issue is that rounding is performed when converting from extended precision to double precision when storing a double, versus truncation is performed when converting from extended precision to integer when storing an integer.
The extended precision multiply produces a product of 699.999..., but the product is rounded to 700.000... during the conversion from extended to double precision when the product is stored into doubleScaled.
double doubleScaled = (double)scaleFactor * value;
Since doubleScaled == 700.000..., when truncated to integer, it's still 700:
int scaledvalue1 = doubleScaled; // = 700
The product 699.999... is truncated when it's converted into an integer:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
My guess here is that the compiler generated a compile-time constant of 700.000... rather than doing the multiply at run time.
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
This truncation issue can be avoided by using the round() function from the C standard library.
int scaledvalue2 = (int)round(scaleFactor * value); // should == 700
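The difference between truncating and rounding the product can be sketched as two small helpers. The names scale_truncate and scale_round are invented for this illustration, and std::lround is used here in place of round() only because it returns an integer type directly.

```cpp
#include <cmath>    // std::lround

// Truncating conversion: a product that lands just below a whole number
// is cut down to the next lower integer.
// (scale_truncate is a hypothetical helper name.)
int scale_truncate(double value, int factor)
{
    return static_cast<int>(factor * value);
}

// Rounding conversion: the product is rounded to the nearest integer
// before it is converted.
// (scale_round is a hypothetical helper name.)
int scale_round(double value, int factor)
{
    return static_cast<int>(std::lround(factor * value));
}
```

In a build where the product is kept in x87 extended precision, scale_truncate(0.7, 1000) may return 699, whereas scale_round(0.7, 1000) returns 700 either way.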
Depending on compiler and optimization flags, scaledvalue_a, involving a variable, may be evaluated at runtime using your processor's floating-point instructions, whereas scaledvalue_b, involving constants only, may be evaluated at compile time using a math library (e.g. gcc uses GMP, the GNU Multiple Precision math library, for this). The difference you are seeing seems to be the difference between the precision and rounding of the runtime vs compile time evaluation of that expression.
Due to rounding errors, most floating-point numbers end up being slightly imprecise.
For the double-to-int conversion below, use the std::ceil() API:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??

How to catch undefined behaviour without executing it?

In my software I am using the input values from the user at run time and performing some mathematical operations. Consider for simplicity below example:
int multiply(const int a, const int b)
{
if(a >= INT_MAX || b >= INT_MAX)
return 0;
else
return a*b;
}
I can check if the input values are greater than the limits, but how do I check if the result will be out of limits? It is quite possible that a = INT_MAX - 1 and b = 2. Since the inputs are perfectly valid, it will execute the undefined code which makes my program meaningless. This means any code executed after this will be random and eventually may result in crash. So how do I protect my program in such cases?
This really comes down to what you actually want to do in this case.
For a machine where long or long long (or int64_t) is a 64-bit value, and int is a 32-bit value, you could do (I'm assuming long is 64 bit here):
long x = static_cast<long>(a) * b;
if (x > INT_MAX || x < INT_MIN)
return 0;
else
return static_cast<int>(x);
By casting one value to long, the other will have to be converted as well. You can cast both if that makes you happier. The overhead here, above a normal 32-bit multiply, is a couple of clock cycles on modern CPUs, and it's unlikely that you can find a safer solution that is also faster. [You can, in some compilers, add attributes to the if saying that the branch is unlikely, to encourage branch prediction "to get it right" for the common case of returning x]
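A slightly more portable variant of the widening approach is sketched below. It uses std::int64_t instead of long, since long is only 32 bits wide on some platforms (64-bit Windows among them). The function name checked_multiply and the bool-plus-output-parameter interface are choices made for this example only.

```cpp
#include <cstdint>  // std::int64_t
#include <limits>   // std::numeric_limits

static_assert(sizeof(std::int64_t) > sizeof(int),
              "this sketch assumes int is narrower than 64 bits");

// Multiplies a and b in 64-bit arithmetic and reports whether the product
// fits in int. On success the product is stored in `out` and true is returned.
// (checked_multiply is a hypothetical helper name.)
bool checked_multiply(int a, int b, int& out)
{
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    if (wide > std::numeric_limits<int>::max() ||
        wide < std::numeric_limits<int>::min()) {
        return false;   // the product does not fit in int
    }
    out = static_cast<int>(wide);
    return true;
}
```

Returning a flag rather than 0 lets the caller distinguish an overflow from a genuine product of zero.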
Obviously, this won't work for values where the type is as big as the biggest integer you can deal with (although you could possibly use floating point, but it may still be a bit dodgy, since the precision of float is not sufficient; it could be done using some "safety margin", e.g. comparing to less than LONG_MAX / 2, if you don't need the entire range of integers). The penalty here is a bit worse, though, especially as transitions between float and integer aren't "pleasant".
Another alternative is to actually test the relevant code with "known invalid values" and check that the rest of the code is "ok" with the results. Make sure you test this with the relevant compiler settings, as changing the compiler options will change the behaviour. Note that your code then has to deal with "what do we do when 65536 * 100000 is a negative number" even though it didn't expect that. Perhaps add something like:
int x = a * b;
if (x < 0) return 0;
[But this only works if you don't expect negative results, of course]
You could also inspect the assembly code generated and understand the architecture of the actual processor [the key here is to understand if "overflow will trap" - which it won't by default in x86, ARM, 68K, 29K. I think MIPS has an option of "trap on overflow"], and determine whether it's likely to cause a problem [1], and add something like
#if !(defined(__X86__) || defined(__ARM__))
#error This code needs inspecting for correct behaviour
#endif
return a * b;
One problem with this approach, however, is that even the slightest changes in code, or compiler version may alter the outcome, so it's important to couple this with the testing approach above (and make sure you test the ACTUAL production code, not some hacked up mini-example).
[1] The "undefined behaviour" is undefined to allow C to "work" on processors that have trapping overflows of integer math, and also because the result of a * b when it overflows in a signed value is of course hard to determine unless you have a defined math system (two's complement, one's complement, distinct sign bit). So to avoid "defining" the exact behaviour in these cases, the C standard says "It's undefined". It doesn't mean that it will definitely go bad.
Specifically for the multiplication of a by b the mathematically correct way to detect if it will overflow is to calculate log₂ of both values. If their sum is higher than the log₂ of the highest representable value of the result, then there is overflow.
log₂(a) + log₂(b) < log₂(UINT_MAX)
The difficulty is to calculate quickly the log₂ of an integer. For that, there are several bit twiddling hacks that can be used, like counting bit, counting leading zeros (some processors even have instructions for that). This site has several implementations
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
The simplest implementation could be:
unsigned int log2(unsigned int v)
{
unsigned int r = 0;
while (v >>= 1)
r++;
return r;
}
In your program you only need to check then
if(log2(a) + log2(b) < MYLOG2UINTMAX)
return a*b;
else
printf("Overflow");
The signed case is similar but has to take care of the negative case specifically.
EDIT: My solution is not complete and has an error which makes the test more severe than necessary. The equation works in reality if the log₂ function returns a floating-point value. In the implementation I limited the value to unsigned integers. This means that completely valid multiplications get refused. Why? Because log2(UINT_MAX) is truncated:
log₂(UINT_MAX)=log₂(4294967295)≈31.9999999997 truncated to 31.
We therefore have to change the implementation and replace the constant to compare to:
#define MYLOG2UINTMAX (CHAR_BIT*sizeof (unsigned int))
Note that with truncated integer logarithms this relaxed bound no longer rules out every overflow: a = 65535 and b = 131071 pass the test (15 + 16 < 32), yet their product 8589737985 exceeds UINT_MAX on a 32-bit unsigned int.
You may try this:
if ( b > ULONG_MAX / a ) // Need to check a != 0 before this division
return 0; //a*b invoke UB
else
return a*b;
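The division-based test, including the zero check mentioned in the comment, might look like the sketch below. It is written for unsigned long to match ULONG_MAX in the snippet above; the function name multiply_would_wrap is invented for the example.

```cpp
#include <limits>

// Returns true when a * b would not fit in unsigned long.
// The check never performs the multiplication itself.
// (multiply_would_wrap is a hypothetical helper name.)
bool multiply_would_wrap(unsigned long a, unsigned long b)
{
    if (a == 0) {
        return false;   // 0 * b is always representable; also avoids dividing by zero
    }
    return b > std::numeric_limits<unsigned long>::max() / a;
}
```

For signed operands the same idea applies, but the signs of both values have to be handled case by case before dividing.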

log base 2 precision error in c++

Please explain the output of the code given below. I am getting different values of c in the two cases, i.e.:
Case 1 : Value of n is taken from the standard input.
Case 2 : Value of n is directly written in the code.
link : http://www.ideone.com/UjYFQd
#include <iostream>
#include <cstdio>
#include <math.h>
using namespace std;
int main()
{
int c;
int n;
scanf("%d", &n); //n = 64
c = (log(n) / log(2));
cout << c << endl; //OUTPUT = 5
n = 64;
c = (log(n) / log(2));
cout << c << endl; //OUTPUT = 6
return 0;
}
You may see this because of how the floating point number is stored:
double result = log(n) / log(2); // where you input n as 64
int c = (int)result; // this will truncate result. If result is 5.99999999999999, you will get 5
When you hardcode the value, the compiler will optimize it for you:
double result = log(64) / log(2); // which is the same as 6 * log(2) / log(2)
int c = (int)result;
Will more than likely be replaced entirely with:
int c = 6;
Because the compiler will see that you are using a bunch of compile-time constants to store the value in a variable (it will go ahead and crunch the value at compile time).
If you want to get the integer result for the operation, you should use std::round instead of just casting to an int.
int c = std::round(log(n) / log(2));
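As a rough sketch of the two options, the rounded floating-point version can be placed next to an integer-only loop that never touches floating point at all. Both function names (log2_rounded, log2_floor) are invented for this example.

```cpp
#include <cmath>    // std::log, std::lround

// Floating-point version with rounding to the nearest integer.
// (log2_rounded is a hypothetical helper name.)
int log2_rounded(int n)
{
    return static_cast<int>(
        std::lround(std::log(static_cast<double>(n)) / std::log(2.0)));
}

// Integer-only version: floor(log2(n)) for n > 0, computed by counting shifts.
// (log2_floor is a hypothetical helper name.)
int log2_floor(unsigned int n)
{
    int result = 0;
    while (n >>= 1) {
        ++result;
    }
    return result;
}
```

The two agree for exact powers of two such as 64, but not in general: rounding gives 6 for n = 63, while the integer loop gives the floor, 5. Which one is wanted depends on what c is used for.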
The first time around, log(n)/log(2) is computed and the result is very close to 6 but slightly less. This is just how floating point computation works: neither log(64) nor log(2) has an infinitely precise representation in binary floating point, so you can expect the result of dividing one by the other to be slightly off from the true mathematical value. Depending on the implementation you can expect to get 5 or 6.
In the second computation:
n = 64;
c = (log(n) / log(2));
The value assigned to c can be inferred to be a compile-time constant and can be computed by the compiler. The compiler does the computation in a different environment than the program while it runs, so you can expect to get slightly different results from computations performed at compile-time and at runtime.
For example, a compiler generating code for x86 may choose to use x87 floating point instructions that use 80bit floating point arithmetic, while the compiler itself uses standard 64bit floating point arithmetic to compute compile-time constants.
Check the assembler output from your compiler to confirm this. Using GCC 4.8 I get 6 from both computations.
The difference in output can be explained by the fact that gcc is optimizing out the calls to log in the constant case, for example here:
n = 64;
c = (log(n) / log(2));
both calls to log are evaluated at compile time, and this compile-time evaluation can produce different results. This is documented in the gcc manual in the Other Built-in Functions Provided by GCC section, which says:
GCC includes built-in versions of many of the functions in the standard C library. The versions prefixed with __builtin_ are always treated as having the same meaning as the C library function even if you specify the -fno-builtin option. (see C Dialect Options) Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function is emitted.
and log is one of the many functions that have builtin versions. If I build using -fno-builtin, all four calls to log are made, but without it only one call to log is emitted. You can check this by building with the -S flag, which outputs the assembly that gcc generates.

Why was no intrinsic access to the CPU's status register in the design of both C and C++?

In the case of the overflow flag, it would seem that access to this flag would be a great boon to cross-architecture programming. It would provide a safe alternative to relying on undefined behaviour to check for signed integer overflow such as:
if(a < a + 100) //detect overflow
I do understand that there are safe alternatives such as:
if(a > (INT_MAX - 100)) //detected overflow
However, it would seem that access to the status register or the individual flags within it is missing from both the C and C++ languages. Why was this feature not included or what language design decisions were made that prohibited this feature from being included?
Because C and C++ are designed to be platform independent. Status register is not.
These days, two's complement is universally used to implement signed integer arithmetic, but it was not always the case. One's complement or sign and absolute value used to be quite common. And when C was first designed, such CPUs were still in common use. E.g. COBOL distinguishes negative and positive 0, which existed on those architectures. Obviously overflow behaviour on these architectures is completely different!
By the way, you can't rely on undefined behaviour for detecting overflow, because reasonable compilers upon seeing
if(a < a + 100)
will write a warning and compile
if(true)
... (provided optimizations are turned on and the particular optimization is not turned off).
And note, that you can't rely on the warning. The compiler will only emit the warning when the condition ends up true or false after equivalent transformations, but there are many cases where the condition will be modified in presence of overflow without ending up as plain true/false.
Because C++ is designed as a portable language, i.e. one that compiles on many CPUs (e.g. x86, ARM, LSI-11/2, with devices like Game Boys, Mobile Phones, Freezers, Airplanes, Human Manipulation Chips and Laser Swords).
the available flags across CPUs may largely differ
even within the same CPU, flags may differ (take x86 scalar vs. vector instructions)
some CPUs may not even have the flag you desire at all
The question has to be answered: should the compiler always deliver/enable that flag when it can't determine whether it is used at all? Doing so would not conform to the "pay only for what you use" unwritten but holy law of both C and C++.
Because compilers would have to be forbidden to optimize and e.g. reorder code to keep those flags valid
Example for the latter:
int x = 7;
x += z;
int y = 2;
y += z;
The optimizer may transform this to that pseudo assembly code:
alloc_stack_frame 2*sizeof(int)
load_int 7, $0
load_int 2, $1
add z, $0
add z, $1
which in turn would be more similar to
int x = 7;
int y = 2;
x += z;
y += z;
Now if you query registers in between
int x = 7;
x += z;
if (check_overflow($0)) {...}
int y = 2;
y += z;
then after optimizing and disassembling you might end up with this:
int x = 7;
int y = 2;
x += z;
y += z;
if (check_overflow($0)) {...}
which is then incorrect.
More examples could be constructed, like what happens with a constant-folding-compile-time-overflow.
Sidenotes: I remember an old Borland C++ compiler having a small API to read the current CPU registers. However, the argumentation above about optimization still applies.
On another sidenote: To check for overflow:
// desired expression: int z = x + y
would_overflow = x > MAX-y;
more concrete
auto would_overflow = x > std::numeric_limits<int>::max()-y;
or better, less concrete:
auto would_overflow = x > std::numeric_limits<decltype(x+y)>::max()-y;
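The expressions above only cover a non-negative y: for a negative y, MAX - y would itself overflow. A sketch that handles both signs could look like this; the function name add_would_overflow is made up for the example.

```cpp
#include <limits>

// Returns true when x + y would overflow int, without ever computing x + y.
// (add_would_overflow is a hypothetical helper name.)
bool add_would_overflow(int x, int y)
{
    const int max = std::numeric_limits<int>::max();
    const int min = std::numeric_limits<int>::min();
    if (y > 0) {
        return x > max - y;   // max - y cannot overflow because y > 0
    }
    return x < min - y;       // min - y cannot overflow because y <= 0
}
```

Splitting on the sign of y keeps every intermediate subtraction inside the range of int, so the check itself never relies on undefined behaviour.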
I can think of the following reasons.
By allowing access to the register-flags, portability of the language across platforms is severely limited.
The optimizer can change expressions drastically, and render your flags useless.
It would make the language more complex
Most compilers have a big set of intrinsic functions, to do most common operations (e.g. addition with carry) without resorting to flags.
Most expressions can be rewritten in a safe way to avoid overflows.
You can always fall back to inline assembly if you have very specific needs
Access to status registers does not seem needed enough, to go through a standardization-effort.
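As an illustration of the intrinsic-function point in the list above, GCC (version 5 and later) and Clang provide __builtin_add_overflow, which reports overflow without exposing the flag register. The wrapper below is only a sketch: the name add_checked, the version test in the preprocessor condition and the portable fallback are assumptions made for this example.

```cpp
#include <limits>

// Adds a and b. Returns true and stores the sum in *out when it fits in int,
// returns false when the addition would overflow.
// (add_checked is a hypothetical helper name.)
bool add_checked(int a, int b, int* out)
{
#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5)
    // GCC/Clang intrinsic: returns true when the addition overflowed.
    return !__builtin_add_overflow(a, b, out);
#else
    // Portable fallback: test against the limits before adding.
    if ((b > 0 && a > std::numeric_limits<int>::max() - b) ||
        (b < 0 && a < std::numeric_limits<int>::min() - b)) {
        return false;
    }
    *out = a + b;
    return true;
#endif
}
```

On targets that have an overflow flag, a compiler is free to implement the intrinsic with it, which gives most of the benefit asked for in the question while leaving the optimizer in control of instruction ordering.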

Another double type trick in C++?

First, I understand that the double type in C++ has been discussed lots of time, but I wasn't able to answer my question after searching. Any help or idea is highly appreciated.
The simplified version of my question is: I got three different results (a=-0.926909, a=-0.926947 and a=-0.926862) when I computed a=b-c+d with three different approaches and the same values of b, c and d, and I don't know which one to trust.
The detailed version of my question is:
I was recently writing a program (in C++ on Ubuntu 10.10) to handle some data. One function looks like this:
void calc() {
double a, b;
...
a = b - c + d; // c, d are global variables of double
...
}
When I was using GDB to debug the above code, during a call to calc(), I recorded the values of b, c and d before the statement a = b - c + d as follows:
b = 54.7231
c = 55.4051
d = -0.244947
After the statement a = b - c + d executed, I found that a=-0.926909 instead of -0.926947, which is what a calculator gives. Well, so far it is not quite confusing yet, as I guess this might just be a precision problem. Later on I re-implemented another version of calc() for some reason. Let's call this new version calc_new(). calc_new() is almost the same as calc(), except for how and where b, c and d are calculated:
void calc_new() {
double a, b;
...
a = b - c + d; // c, d are global variables of double
...
}
This time when I was debugging, the values of b, c and d before the statement a = b - c + d are the same as when calc() was debugged: b = 54.7231, c = 55.4051, d = -0.244947. However, this time after the statement a = b - c + d executed, I got a=-0.926862. That being said, I got three different a when I computed a = b - c + d with the same values of b, c and d. I think differences between a=-0.926862, a=-0.926909 and a=-0.926947 are not small, but I cannot figure out the cause. And which one is correct?
With Many Thanks,
Tom
If you expect the answer to be accurate in the 5th and 6th decimal place, you need to know exactly what the inputs to the calculation are in those places. You are seeing inputs with only 4 decimal places, you need to display their 5th and 6th place as well. Then I think you would see a comprehensible situation that matches your calculator to 6 decimal places. Double has more than sufficient precision for this job, there would only be precision problems here if you were taking the difference of two very similar numbers (you're not).
Edit: Unsurprisingly, increasing the display precision would have also shown you that calc() and calc_new() were supplying different inputs to the calculation. Credit to Mike Seymour and Dietmar Kuhl in the comments who were the first to see your actual problem.
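One way to display the inputs with enough digits is to format them with max_digits10 significant digits, which is sufficient for a double to survive a round trip through text. The helper name full_precision is invented for this sketch.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Formats a double with max_digits10 significant digits (17 for an IEEE-754
// double), enough that reading the text back yields exactly the same value.
// (full_precision is a hypothetical helper name.)
std::string full_precision(double value)
{
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
    return out.str();
}
```

Logging full_precision(b), full_precision(c) and full_precision(d) just before the statement in calc() and in calc_new() would show whether the two functions really receive identical inputs.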
Let me try to answer the question I suspect that you meant to ask. If I have mistaken your intent, then you can disregard the answer.
Suppose that I have the numbers u = 500.1 and v = 5.001, each to four significant digits of accuracy. What then is w = u + v? Answer, w = 505.101, but to four significant digits, it's w = 505.1.
Now consider x = w - u = 5.000, which should equal v, but doesn't quite.
If I only change the order of operations however, I can get x to equal v exactly, not by x = w - u or by x = (u + v) - u, but by x = v + (u - u).
Is that trivial? Yes, in my example, it is; but the same principle applies in your example, except that they aren't really decimal places but bits of precision.
In general, to maintain precision, if you have some floating-point numbers to sum, you should try to add the small ones together first, and only bring the larger ones into the sum later.
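The effect of operation order described above can be reproduced with doubles as well. The two functions below, with names invented for this sketch, add the same three values with different groupings.

```cpp
// Sums three doubles left to right: (a + b) + c.
// (sum_left is a hypothetical helper name.)
double sum_left(double a, double b, double c)
{
    const double ab = a + b;
    return ab + c;
}

// Sums the same values with the other grouping: a + (b + c).
// (sum_right is a hypothetical helper name.)
double sum_right(double a, double b, double c)
{
    const double bc = b + c;
    return a + bc;
}
```

With a = 1e16, b = -1e16 and c = 1.0, sum_left returns exactly 1.0. Assuming the arithmetic is carried out in IEEE-754 double precision (for example SSE2 code on x86-64), sum_right returns 0.0, because -1e16 + 1.0 rounds back to -1e16 before a is added. In a build that keeps intermediates in extended precision, the two results may coincide.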
We're discussing smoke here. If nothing changed in the environment, an expression like:
a = b + c + d
MUST ALWAYS RETURN THE SAME VALUE IF INPUTS AREN'T CHANGED.
No rounding errors. No esoteric pragmas, nothing at all.
If you check your bank account today and tomorrow (and nothing changed in that time) I suspect you'll go crazy if you see something different. We're speaking about programs, not random number generators!!!
The correct one is -0.926947.
The differences you see are far too large for rounding errors (even in single precision) as one can check in this encoder.
When using the encoder, you need to enter them like this: -55.926909 (to account for the potential effect of the operator commutativity effects nicely described in previously submitted answers.) Additionally, a difference in just the last significant bit may well be due to rounding effects, but you will not see any with your values.
When using the tool, 64bit format (Binary64) corresponds to your implementation's double type.
Rational numbers do not always have a terminating expansion in a given base. 1/3rd cannot be expressed in a finite number of digits in base ten. In base 2, rational numbers with a denominator that is a power of two will have a terminating expansion. The rest won't. So 1/2, 1/4, 3/8, 7/16.... any number that looks like x/(2^n) can be represented accurately. That turns out to be a fairly sparse subset of the infinite series of rational numbers. Everything else will be subject to the errors introduced by trying to represent an infinite number of binary digits within a finite container.
But addition is associative, right? Mathematically, yes. But when you start introducing rounding errors things change a little. With a = b + c + d as an example, let's say that d cannot be expressed in a finite number of binary digits. Neither can c. So adding them together will give us some inaccurate value, which itself may also be incapable of being represented in a finite number of binary digits. So error on top of error. Then we add that value to b, which may also not be a terminating expansion in binary. So taking one inaccurate result and adding it to another inaccurate number results in another inaccurate number. And because we're throwing away precision at every step, the grouping and order of the additions can change the result: floating-point addition is not associative.
There's a post I made (Perl-related, but it's a universal topic), Re: Shocking Imprecision (PerlMonks), and of course the canonical What Every Computer Scientist Should Know About Floating-Point Arithmetic, both of which discuss the topic. The latter is far more detailed.