log base 2 precision error in c++

Please explain the output of the code below. I'm getting different values of c in the two cases, i.e.,
Case 1: the value of n is taken from standard input.
Case 2: the value of n is written directly in the code.
link : http://www.ideone.com/UjYFQd
#include <iostream>
#include <cstdio>
#include <math.h>
using namespace std;

int main()
{
    int c;
    int n;
    scanf("%d", &n);        // n = 64
    c = (log(n) / log(2));
    cout << c << endl;      // OUTPUT = 5
    n = 64;
    c = (log(n) / log(2));
    cout << c << endl;      // OUTPUT = 6
    return 0;
}

You may see this because of how the floating-point number is stored:
double result = log(n) / log(2); // where you input n as 64
int c = (int)result; // this will truncate result. If result is 5.99999999999999, you will get 5
When you hardcode the value, the compiler will optimize it for you:
double result = log(64) / log(2); // which is the same as 6 * log(2) / log(2)
int c = (int)result;
This will more than likely be replaced entirely with:
int c = 6;
Because the compiler sees that the value is computed entirely from compile-time constants, it will go ahead and crunch the value at compile time.
If you want to get the integer result for the operation, you should use std::round instead of just casting to an int.
int c = std::round(log(n) / log(2));
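If n is known to be a power of two, you can sidestep floating point entirely; a minimal sketch using bit shifts:

#include <cstdio>

// Integer log2 for positive n: counts how many times n can be halved.
int ilog2(unsigned n)
{
    int c = 0;
    while (n >>= 1)
        ++c;
    return c;
}

int main()
{
    std::printf("%d\n", ilog2(64)); // prints 6, with no rounding concerns
    return 0;
}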

The first time around, log(n)/log(2) is computed and the result is very close to 6 but slightly less. This is just how floating point computation works: neither log(64) nor log(2) have an infinitely precise representation in binary floating point, so you can expect the result of dividing one by the other to be slightly off from the true mathematical value. Depending on the implementation you can expect to get 5 or 6.
In the second computation:
n = 64;
c = (log(n) / log(2));
The value assigned to c can be inferred to be a compile-time constant and can be computed by the compiler. The compiler does the computation in a different environment than the one the program runs in, so you can expect slightly different results from computations performed at compile time and at run time.
For example, a compiler generating code for x86 may choose to use x87 floating-point instructions that use 80-bit floating-point arithmetic, while the compiler itself uses standard 64-bit floating-point arithmetic to compute compile-time constants.
Check the assembler output from your compiler to confirm this. Using GCC 4.8 I get 6 from both computations.
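If you want to force the runtime path for comparison, marking the input volatile blocks constant folding; a minimal sketch:

#include <cmath>
#include <cstdio>

int main()
{
    volatile int n = 64;  // volatile forces the computation to happen at runtime
    double runtime = std::log((double)n) / std::log(2.0);
    double folded  = std::log(64.0) / std::log(2.0);  // eligible for compile-time folding
    std::printf("%d %d\n", (int)runtime, (int)folded);
    return 0;
}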

The difference in output can be explained by the fact that gcc optimizes out the calls to log in the constant cases. For example, in this case:
n = 64;
c = (log(n) / log(2));
both calls to log are done at compile time, and this compile-time evaluation can produce different results. This is documented in the gcc manual in the Other Built-in Functions Provided by GCC section, which says:
GCC includes built-in versions of many of the functions in the standard C library. The versions prefixed with __builtin_ are always treated as having the same meaning as the C library function even if you specify the -fno-builtin option. (see C Dialect Options) Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function is emitted.
and log is one of the many functions that have builtin versions. If I build with -fno-builtin, all four calls to log are made, but without it only one call to log is emitted. You can check this by building with the -S flag, which makes gcc output the generated assembly.
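For example, one way to compare the generated assembly with and without builtins (hypothetical file name; assuming gcc):

// Compile to assembly twice and compare how many log calls remain:
//   g++ -S -O1 log2.cpp -o with-builtin.s
//   g++ -S -O1 -fno-builtin log2.cpp -o no-builtin.s
//   grep -c "call.*log" with-builtin.s no-builtin.s
#include <cmath>
#include <cstdio>

int main()
{
    int n;
    std::scanf("%d", &n);
    std::printf("%d\n", (int)(std::log(n) / std::log(2)));       // runtime calls
    std::printf("%d\n", (int)(std::log(64.0) / std::log(2.0)));  // foldable
    return 0;
}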

Related

Compile time floating point division by zero in C++

It is well known that if one divides a floating-point number by zero at run time, the result will be either infinity or not-a-number (the latter if the dividend was also zero). But is it allowed to divide by zero in C++ constexpr expressions (at compile time), e.g.
#include <iostream>

int main() {
    double x = 0.;
    // run-time division: all compilers print "-nan"
    std::cout << 0./x << std::endl;
    // compile-time division, diverging results
    constexpr double y = 0.;
    std::cout << 0./y << std::endl;
}
In this program the first printed number is obtained from the run-time division, and all compilers are pretty consistent in printing -nan. (Side question: why not +nan?)
But in the second case the compilers diverge. MSVC simply stops compilation with the error:
error C2124: divide or mod by zero
GCC still prints -nan, while Clang changes the sign, printing a "positive" nan; demo: https://gcc.godbolt.org/z/eP744er8n
Does the language standard permit all three behaviors of the compilers for compile-time division: 1) reject the program, 2) produce the same division result as in run-time, 3) produce a different (in sign bit) result?
Division by zero is undefined behavior, so anything goes.
See section 8.5.5, paragraph 4 of the C++ working draft N4713:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4713.pdf
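Note that in the snippet above, 0./y is not itself required to be a constant expression, so the compiler is free to evaluate it at compile time or at run time. Forcing the division into a constant-expression context removes the divergence, because undefined behavior is not permitted inside a constant expression; a minimal sketch:

constexpr double y = 0.;
constexpr double z = 0. / y; // ill-formed: all three compilers reject this,
                             // since UB disqualifies a constant expression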

Float operation difference in C vs C++ [closed]

Solution
Thanks to @Michael Veksler's answer, I was put on the right track to search for a solution. @Christoph, in this post, suggests trying different compiler flags to set the precision of the floating-point operations.
For me, the -mpc32 flag solved the problem.
I have to translate C++ code to C code as the new target won't have a C++ compiler.
I am running into a strange thing, where a mathematical equation gives different results when run in a C program compared to when run in a C++ program.
Equation:
float result = (float)(a+b-c+d+e);
The elements of the equation are all floats. I check the contents of the memory of each element by using
printf("a : 0x%02X%02X%02X%02X\n",((unsigned char*)&a)[0],((unsigned char*)&a)[1],((unsigned char*)&a)[2],((unsigned char*)&a)[3]);
Both in C and in C++, a, b, c, d, and e are equal, but the results are different.
Sample of a calculation in C:
a : 0x1D9969BB
b : 0x6CEDC83E
c : 0xAC89452F
d : 0xD2DC92B3
e : 0x4FE9F23C
result : 0xCD48D63E
And a sample in C++:
a : 0x1D9969BB
b : 0x6CEDC83E
c : 0xAC89452F
d : 0xD2DC92B3
e : 0x4FE9F23C
result : 0xCC48D63E
When I separate the equation in smaller parts, as in r = a + b then r = r - c and so on, the results are equal.
I have a 64-bit Windows machine.
Can someone explain why this happens?
I am sorry for this noob question, I am just starting out.
EDIT
I use the latest version of MinGW with options
-O0 -g3 -Wall -c -fmessage-length=0
EDIT 2
Sorry for the long time...
Here are the values corresponding to the above hex ones in C:
a : -0.003564424114301801
b : 0.392436385154724120
c : 0.000000000179659565
d : -0.000000068388217755
e : 0.029652265831828117
r : 0.418524175882339480
And here are for C++:
a : -0.003564424114301801
b : 0.392436385154724120
c : 0.000000000179659565
d : -0.000000068388217755
e : 0.029652265831828117
r : 0.418524146080017090
They are printed like printf("a : %.18f\n",a);
The values are not known at compile time, the equation is in a function called multiple times throughout the execution. The elements of the equation are computed inside the function.
Also, I observed a strange thing: I ran the exact same equation in a new "pure" project (for both C and C++), i.e. only the main itself. The values of the elements are the same as the ones above (in float). The result is r : 0xD148D63E for both, the same as in @geza's comment.
Introduction: Given that the question is not detailed enough, I am left to speculate that this is gcc's infamous bug 323. As the low bug ID suggests, this bug has been there forever. The report has existed since June 2000, currently has 94 (!) duplicates, and the most recent was reported only half a year ago (on 2018-08-28). The bug affects only 32-bit executables on Intel computers (like cygwin). I assume that the OP's code uses x87 floating-point instructions, which are the default for 32-bit executables, while SSE instructions are only optional. Since 64-bit executables are more prevalent than 32-bit ones, and no longer depend on x87 instructions, this bug has zero chance of ever being fixed.
Bug description: The x87 architecture has 80-bit floating-point registers. A float requires only 32 bits. The bug is that x87 floating-point operations are always done with 80-bit accuracy (subject to a hardware configuration flag). This extra accuracy makes precision very flaky, because it depends on when the registers are spilled (written) to memory.
If an 80-bit register is spilled into a 32-bit variable in memory, the extra precision is lost. This would be the correct behavior if it happened after each floating-point operation (since float is supposed to be 32 bits), but spilling to memory slows things down, and no compiler writer wants the executable to run slowly. So by default the values are not spilled to memory.
Now, sometimes the value is spilled to memory and sometimes it is not. It depends on the optimization level, on compiler heuristics, and on other seemingly random factors. Even with -O0 there can be slightly different strategies for dealing with spilling the x87 registers to memory, resulting in slightly different results. The strategy of spilling is probably the difference between the C and C++ compilers that you are experiencing.
Workaround:
For ways to handle this, please read c handling of excess precision. Try running your compiler with -fexcess-precision=standard and compare it with -fexcess-precision=fast. You can also try playing with -mfpmath=sse.
NOTE: According to the C++ standard, this is not really a bug. However, it is a bug according to the documentation of GCC, which claims to follow the IEEE-754 FP standard on Intel architectures (as it does on many other architectures). Obviously bug 323 violates the IEEE-754 standard.
NOTE 2: At some optimization levels -ffast-math is invoked, and all bets are off regarding extra precision and evaluation order.
EDIT: I have simulated the described behavior on an Intel 64-bit system and got the same results as the OP. Here is the code:

#include <cstdint>
#include <cstring>
#include <cstdio>

float hex2float(uint32_t num);
void print(const char* label, float val);
float flush(float x);

int main()
{
    float a = hex2float(0x1D9969BB);
    float b = hex2float(0x6CEDC83E);
    float c = hex2float(0xAC89452F);
    float d = hex2float(0xD2DC92B3);
    float e = hex2float(0x4FE9F23C);
    float result = (float)((double)a+b-c+d+e);
    print("result", result);
    result = flush(flush(flush(flush(a+b)-c)+d)+e);
    print("result2", result);
}
The implementations of the support functions:

// Reinterprets the byte-reversed hex dump (as printed above) as a float.
float hex2float(uint32_t num)
{
    uint32_t rev = (num >> 24) | ((num >> 8) & 0xff00) | ((num << 8) & 0xff0000) | (num << 24);
    float f;
    memcpy(&f, &rev, 4);
    return f;
}

// Prints a value together with its raw bytes, in the OP's format.
void print(const char* label, float val)
{
    printf("%10s (%13.10f) : 0x%02X%02X%02X%02X\n", label, val,
           ((unsigned char*)&val)[0], ((unsigned char*)&val)[1],
           ((unsigned char*)&val)[2], ((unsigned char*)&val)[3]);
}

// Forces the value through a 32-bit memory slot, discarding any excess precision.
float flush(float x)
{
    volatile float buf = x;
    return buf;
}
After running this I have got exactly the same difference between the results:
result ( 0.4185241461) : 0xCC48D63E
result2 ( 0.4185241759) : 0xCD48D63E
For some reason this is different from the "pure" version described in the question. At one point I was also getting the same results as the "pure" version, but the question has since changed. The original values in the original question were different. They were:
float a = hex2float(0x1D9969BB);
float b = hex2float(0x6CEDC83E);
float c = hex2float(0xD2DC92B3);
float d = hex2float(0xA61FD930);
float e = hex2float(0x4FE9F23C);
and with these values the resulting output is:
result ( 0.4185242951) : 0xD148D63E
result2 ( 0.4185242951) : 0xD148D63E
The C and C++ standards both permit floating-point expressions to be evaluated with more precision than the nominal type. Thus, a+b-c+d+e may be evaluated using double even though the types are float, and the compiler may optimize the expression in other ways. In particular, using exact mathematics is essentially using an infinite amount of precision, so the compiler is free to optimize or otherwise rearrange the expression based on mathematical properties rather than floating-point arithmetic properties.
It appears, for whatever reason, your compiler is choosing to use this liberty to evaluate the expression differently in different circumstances (which may be related to the language being compiled or due to other variations between your C and C++ code). One may be evaluating (((a+b)-c)+d)+e while the other does (((a+b)+d)+e)-c, or other variations.
In both languages, the compiler is required to “discard” the excess precision when a cast or assignment is performed. So you can compel a certain evaluation by inserting casts or assignments. Casts would make a mess of the expression, so assignments may be easier to read:
float t0 = a+b;
float t1 = t0-c;
float t2 = t1+d;
float result = t2+e;

Can 'inf' be assigned to a variable like regular numeric values in c++

When I wrote the following code, instead of causing a runtime error it outputs 'inf'. Now, is there any way to assign this value ('inf') to a variable, like regular numeric values? How do I check whether a division yields 'inf'?
#include <iostream>

int main() {
    double a = 1, b = 0;
    std::cout << a / b << std::endl;
    return 0;
}
C++ does not require implementations to support infinity or division by zero. Many implementations will, as many implementations use IEEE 754 formats even if they do not fully support IEEE 754 semantics.
When you want to use infinity as a value (that is, you want to refer to infinity in source code), you should not generate it by dividing by zero. Instead, include <limits> and use std::numeric_limits<T>::infinity() with T specified as double.
Returns the special value "positive infinity", as represented by the floating-point type T. Only meaningful if std::numeric_limits<T>::has_infinity == true.
(You may also see code that includes <cmath> and uses INFINITY, which is inherited from C.)
When you want to check if a number is finite, include <cmath> and use std::isfinite. Note that computations with infinite values tend to ultimately produce NaNs, and std::isfinite(x) is generally more convenient than !std::isinf(x) && !std::isnan(x).
A final warning in case you are using unsafe compiler flags: In case you use, e.g., GCC's -ffinite-math-only (included in -ffast-math) then std::isfinite does not work.
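Putting these pieces together, a minimal sketch (assuming an IEEE 754 implementation, where 1.0/0.0 yields infinity):

#include <cmath>
#include <limits>
#include <iostream>

int main()
{
    const double inf = std::numeric_limits<double>::infinity();
    double x = 1.0 / 0.0;  // infinity on IEEE 754 implementations
    std::cout << (x == inf) << '\n'         // 1: compares equal to infinity()
              << std::isinf(x) << '\n'      // 1
              << std::isfinite(x) << '\n';  // 0
    return 0;
}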
It appears that I can:
#include <iostream>

int main() {
    double a = 1, b = 0, c = 1/0.0;
    std::cout << a / b << std::endl;
    if (a / b == c) std::cout << "Yes you can.\n";
    return 0;
}

c++ (double)0.700 * int(1000) => 699 (Not the double precision issue)

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have tried different typecasts of scaledvalue2, but not until I stored the multiplication in a double variable and then converted that to an int could I get the desired result. But I can't explain why.
I know double precision (0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not.
I would expect both to fail if precision were the problem.
I DON'T NEED a solution to fix it, just a WHY?
(the problem IS fixed)
#include <sstream>
#include <cstdio>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;
    double doubleScaled = (double)scaleFactor * value;
    int scaledvalue1 = doubleScaled;                            // = 700
    int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
    int scaledvalue3 = (double)(1000.0 * 0.7);                  // = 700
    std::ostringstream oss;
    oss << scaledvalue2;
    printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
           value, scaleFactor, doubleScaled, scaledvalue1, scaledvalue2, scaledvalue3, oss.str().c_str());
    return 0;
}
or in short:
value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
I can't figure out what is going wrong here.
Output :
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz
This is going to be a bit hand-wavy; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.
Since the x86 floating-point unit performs its computations in an extended-precision floating-point type (80 bits wide), the result can easily depend on whether the intermediate values were forcibly converted to double (the 64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.
In your example, the first part involves saving the intermediate result in a double variable
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.
The second part
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.
The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.
I converted mindriot's example assembly code to Intel syntax to test with Visual Studio. I could only reproduce the error by setting the floating point control word to use extended precision.
The issue is that rounding is performed when converting from extended precision to double precision when storing a double, whereas truncation is performed when converting from extended precision to integer when storing an integer.
The extended precision multiply produces a product of 699.999..., but the product is rounded to 700.000... during the conversion from extended to double precision when the product is stored into doubleScaled.
double doubleScaled = (double)scaleFactor * value;
Since doubleScaled == 700.000..., when truncated to integer, it's still 700:
int scaledvalue1 = doubleScaled; // = 700
The product 699.999... is truncated when it's converted into an integer:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
My guess here is that the compiler generated a compile-time constant of 700.000... rather than doing the multiply at run time.
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
This truncation issue can be avoided by using the round() function from the C standard library.
int scaledvalue2 = (int)round(scaleFactor * value); // should == 700
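C++ also has std::lround in <cmath> (C++11), which rounds and converts to an integer type in one step; a small sketch:

#include <cmath>
#include <cstdio>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;
    // Rounds to nearest and converts in one step, instead of truncating toward zero.
    long scaled = std::lround(scaleFactor * value); // 700
    std::printf("%ld\n", scaled);
    return 0;
}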
Depending on compiler and optimization flags, scaledvalue_a, involving a variable, may be evaluated at run time using your processor's floating-point instructions, whereas scaledvalue_b, involving constants only, may be evaluated at compile time using a math library (gcc, for example, uses MPFR, the GNU multiple-precision floating-point library, for this). The difference you are seeing seems to be the difference between the precision and rounding of the run-time vs compile-time evaluation of that expression.
Due to rounding errors, most floating-point numbers end up being slightly imprecise.
For the below double-to-int conversion, use the std::ceil() API:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??

Fortran COMPLEX calculates different from C++

I have completed a port from Fortran to C++ but have discovered some differences in the COMPLEX type. Consider the following codes:
PROGRAM CMPLX
    COMPLEX*16 c
    REAL*8 a
    c = (1.23456789, 3.45678901)
    a = AIMAG(1.0 / c)
    WRITE (*, *) a
END
And the C++:
#include <complex>
#include <iostream>
#include <iomanip>

int main()
{
    std::complex<double> c(1.23456789, 3.45678901);
    double a = (1.0 / c).imag();
    std::cout << std::setprecision(15) << " " << a << std::endl;
}
Compiling the C++ version with clang++ or g++, I get the output: -0.256561150444368
Compiling the Fortran version however gives me: -0.25656115049876993
I mean, don't both languages follow IEEE 754? If I run the following in Octave (Matlab):
octave:1> c=1.23456789+ 3.45678901i
c = 1.2346 + 3.4568i
octave:2> c
c = 1.2346 + 3.4568i
octave:3> output_precision(15)
octave:4> c
c = 1.23456789000000e+00 + 3.45678901000000e+00i
octave:5> 1 / c
ans = 9.16290109820952e-02 - 2.56561150444368e-01i
I get the same as the C++ version. What is up with the Fortran COMPLEX type? Am I missing some compiler flags? -ffast-math doesn't change anything. I want to produce the exact same 15 decimals in C++ and Fortran, so I can more easily spot porting differences.
Any Fortran gurus around? Thanks!
In the Fortran code replace
c = (1.23456789, 3.45678901)
with
c = (1.23456789d0, 3.45678901d0)
Without a kind, the real literals you use on the rhs are, most likely, 32-bit reals, and you probably want 64-bit reals. The suffix d0 causes the compiler to create the 64-bit reals closest to the values you provide. I've glossed over some details here, and there are other (possibly better) ways of specifying the kind of a real literal, but this approach should work on any current Fortran compiler.
I don't know C++ very well, so I'm not sure whether the C++ code has the same problem.
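For what it's worth, the analogous single- versus double-precision literal difference can be reproduced on the C++ side; a minimal sketch:

#include <complex>
#include <iostream>
#include <iomanip>

int main()
{
    // Float literals, as in the original Fortran: the values are rounded to
    // single precision before being widened to double.
    std::complex<double> c1(1.23456789f, 3.45678901f);
    // Double literals, as with the d0 suffix in Fortran.
    std::complex<double> c2(1.23456789, 3.45678901);
    std::cout << std::setprecision(17)
              << (1.0 / c1).imag() << '\n'   // should reproduce the Fortran output above
              << (1.0 / c2).imag() << '\n';  // matches the original C++ output
}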
If I read your question correctly, the two codes produce the same answer to 8 significant figures, the limit of single precision.
As for IEEE-754 compliance, that standard does not cover, so far as I am aware, complex arithmetic. I expect the f-p arithmetic used behind the scenes to produce results on complex numbers within expected error bounds in most cases, but I'm not aware that such bounds are guaranteed the way the error bounds on basic f-p operations are.
I would propose changing all the Fortran constants to double precision,
1.23456789_8 (or 1.23456789D00) etc.
and using DIMAG instead of AIMAG.