Fortran handling large real numbers with precision [duplicate] - fortran

This question already has answers here:
Is There a Better Double-Precision Assignment in Fortran 90?
(2 answers)
Closed 5 years ago.
When I run the code below I get an output of 6378136.5 instead of 6378136.3
PROGRAM test
implicit none
real*8 radius
radius = 6378136.3
print*,radius
END
I have read this other link (Precision problems of real numbers in Fortran) but it doesn't explain how to fix this problem.

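This is happening not because the variable you are using lacks precision, but because you initialized it with a single precision number: a literal such as 6378136.3 with no exponent or kind suffix is a default (single precision) real, so it is rounded before it is ever stored in radius.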
Take a look at this answer for a good explanation, and an elegant solution to your problem for any larger programs.
If you just want to solve it quickly, then you only have to change one line:
radius = 6378136.3d0
Though this will still give you a value of 6378136.2999999998, because 6378136.3 has no exact binary floating point representation and that is the closest double precision value.
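The same effect can be reproduced outside Fortran. Here is a rough sketch in C++ (an analogue, not the Fortran fix itself), assuming IEEE 754 float and double. The f suffix plays the role of a Fortran literal written without d0, and the function names are made up for illustration.

```cpp
#include <cstdio>

// Single-precision literal widened to double: the rounding to float
// happens before the conversion, so the lost bits never come back.
double radius_from_single_literal() {
    return 6378136.3f;
}

// Double-precision literal: rounded once, to the nearest double.
double radius_from_double_literal() {
    return 6378136.3;
}

void print_radii() {
    std::printf("%.17g\n", radius_from_single_literal()); // 6378136.5
    std::printf("%.17g\n", radius_from_double_literal()); // 6378136.2999999998
}
```

Near 6.4 million, neighbouring single precision values are 0.5 apart, which is why the first result lands on 6378136.5.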

Related

Why is 12.0 == 11.999999999999999999999 considered true? [duplicate]

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 3 years ago.
Can somebody give me an in-depth explanation of what's going on?
The system uses an approximation, right? (Correct me if I'm wrong.)
I would like to know how the computer behaves in these kinds of situations. Thank you.
Floating-point numbers in computers are stored with only so many bits of precision. A float in C++ is typically 4 bytes (32 bits) and a double typically 8 bytes (64 bits). Neither can store that many 9s of precision, so the compiler rounds the literal to the nearest value it can represent, which here is exactly 12.0.
Basically, you get about 7 significant decimal digits in a float and about 15 to 17 in a double, and you have a lot more 9s than that.
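The rounding can be observed directly. This is a minimal sketch, assuming the usual IEEE 754 float and double. The function names are invented for illustration.

```cpp
#include <cstdio>
#include <limits>

// The literal carries far more 9s than a double can hold, so the
// compiler rounds it to the nearest representable value: exactly 12.0.
bool many_nines_equals_twelve() {
    double almost = 11.999999999999999999999;
    return almost == 12.0;
}

// digits10 is the conservative count of decimal digits that are
// guaranteed to survive a round trip through the type.
void print_decimal_digits() {
    std::printf("float:  %d\n", std::numeric_limits<float>::digits10);  // 6 with IEEE 754
    std::printf("double: %d\n", std::numeric_limits<double>::digits10); // 15 with IEEE 754
}
```

Near 12 the gap between neighbouring doubles is roughly 1.8e-15, while the literal is only 1e-21 away from 12, so it cannot be told apart from 12.0.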

Is there a C++ function for converting single precision IBM floats to IEEE-754 floats? [duplicate]

This question already has answers here:
Convert from the IBM floating point to the IEEE floating point standard and vice versa in C#
(4 answers)
IBM Single Precision Floating Point data conversion to intended value
(1 answer)
Closed 3 years ago.
I'm trying to read single precision floating point numbers from a binary (.segy) file on windows, using C++. These numbers follow the IBM floating point architecture, so I need to convert them into IEEE-754 floats after reading.
I have found this C code:
https://www.thecodingforums.com/threads/c-code-for-converting-ibm-370-floating-point-to-ieee-754.438469/
Unfortunately it does not compile on Windows.
I also found this code:
https://www.codeproject.com/Articles/12363/Transform-between-IEEE-IBM-or-VAX-floating-point-n
This solution seems a bit too complicated for me, and I'm not sure how to change it to read (IBM format) binary data directly.
Is there simple C++ code available to solve this problem?
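For what it's worth, the conversion itself can be sketched in a few lines. It relies on the documented IBM single-precision layout: 1 sign bit, a 7-bit base-16 exponent biased by 64, and a 24-bit fraction with no hidden bit. The names load_be32 and ibm32_to_double are made up here, and the sketch assumes the 4 bytes are stored big-endian, as is traditional for SEG-Y. Treat it as a starting point, not a tested library routine.

```cpp
#include <cmath>
#include <cstdint>

// Assemble a 32-bit word from 4 bytes stored big-endian
// (assumption: the file uses the traditional SEG-Y byte order).
std::uint32_t load_be32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Decode an IBM single-precision hexadecimal float:
//   bit 31      sign
//   bits 30-24  exponent, base 16, biased by 64
//   bits 23-0   fraction, read as 0.F (no hidden bit)
double ibm32_to_double(std::uint32_t bits) {
    const double sign = (bits >> 31) ? -1.0 : 1.0;
    const int exponent = static_cast<int>((bits >> 24) & 0x7F) - 64;
    const std::uint32_t fraction = bits & 0x00FFFFFFu;
    // value = fraction / 2^24 * 16^exponent = fraction * 2^(4*exponent - 24)
    return sign * std::ldexp(static_cast<double>(fraction), 4 * exponent - 24);
}
```

The result is returned as a double because the IBM format reaches about 7.2e75, which is outside the range of an IEEE-754 float. Narrow it to float only if your data is known to fit.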

c++ double more accuracy/precision [duplicate]

This question already has answers here:
C++ calculating more precise than double or long double
(2 answers)
Closed 6 years ago.
Is there any floating point type that stores more digits after the decimal point than double in C++ (or any alternative that makes double store more digits)?
I've read that long double may be more accurate.
In my program we can zoom into the Mandelbrot set, but after some zoom the picture gets pixelated. I think it is because the distance between the two complex numbers associated with two neighboring pixels is less than the difference between two consecutive values of double. In the program I used long double.
If it's important, the processor of my computer is an Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4, the computer is 64-bit, the operating system is Ubuntu and the compiler is gcc.
You should take a look at third-party libraries like Boost.Multiprecision or even GMP.
You can also do it "by hand", but that would be a lot of work. You would have to keep the numbers in their string representation and implement the arithmetic operations yourself.
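Before switching types, the asker's hypothesis can be checked with the standard library alone. This is a small sketch for double (the helper names are hypothetical): it measures the gap to the next representable value and tests whether a pixel step is swallowed by rounding.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable double above it.
double spacing_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// True when adding pixel_step leaves the coordinate unchanged, i.e.
// neighbouring pixels map to the same coordinate and the image pixelates.
bool pixel_step_is_lost(double coordinate, double pixel_step) {
    double moved = coordinate + pixel_step;
    return moved == coordinate;
}
```

For a coordinate around -0.75 the gap is about 1.1e-16, so a pixel step of 1e-17 is lost while 1e-10 is not. Once the zoom reaches that scale, a wider type or a multiprecision library is the only way forward.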

What does -1.#IND mean (double stream output) [duplicate]

This question already has answers here:
Why are the return values of these doubles -1.#IND?
(3 answers)
Closed 8 years ago.
I couldn't find it via Google, by searching here, or on Microsoft's help pages...
After some extensive calculations, sometimes, when outputting my doubles via std::cout, this is printed on the console:
-1.#IND
There are no modifications (like precision etc.) to the cout stream. I assume the program wants to tell me about some sort of error, but I can't figure it out :/
It happens only occasionally (it is a genetic algorithm, so I have an output after every generation, and this seems to happen in about every 5th to 10th generation...)
For information, I'm using Visual Studio Pro 2013.
The Visual Studio runtime displays NaN ("not a number") as -1.#IND. NaN is the result of a mathematical operation that has no defined value. For example, 0.0 / 0.0 or sqrt(-1.0) will return NaN. I can't really help further without more details about the underlying operation. Hopefully this is enough to point you in the right direction, though.
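One way to track down where the NaN first appears is to test intermediate results with std::isnan. A minimal sketch follows. The helper checked and its label argument are hypothetical, not part of any library.

```cpp
#include <cmath>
#include <iostream>

// 0.0 / 0.0 has no defined value, so IEEE 754 arithmetic yields NaN.
double zero_over_zero() {
    double zero = 0.0;
    return zero / zero;
}

// Pass a value through unchanged, reporting on stderr if it is NaN.
// Wrapping suspicious intermediate results narrows down the culprit.
double checked(double value, const char* label) {
    if (std::isnan(value)) {
        std::cerr << "NaN produced by " << label << '\n';
    }
    return value;
}
```

For example, wrapping a fitness computation as checked(fitness, "fitness") would log the first generation in which it goes bad. Note that NaN compares unequal to everything, including itself, so value == value is false for a NaN.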

Working around float or double numbers in C++. Errors of representation. Loss of decimal values [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 8 years ago.
I heard that C/C++ has a problem with the handling of floating point numbers.
I've implemented a simple program to try it. It is a change machine: the user enters the amount to charge and the amount paid, and the program calculates how many coins of each type to give as change.
Here is the code: Link to my google drive folder with the code
The thing is, when you enter a non-integer value, the program enters a loop and never ends.
I've printed the contents of the variables to find out what's going on, and, somehow, a two-decimal value such as 0.10 gets changed to 0.0999998.
Then the remaining change to be processed never reaches 0, and the program enters an infinite loop.
I've heard that this is due to the machine representation of floating point numbers. I've seen the same behavior on both Windows and Linux, and also when programming it in Java, but I don't remember having the same issue in Pascal.
Now the question is: what is the best workaround for this?
One possible solution I've thought of is using a fixed point representation via external libraries (http://www.trenki.net/content/view/17/1/ or http://www.codef00.com/code/Fixed.h). Another may be to use an arbitrary-precision arithmetic library such as GMP.
Neither C nor C++ has a problem with floating point values. You as the programmer are trusted to use floating point appropriately in any language supporting it.
While integer variables can store neither fractions nor out-of-range values, floating point variables can store only a specific subset of fractions. A high quality floating point implementation also gives tight guarantees for the accuracy of calculations.
Floating point numbers cannot represent every rational number, let alone every real number, which would need unbounded space to store exactly. A binary floating point type holds only fractions whose denominator is a power of two, so a value such as 0.10 is stored as the nearest representable number.
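For a change machine, a common way to use floating point "appropriately" is to not use it for the running total at all. Convert the amount to whole cents once, then work in integers. The sketch below is one possible shape, not the asker's program. The function names and coin values are hypothetical, and the amount is assumed to be non-negative.

```cpp
#include <cmath>
#include <vector>

// Convert a decimal amount to whole cents once, rounding to nearest,
// so that 0.0999998 and 0.10 both become 10.
long to_cents(double amount) {
    return std::lround(amount * 100.0);
}

// Greedy change-making over integer cents. Coins must be listed in
// descending order and end with 1 so the remainder always reaches 0.
std::vector<int> make_change(long cents, const std::vector<long>& coins) {
    std::vector<int> counts;
    for (long coin : coins) {
        counts.push_back(static_cast<int>(cents / coin));
        cents %= coin; // exact integer remainder, no rounding drift
    }
    return counts;
}
```

With coins {50, 20, 10, 5, 2, 1}, make_change(to_cents(0.68), coins) gives one 50, one 10, one 5, one 2 and one 1, and the loop always terminates because the remainder is an exact integer.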