I am testing some very simple equality (equivalence) errors that arise when precision is an issue. I was hoping to perform the operations in extended double precision (so that I knew what the answer would be to ~19 digits) and then perform the same operations in double precision (where there would be roundoff error in the 16th digit), but somehow my double precision arithmetic is maintaining 19 digits of accuracy.
When I perform the operations in extended double, then hardcode the numbers into another Fortran routine, I get the expected errors, but is there something strange going on when I assign an extended double precision variable to a double precision variable here?
program code_gen

  implicit none

  integer, parameter :: Edp = selected_real_kind(17)
  integer, parameter :: dp = selected_real_kind(8)

  real(kind=Edp) :: alpha10, x10, y10, z10
  real(kind=dp) :: alpha8, x8, y8, z8
  real(kind = dp) :: pi_dp = 3.1415926535897932384626433832795028841971693993751058209749445

  integer :: iter
  integer :: niters = 10

  print*, 'tiny(x10) = ', tiny(x10)
  print*, 'tiny(x8) = ', tiny(x8)
  print*, 'epsilon(x10) = ', epsilon(x10)
  print*, 'epsilon(x8) = ', epsilon(x8)

  do iter = 1,niters
    x10 = rand()
    y10 = rand()
    z10 = rand()
    alpha10 = x10*(y10+z10)

    x8 = x10
    x8 = x8 - pi_dp
    x8 = x8 + pi_dp
    y8 = y10
    y8 = y8 - pi_dp
    y8 = y8 + pi_dp
    z8 = z10
    z8 = z8 - pi_dp
    z8 = z8 + pi_dp
    alpha8 = alpha10

    write(*, '(a, es30.20)') 'alpha8 .... ', x8*(y8+z8)
    write(*, '(a, es30.20)') 'alpha10 ... ', alpha10

    if( alpha8 .gt. x8*(y8+z8) ) then
      write(*, '(a)') 'ERROR(.gt.)'
    elseif( alpha8 .lt. x8*(y8+z8) ) then
      write(*, '(a)') 'ERROR(.lt.)'
    endif
  enddo

end program code_gen
where rand() is the gfortran function found here.
If we are speaking about only one precision type (take, for example, double), then we can denote machine epsilon as E16 which is approximately 2.22E-16. If we take a simple addition of two Real numbers, x+y, then the resulting machine expressed number is (x+y)*(1+d1) where abs(d1) < E16. Likewise, if we then multiply that number by z, the resulting value is really (z*((x+y)*(1+d1))*(1+d2)) which is nearly (z*(x+y)*(1+d1+d2)) where abs(d1+d2) < 2*E16. If we now move to extended double precision, then the only thing that changes is that E16 turns to E20 and has a value of around 1.08E-19.
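As an illustration, here is a minimal sketch (not part of the program above; it assumes, as the code does, that selected_real_kind(17) gives an extended kind): the double precision sum of two random values differs from an extended precision reference by a relative amount no larger than E16.
program error_model
  implicit none
  integer, parameter :: dp  = selected_real_kind(15)
  integer, parameter :: edp = selected_real_kind(17)
  real(dp)  :: x, y, s8
  real(edp) :: s10
  call random_number(x)
  call random_number(y)
  s8  = x + y                          ! sum rounded to double precision
  s10 = real(x, edp) + real(y, edp)    ! reference sum held in extended precision (effectively exact here)
  print *, 'relative error :', abs(real(s8, edp) - s10)/s10
  print *, 'epsilon(x)     :', epsilon(x)
end program error_model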
My hope was to perform the analysis in extended double precision so that I could compare two numbers which should be equal but show that, on occasion, roundoff error will cause the comparison to fail. By assigning x8 = x10, I was hoping to create a double precision 'version' of the extended double precision value x10, where only the first ~16 digits of x8 conform to those of x10. But upon printing the values, all 20 digits are the same, and the expected double precision roundoff error is not occurring.
It should also be noted that before this attempt, I wrote a program which actually writes another program where the values of x, y, and z are 'hardcoded' to 20 decimal places. In this version of the program, the comparisons of .gt. and .lt. failed as expected, but I am not able to duplicate the same failures by casting an extended double precision value as a double precision variable.
In an attempt to further 'perturb' the double precision values and add roundoff error, I have added, then subtracted, pi from my double precision variables, which should leave them with some double precision roundoff error, but I am still not seeing that in the final result.
As the gfortran documentation you link states, the function result of rand is a default real value (single precision). Such a value can be represented exactly by each of your other real types.
That is, x10=rand() assigns a single precision value to the extended precision variable x10. It does so exactly. This same value now stored in x10 is assigned to the double precision variable x8, but this remains exactly representable as double precision.
There is sufficient precision in the single-as-double that the calculations using double and extended types return the same value. [See the note at the end of this answer.]
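A minimal sketch of that point (the kind parameters and names are illustrative): a default real value survives widening to extended precision and then narrowing to double precision unchanged.
program exact_widening
  implicit none
  integer, parameter :: dp  = selected_real_kind(15)
  integer, parameter :: edp = selected_real_kind(17)
  real      :: r
  real(dp)  :: r8
  real(edp) :: r10
  call random_number(r)    ! a single precision value, like rand()
  r10 = r                  ! exact widening
  r8  = r10                ! still exact: only trailing zero bits are dropped
  print *, r8 == r10, r8 == r   ! both comparisons print T
end program exact_widening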
If you wish to see real effects of loss of precision, then start by using an extended or double precision value. For example, rather than using rand (returning a single precision value), use the intrinsic random_number
call random_number(x10)
(which has the advantage of being standard Fortran). Unlike a function, which in (nearly) all cases returns a value of a fixed type and kind regardless of how the result is used, this subroutine gives you a value with the precision of its argument. You will (hopefully) see much the same as you did in your "hard-coded" experiment.
Alternatively, as agentp commented, it may be more intuitive to start with a double precision value
call random_number(x8); x10=x8 ! x8 and x10 have the precision of double precision
call random_number(y8); y10=y8
call random_number(z8); z10=z8
and perform the calculations from that starting point: those extra bits will then start to show.
In summary, when you do x8=x10 the leading bits of x8 do correspond to those of x10, but the trailing bits of x10 (those beyond single precision) are all zero, so nothing is lost in the narrowing assignment.
When it comes to your pi_dp perturbation, you are again assigning a single precision value (this time a literal constant) to a double precision variable. Just having all those digits doesn't make it anything other than a default real literal. You can specify a literal of a different kind with a kind suffix (here _dp), as described in other answers.
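A minimal sketch of the difference (the kind parameter dp here is illustrative):
program literal_kinds
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  real(dp) :: pi_default, pi_kinded
  pi_default = 3.14159265358979323846      ! default real literal, widened after the fact
  pi_kinded  = 3.14159265358979323846_dp   ! genuine double precision literal
  print '(2es25.17)', pi_default, pi_kinded
end program literal_kinds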
Finally, one also then has to worry about what the compiler does with regards to optimization.
My thesis is that, starting from the single precision value, the calculations performed are representable exactly in both double and extended precision (with the same values). For other calculations, for a starting point with more bits set, or for other representations (for example, on some systems or with other compilers the numeric type with kind selected_real_kind(17) may have quite different characteristics, such as a different radix), that needn't be the case.
This was largely based on guessing and hoping that it explained the observation. Fortunately, there are ways to test the idea. As we're talking about IEEE arithmetic we can consider the inexact flag. If that flag isn't raised during the computation we can be happy.
With gfortran there is the compilation option -ffpe-trap=inexact which makes the inexact exception signalling (the program halts when it is raised). With gfortran 5.0 the intrinsic module ieee_exceptions is supported, which can be used in a portable/standard manner.
You can consider this flag for further experimentation: if it is raised then you can expect to see differences between the two precisions.
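For example, a minimal sketch of querying the flag with the standard module (the particular operation here is purely illustrative):
program check_inexact
  use, intrinsic :: ieee_exceptions
  implicit none
  logical :: flag
  real :: x, y
  call random_number(x)
  call ieee_set_flag(ieee_inexact, .false.)   ! clear the flag first
  y = x/3.0                                   ! a division that is almost never exact
  call ieee_get_flag(ieee_inexact, flag)
  print *, 'inexact raised:', flag, y
end program check_inexact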
Related
I am a university lecturer, and I will teach the Numerical Methods course this semester using Fortran 90/95 as the programming language. The beginning of the course starts with the representation of numbers, and I would like to talk about the limits of numbers that can be represented with REAL(4), REAL(8) and REAL(16). I intend to use the following code on OnlineGDB (so that students won't have to install anything on their computers, which may be a pain in times of remote learning):
Program declare_reals
  implicit none
  real(kind = 4) :: a_huge, a_tiny ! single precision ; default if kind not specified
  !real(4) :: a ! Equivalent to real(kind = 4) :: a
  a_huge = huge(a_huge)
  print*, "Max positive for real(4) : ", a_huge
  a_tiny = tiny(a_tiny)
  print*, "Min positive for real(4) : ", a_tiny
  print *
End Program declare_reals
With this code, I get
Max positive for real(4) : 3.40282347E+38
Min positive for real(4) : 1.17549435E-38
However, if I write a_tiny = tiny(a_tiny)/2.0, the output becomes
Min positive for real(4) : 5.87747175E-39
Looking at the documentation for gfortran (which OnlineGDB uses as the f95 compiler), I had the impression that anything below tiny(x) could result in an underflow and zero would show instead of a non-zero number. Could anyone help me understand what is happening here? If tiny(x) doesn't yield the smallest positive representable number, what is being shown due to the function call?
The Fortran Standard states the following about a real value:
The model set for real x is defined by

x = s × b^e × sum_{k=1}^{p} f_k × b^(-k), or x = 0,

where b and p are integers exceeding one; each f_k is a non-negative integer less than b, with f_1 nonzero; s is +1 or −1; and e is an integer that lies between some integer maximum emax and some integer minimum emin inclusively. For x = 0, its exponent e and digits f_k are defined to be zero. The integer parameters b, p, emin, and emax determine the set of model floating-point numbers.
Real values which satisfy this definition are referred to as model numbers or normal floating-point numbers. The floating-point numbers your system can represent, i.e. the machine-representable numbers, are a superset of the model numbers. They can, but need not, include the values with f_1 zero (also known as subnormal floating-point numbers), which are there to fill the underflow gap around zero.
The Fortran functions tiny(x), huge(x), epsilon(x), spacing(x) are all defined for model numbers.
The value of tiny(x) is given by b^(emin - 1), which for a single-precision floating-point number (binary32) is 2^(-126), the smallest model (normal) number. When your system follows IEEE 754, the machine-representable numbers will also contain the subnormal numbers. The smallest positive subnormal number is given by tiny(x)*epsilon(x), which in binary32 is 2^(-126) × 2^(-23). This explains why you can divide tiny(x) by two, i.e. the transition from normal to subnormal.
# smallest normal number
0 00000001 00000000000000000000000 (binary) = 0080 0000 (hex) = 2^(-126) ≈ 1.1754943508 × 10^(-38)
# smallest subnormal number
0 00000000 00000000000000000000001 (binary) = 0000 0001 (hex) = 2^(-126) × 2^(-23) ≈ 1.4012984643 × 10^(-45)
Note: when you divide tiny(x)*epsilon(x) by two, gfortran returns an arithmetic underflow error.
Ref: values taken from Wikipedia: Single precision floating-point format
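A minimal sketch showing the transition at run time for real(4) values (the variable name is illustrative; this uses variables rather than constant expressions, so no compile-time underflow error is triggered):
program subnormals
  implicit none
  real(4) :: t
  t = tiny(t)
  print *, 'tiny          :', t                   ! smallest normal, 2**(-126)
  print *, 'tiny/2        :', t/2.0_4             ! subnormal, still nonzero
  print *, 'tiny*epsilon  :', t*epsilon(t)        ! smallest subnormal, 2**(-149)
  print *, 'half of that  :', t*epsilon(t)/2.0_4  ! underflows to zero at run time
end program subnormals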
Below is the code I've tested in both 64-bit and 32-bit environments. The result is off by exactly one each time: the expected result is 1180000000, but the actual result is 1179999999. I'm not sure exactly why, and I was hoping someone could educate me:
#include <stdint.h>
#include <iostream>
using namespace std;
int main() {
    double odds = 1.18;
    int64_t st = 1000000000;
    int64_t res = st * odds;
    cout << "result: " << res << endl;
    return 1;
}
I appreciate any feedback.
1.18, or 118/100, can't be represented exactly in binary; its binary expansion has repeating digits, just as 1/3 does in decimal.
So let's go over a similar case in decimal: let's calculate (1/3) × 30000, which of course should be 10000:
odds = 1 / 3 and st = 30000
Since computers have only a limited precision we have to truncate this number to a limited number of decimals, let's say 6, so:
odds = 0.333333
0.333333 × 30000 = 9999.99. The cast (which in your program is implicit) will truncate this number to 9999.
There is no 100% reliable way to work around this: float and double simply have limited precision. Dealing with this is a hard problem.
Your program contains an implicit cast from double to an integer on the line int64_t res = st * odds;. Many compilers will warn you about this. It can be the source of bugs of the type you are describing. This cast, which can be explicitly written as (int64_t) some_double, rounds the number towards zero.
An alternative is rounding to the nearest integer with round(some_double). That will, in this case, give the expected result.
First of all - 1.18 is not exactly representable in double. Mathematically the result of:
double odds = 1.18;
is 1.17999999999999993782751062099 (according to an online calculator).
So, mathematically, odds * st is 1179999999.99999993782751062099.
But in C++, odds * st is an expression with type double. So your compiler has two options for implementing this:
Do the computation in double precision
Do the computation in higher precision and then round the result to double
Apparently, doing the computation in double precision in IEEE754 results in exactly 1180000000.
However, doing it in long double precision produces something more like 1179999999.99999993782751062099
Converting this to double is now implementation-defined as to whether it selects the next-highest or next-lowest value, but I believe it is typical for the next-lowest to be selected.
Then converting this next-lowest result to integer will truncate the fractional part.
There is an interesting blog post here where the author describes the behaviour of GCC:
It uses long double intermediate precision for x86 code (due to the x87 FPU's long double registers)
It uses actual types for x64 code (because the SSE/SSE2 FPU supports this more naturally)
According to the C++11 standard you should be able to inspect which intermediate precision is being used by outputting FLT_EVAL_METHOD from <cfloat>. 0 would mean actual values, 2 would mean long double is being used.
In the Fortran code given below, I have made all numbers involved in the calculation of pi double precision, but the value of pi I get is just a real number with a long run of zeros or 9s at the end. How do I make the program give pi in double precision? I am using the gfortran compiler.
!This program determines the value of pi using Monte-Carlo algorithm.
program findpi
  implicit none
  double precision :: x,y,radius,truepi,cnt
  double precision,allocatable,dimension(:) :: pi,errpi
  integer :: seedsize,i,t,iter,j,k,n
  integer,allocatable,dimension(:) :: seed

  !Determining the true value of pi to compare with the calculated value
  truepi=4.D0*ATAN(1.D0)

  call random_seed(size=seedsize)
  allocate(seed(seedsize))
  do i=1,seedsize
    call system_clock(t) !Using system clock to randomise the seed to
                         !random number generator
    seed(i)=t
  enddo
  call random_seed(put=seed)

  n=2000 !Number of times value of pi is determined
  allocate(pi(n),errpi(n))

  do j=1,n
    iter=n*100 !Number of random points
    cnt=0.D0
    do i=1,iter
      call random_number(x)
      call random_number(y)
      radius=sqrt(x*x + y*y)
      if (radius < 1) then
        cnt = cnt+1.D0
      endif
    enddo
    pi(j)=(4.D0*cnt)/dble(iter)
    print*, j,pi(j)
  enddo

  open(10,file="pi.dat",status="replace")
  write(10,"(F15.8,I10)") (pi(k),k,k=1,n)

  call system("gnuplot --persist piplot.gnuplot")
end program findpi
Your calculation is in double precision, but I see two issues:
The first is a systematic error... You determine pi by
pi(j)=(4.D0*cnt)/dble(iter)
iter is at most 2000*100, so 1/iter is at least 5e-6, so you can't resolve anything finer than that ;-)
The second issue is that your IO routines print the results in single precision! The line
write(10,"(F15.8,I10)") (pi(k),k,k=1,n)
and more specifically the format specifier "(F15.8,I10)" needs to be adjusted. At the moment it tells the compiler to use 15 characters overall to print the number, with 8 digits after the decimal point. As a first measure, you could use *:
write(10,*) (pi(k),k,k=1,n)
Or, keeping an explicit format, widen it so that it uses 22 characters in total with all 15 digits of double precision:
write(10,"(F22.15,I10)") (pi(k),k,k=1,n)
Someone wanting less precision would write
999 format ('The answer is x = ', F8.3)
Others wanting higher output precision may write
999 format ('The answer is x = ', F18.12)
Thus it totally depends on what the user desires. What is the format
statement that exactly matches the precision used in the calculation?
(Note, this may vary from system to system)
It is a difficult question because you request "the precision of the calculation", which depends on so many factors. For example: if I solve f(x)=0 via Newton's method to a tolerance of 1E-6, would you want a format with seven digits?
On the other hand, if you mean the "highest precision attainable by the type" (e. g., double or single precision) then you can simply find the corresponding epsilon (machine eps, or precision) and use that as the format flag. If epsilon is 1E-15, then you can use a format flag that does not have more than 16 digits.
In Fortran you can use the EPSILON(X) function to get this number (the answer will depend on the type of X), then you can take the floor of the absolute value of the logarithm (base 10) of epsilon, and make that the number of decimals in your float representation.
For example, if epsilon is 1E-12, the log is -12, the abs is 12, and the floor is 12, so you want a format like F15.12 (12 decimals + 1 point + the leading zero + the sign = 15 places).
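A minimal sketch of that recipe (the program and variable names are illustrative):
program eps_format
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x
  integer :: ndec
  character(len=16) :: fmt
  x = 4.0_dp*atan(1.0_dp)
  ndec = floor(abs(log10(epsilon(x))))           ! 15 for double precision
  write(fmt, '(a,i0,a,i0,a)') '(F', ndec + 3, '.', ndec, ')'
  print fmt, x                                   ! here fmt is '(F18.15)'
end program eps_format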
The problem with floating point numbers is that there is no precision as such: only significant digits.
For instance, if you are calculating longitudes in real*4, near the UK you'd be accurate to 6 decimal places, but in Colorado Springs it would only be accurate to 4 decimal places. It would not make any sense to print the number in F format; it is just rubbish after the 4th decimal place.
If you wish to print to maximum precision, print in E format. Since it is always n.nn..nEnn, you get all the significant digits.
Edit - user4050's query
Try the following example
program main
  real intpart, multiplier
  integer ii

  multiplier = 1
  do ii = 1, 6
    intpart = 9.87654321
    intpart = intpart * multiplier
    print '(F15.7, E15.7, G15.8)', intpart, intpart, intpart
    multiplier = multiplier * 10
  end do
  stop
end program
What you will get is something like
9.8765430 0.9876543E+01 9.8765430
98.7654266 0.9876543E+02 98.765427
987.6542969 0.9876543E+03 987.65430
9876.5429688 0.9876543E+04 9876.5430
98765.4296875 0.9876543E+05 98765.430
987654.3125000 0.9876543E+06 987654.31
Notice that the precision changes as the number gets bigger because a float only has 7 significant figures.
I would have said that the numeric values computed by Fortran and C++ would be much more similar. However, from what I am experiencing, the calculated numbers start to diverge after surprisingly few decimal digits. I came across this problem while porting some legacy code from the former language to the latter. The original Fortran 77 code...
INTEGER M, ROUND
DOUBLE PRECISION NUMERATOR, DENOMINATOR
M = 2
ROUND = 1
NUMERATOR=5./((M-1+(1.3**M))**1.8)
DENOMINATOR = 0.7714+0.2286*(ROUND**3.82)
WRITE (*, '(F20.15)') NUMERATOR/DENOMINATOR
STOP
... outputs 0.842201471328735, while its C++ equivalent...
int m = 2;
int round = 1;
long double numerator = 5.0 / pow((m-1)+pow(1.3, m), 1.8);
long double denominator = 0.7714 + 0.2286 * pow(round, 3.82);
std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(15)
<< numerator/denominator << std::endl;
exit(1);
... returns 0.842201286195064. That is, the computed values are equal only up to the sixth decimal. Although not particularly a Fortran advocate, I feel inclined to consider its results as the 'correct' ones, given its legitimate reputation as a number cruncher. However, I am intrigued about the cause of this difference between the computed values. Does anyone know what the reason for this discrepancy could be?
In Fortran, by default, floating point literals are single precision, whereas in C/C++ they are double precision.
Thus, in your Fortran code, the expression for calculating NUMERATOR is done in single precision; it is only converted to double precision when assigning the final result to the NUMERATOR variable.
And the same thing for the expression calculating the value that is assigned to the DENOMINATOR variable.
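A minimal sketch (not the original poster's exact fix) of the Fortran snippet with the literals written in the D (double precision) exponent form, which should bring the two results into much closer agreement:
INTEGER M, ROUND
DOUBLE PRECISION NUMERATOR, DENOMINATOR
M = 2
ROUND = 1
! the D exponent makes each literal a double precision constant
NUMERATOR = 5.D0/((M-1+(1.3D0**M))**1.8D0)
DENOMINATOR = 0.7714D0 + 0.2286D0*(ROUND**3.82D0)
WRITE (*, '(F20.15)') NUMERATOR/DENOMINATOR
STOP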