Order of magnitude for double precision - Fortran

What order of magnitude difference should I be expecting for a subtraction between two theoretically equal double precision numbers?
I have two double precision arrays. They are expected to be theoretically the same, but they are calculated by two completely different methodologies, so there is some numerical difference between them. I checked them element by element and my maximum difference came out to be
6.5557799910909154E-008. My boss says that for double precision this is a very large difference, but I thought that if the difference is of the order of E-008, then it's alright.
Thank you,
Pradeep

Double precision floating point has the following format
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
This gives 15 to 17 significant decimal digits of precision. If a decimal string with at most 15 significant decimal digits is converted to IEEE 754 double precision and then converted back to a string with the same number of significant digits, the final string will match the original; and if an IEEE 754 double is converted to a decimal string with at least 17 significant digits and then converted back to double, the final number will match the original.
Single precision floating point has the following format
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives 6 to 9 significant decimal digits of precision: if a decimal string with at most 6 significant decimal digits is converted to IEEE 754 single precision and then converted back to a string with the same number of significant digits, the final string will match the original; and if an IEEE 754 single is converted to a decimal string with at least 9 significant digits and then converted back to single, the final number will match the original.
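A minimal C++ sketch of those round trips (not part of the original answer; the test values are arbitrary):
#include <cstdio>
#include <cstdlib>
#include <cstring>
int main() {
    // 15 significant decimal digits survive a decimal -> double -> decimal round trip.
    const char *fifteen = "0.123456789012345";
    double d = std::strtod(fifteen, nullptr);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.15g", d);
    std::printf("%s -> %s (match: %d)\n", fifteen, buf, std::strcmp(fifteen, buf) == 0);
    // 17 significant decimal digits are enough to recover a double exactly.
    double x = 1.0 / 7.0;
    std::snprintf(buf, sizeof buf, "%.17g", x);
    std::printf("%.17g round-trips exactly: %d\n", x, x == std::strtod(buf, nullptr));
}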
The maximum difference you are encountering indicates a loss of precision akin to converting to single precision.
Do you know which of the two methods is more accurate? Is the main difference a trade-off between speed of computation and precision, or is one of the algorithms less numerically stable? What is the precision of the inputs? Agreement to only about 8 decimal digits may not matter if your inputs aren't that precise... or it could mean missing Mars on a planetary trajectory.
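The question is about Fortran, but since the other snippets in this thread are C++, here is a hedged C++ sketch of the kind of check being discussed: compute the largest relative element-wise difference and compare it with double and single precision epsilon (the array contents below are made up for illustration):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>
int main() {
    // Two hypothetical arrays that should agree in exact arithmetic.
    std::vector<double> a = {1.0, 2.5, 3.75};
    std::vector<double> b = {1.0 + 5e-8, 2.5 - 3e-8, 3.75 + 6.5e-8};
    double max_rel = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double scale = std::max(std::abs(a[i]), std::abs(b[i]));
        if (scale > 0.0)
            max_rel = std::max(max_rel, std::abs(a[i] - b[i]) / scale);
    }
    // double epsilon ~ 2.2e-16, float epsilon ~ 1.2e-7: a maximum relative
    // difference of order 1e-8 is far above double rounding error and much
    // closer to what single-precision arithmetic or a lossy algorithm leaves.
    std::printf("max relative difference: %g\n", max_rel);
    std::printf("double eps: %g, float eps: %g\n",
                std::numeric_limits<double>::epsilon(),
                static_cast<double>(std::numeric_limits<float>::epsilon()));
}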

Related

fibonacci series Precision [duplicate]

std::numeric_limits<float>::digits10 and precision after the dot

When std::numeric_limits<float>::digits10 returns 7, does it mean that I get 7 significant figures in total, counting the digits on both sides of the decimal point, or 7 digits after the decimal point?
For instance is it like:
1.123456
12.12345
or is it like
12.1234657
From cppreference
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits - 1 for floating-point types) multiplied by log10(radix) and rounded down.
And later
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
That means, e.g.
std::cout << std::numeric_limits<float>::digits10; // 6
std::cout << std::numeric_limits<float>::digits;   // 24
The second one is the number of base-2 digits (bits) in the mantissa, while the first one is the number of decimal digits that can safely be represented across the aforementioned conversions.
TL;DR: it's your first case.
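A small sketch of what that guarantee means in practice (the 8.589973e9 value is the cppreference example quoted above; the 6-digit value is arbitrary):
#include <cstdio>
int main() {
    // digits10 == 6: any decimal with 6 significant digits survives decimal -> float -> decimal.
    float ok = 1.23456e7f;
    std::printf("%.6g\n", ok);   // prints 1.23456e+07 again
    // 7 significant digits are not always safe: this is the cppreference example quoted above.
    float bad = 8.589973e9f;
    std::printf("%.7g\n", bad);  // prints 8.589974e+09
}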

Errors multiplying large doubles

I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like
1000000000000000000000*1000000000000000000000
I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.
Floating point numbers represent values with a fixed-size representation. A double can hold about 16 significant decimal digits in a form where those digits can be restored (internally it normally stores the value in base 2, which means it cannot represent most fractional decimal values exactly). If that number of digits is exceeded, the value is rounded appropriately. The upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more than 16 decimal digits, either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers bigger than 1e16), the formatting will conjure up more digits: it will faithfully render the internally held binary value, which can take far more than 16 digits to write out exactly.
If you want to compute with large values accurately, you'll need some variable sized representation. Since your values are integers a big integer representation might work. These will typically be a lot slower to compute with than double.
A double stores 53 bits of precision. This is about 15 decimal digits. Your problem is that a double cannot store the number of digits you are trying to store. Digits after the 15th decimal digit will not be accurate.
That's not an error. It follows from how floating-point types are represented: the result is correct to double precision.
Floating-point types in computers are stored in the form (-1)^sign * mantissa * 2^exponent, so they have a wide range but not unlimited precision. They are only accurate to the width of the mantissa, and the result of every operation is rounded accordingly. The double type is most commonly implemented as IEEE 754 64-bit double precision with a 53-bit mantissa, so it is correct to about log10(2^53) ≈ 15.955 decimal digits. Doing 1e21*1e21 produces 1e42, which, when rounded to the closest representable double-precision value, gives the number that you saw. Rounded to 16 significant digits it is exactly 1e42.
If you need more precision, use long double where it is wider than double. If you only work with integers, then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) gives exact results over a wider range of integers (64/128 bits compared to a 53-bit mantissa). If you need even more precision, use an arbitrary-precision arithmetic library such as GMP.
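A minimal sketch reproducing the effect in the question: 1e21 is exactly representable as a double, but the exact product 1e42 is not, so it is rounded to the nearest representable value (the exact digit string printed can depend on the C library, but it is typically the one quoted in the question):
#include <cstdio>
int main() {
    double a = 1e21;           // exactly representable: 10^21 = 2^21 * 5^21, and 5^21 < 2^53
    double p = a * a;          // the exact value 10^42 is not representable; p is the nearest double
    std::printf("%.0f\n", p);  // prints the long digit string from the question, not 1e42 exactly
    std::printf("%.16g\n", p); // to 16 significant digits it reads as 1e+42
}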

Want to calculate 18 digits but will only calculate to 6? Using long double but still won't work

I've written a program to estimate pi using the Gregory-Leibniz formula; however, it will not calculate to 18 decimal places, only to about 5. Any suggestions?
Use
cout.precision(50);
to increase the precision of the printed output. Here 50 is the maximum number of significant digits in the output.
The default printing precision for printf is 6
Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6
Similarly when std::cout was introduced into C++ the same default value was used
Manages the precision (i.e. how many digits are generated) of floating point output performed by std::num_put::do_put.
Returns the current precision.
Sets the precision to the given one. Returns the previous precision.
The default precision, as established by std::basic_ios::init, is 6.
https://en.cppreference.com/w/cpp/io/ios_base/precision
Therefore, regardless of how precise the type is, only six digits are printed by default (six fractional digits for printf's %f, six significant digits for std::cout's default format). To get more digits you'll need to use std::setprecision or std::cout.precision.
However, calling std::cout.precision only affects the number of digits in the output, not the number's real precision. Any digits beyond the type's precision are just noise.
Most modern systems use IEEE 754, where float is single precision with a 23-bit stored mantissa and double is double precision with a 52-bit stored mantissa (each with one more implicit bit). As a result they are accurate to roughly 6-7 and 15-16 significant decimal digits respectively, so they cannot represent numbers to 18 decimal places as you expected.
On some platforms there are extended-precision types that can store numbers more precisely. For example, long double on most x86 compilers has a 64-bit mantissa and can represent about 18 significant digits, but that is not 18 digits after the decimal point. Higher precision can be obtained with quadruple precision on some compilers. To go further, the only way is to use a big-number library or write one of your own.
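The original poster's code isn't shown, so here is a hypothetical minimal version of the series with the print precision raised, just to illustrate the advice above:
#include <iomanip>
#include <iostream>
int main() {
    // Gregory-Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    long double sum = 0.0L;
    for (long long k = 0; k < 10000000; ++k)
        sum += (k % 2 == 0 ? 1.0L : -1.0L) / (2.0L * k + 1.0L);
    // Without setprecision only 6 significant digits would be printed.
    std::cout << std::setprecision(18) << 4.0L * sum << '\n';
    // setprecision controls how many digits are printed, not how many are correct;
    // this series converges so slowly that far fewer than 18 of them are right here.
}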

Double precision - decimal places

From what I have read, a value of data type double has an approximate precision of 15 decimal places. However, when I use a number whose decimal representation repeats, such as 1.0/7.0, I find that the variable holds the value of 0.14285714285714285 - which is 17 places (via the debugger).
I would like to know why it is represented as 17 places internally, and why the precision is always described as about 15 places.
An IEEE double has 53 significant bits (that's the value of DBL_MANT_DIG in <cfloat>). That's approximately 15.95 decimal digits (log10(2^53)); the implementation sets DBL_DIG to 15, not 16, because it has to round down. So you have nearly an extra decimal digit of precision (beyond what's implied by DBL_DIG==15) because of that.
The nextafter() function computes the nearest representable number to a given number; it can be used to show just how precise a given number is.
This program:
#include <cstdio>
#include <cfloat>
#include <cmath>
int main() {
    double x = 1.0/7.0;
    printf("FLT_RADIX = %d\n", FLT_RADIX);
    printf("DBL_DIG = %d\n", DBL_DIG);
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
    printf("%.17g\n%.17g\n%.17g\n", nextafter(x, 0.0), x, nextafter(x, 1.0));
}
gives me this output on my system:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.14285714285714282
0.14285714285714285
0.14285714285714288
(You can replace %.17g by, say, %.64g to see more digits, none of which are significant.)
As you can see, the last displayed decimal digit changes by 3 from one representable value to the next. The fact that the last displayed digit of 1.0/7.0 (5) happens to match the corresponding digit of the mathematical value is largely coincidental; the correctly rounded 17th digit of 1/7 would actually be 6, not 5. Replacing 1.0/7.0 by 1.0/3.0 gives this output:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.33333333333333326
0.33333333333333331
0.33333333333333337
which shows about 16 decimal digits of precision, as you'd expect.
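To make the underlying spacing explicit, you can print the distance between adjacent doubles around 1/7; a small sketch, not part of the original answer:
#include <cmath>
#include <cstdio>
int main() {
    double x = 1.0 / 7.0;
    // Distance to the next representable double above x is about 2.8e-17,
    // which is why the 17th significant digit jumps by 3 between neighbours.
    std::printf("%.3g\n", std::nextafter(x, 1.0) - x);
}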
It is actually 53 binary digits, which translates to 15 stable decimal digits, meaning that if you start out with a number with 15 significant decimal digits, convert it to a double, and then round the double back to 15 significant decimal digits, you'll get the same number. To uniquely represent a double you need 17 significant decimal digits (printing a double with 17 significant digits and converting the string back always recovers the same double), which is why 17 places show up; but not every 17-digit decimal maps to a distinct double value (as in the examples in the other answers).
Decimal representation of floating point numbers is kind of strange. If you have a number with 15 significant decimal digits and convert it to a double, then print the double back out with 15 significant digits, you should get the same number. On the other hand, if you print out an arbitrary double with 15 significant digits and then convert it back to a double, you won't necessarily get the same value back; you need 17 significant digits for that. And neither 15 nor 17 digits are enough to display the exact decimal equivalent of an arbitrary double; in the worst case that takes several hundred decimal digits.
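As a quick way to see the exact decimal value a double actually holds, you can ask printf for a large number of fractional digits; a small sketch (glibc prints the value exactly, while other C libraries are allowed to round off earlier):
#include <cstdio>
int main() {
    // 0.1 has no exact binary representation; printing many fractional digits
    // reveals the decimal expansion of the double that is actually stored.
    std::printf("%.60f\n", 0.1);
    // typically begins 0.1000000000000000055511151231257827... followed by more digits
}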
See the Wikipedia page for double-precision and this article on floating-point precision.
A double holds 53 binary digits accurately, which is ~15.9545898 decimal digits. The debugger can show as many digits as it pleases in order to pin down the binary value more precisely. Going the other way, some values need fewer digits in decimal than in binary: 0.1 takes one digit in base 10 but has an infinite expansion in base 2.
This sounds odd, so I'll show an extreme example. If we make a super simple floating point format that holds only 3 binary digits of accuracy and no exponent or sign bit (so the range is 0 to 0.875 in steps of 0.125), our options are:
binary - decimal
000 - 0.000
001 - 0.125
010 - 0.250
011 - 0.375
100 - 0.500
101 - 0.625
110 - 0.750
111 - 0.875
But if you do the math, this format is only accurate to 0.903089987 decimal digits (3 * log10(2)). Not even one full digit is accurate, which is easy to see: no representable value begins with 0.4 or 0.9, and yet displaying the values exactly requires 3 decimal digits.
tl;dr: The debugger shows you the value of the floating point variable to some arbitrary precision (17 digits in your case), which doesn't necessarily correspond to the precision of the floating point format (roughly 15-16 significant decimal digits for a double).
IEEE 754 floating point is done in binary. There's no exact conversion from a given number of bits to a given number of decimal digits. 3 bits can hold values from 0 to 7, and 4 bits can hold values from 0 to 15. A value from 0 to 9 takes roughly 3.5 bits, but that's not exact either.
An IEEE 754 double precision number occupies 64 bits. Of this, 52 bits are dedicated to the significand (the rest is a sign bit and exponent). Since the significand is (usually) normalized, there's an implied 53rd bit.
Now, given 53 bits and roughly 3.5 bits per digit, simple division gives us 15.1429 digits of precision. But remember, that 3.5 bits per decimal digit is only an approximation, not a perfectly accurate answer.
Many (most?) debuggers actually look at the contents of the entire register. On an x86, that's actually an 80-bit number. The x86 floating point unit will normally be adjusted to carry out calculations to 64-bit precision -- but internally, it actually uses a couple of "guard bits", which basically means internally it does the calculation with a few extra bits of precision so it can round the last one correctly. When the debugger looks at the whole register, it'll usually find at least one extra digit that's reasonably accurate -- though since that digit won't have any guard bits, it may not be rounded correctly.
It is because it's being converted from a binary representation. Just because it has printed all those decimal digits doesn't mean it can represent all decimal values to that precision. Take, for example, this in Python:
>>> 0.14285714285714285
0.14285714285714285
>>> 0.14285714285714286
0.14285714285714285
Notice how I changed the last digit, but it printed out the same number anyway.
In most contexts where double values are used, calculations will have a certain amount of uncertainty. The difference between 1.33333333333333300 and 1.33333333333333399 may be less than the amount of uncertainty that exists in the calculations. Displaying the value of "2/3 + 2/3" as "1.33333333333333" is apt to be more meaningful than displaying it as "1.33333333333333319", since the latter display implies a level of precision that doesn't really exist.
In the debugger, however, it is important to uniquely indicate the value held by a variable, including essentially-meaningless bits of precision. It would be very confusing if a debugger displayed two variables as holding the value "1.333333333333333" when one of them actually held 1.33333333333333319 and the other held 1.33333333333333294 (meaning that, while they looked the same, they weren't equal). The extra precision shown by the debugger isn't apt to represent a numerically-correct calculation result, but indicates how the code will interpret the values held by the variables.
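A concrete illustration of that last point (not from the original answer): two doubles can display identically at 15 significant digits and still compare unequal, which is exactly the ambiguity a debugger's extra digits avoid:
#include <cstdio>
int main() {
    double a = 0.1 + 0.2;
    double b = 0.3;
    std::printf("%.15g  %.15g  equal: %d\n", a, b, a == b); // both display as 0.3, yet not equal
    std::printf("%.17g  %.17g\n", a, b);                    // 17 digits reveal the difference
}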