I don't have any programming experience in Fortran, but for one of my courses at school we have to translate a program from Fortran into Java. The line of code that I'm having an issue with is
295 FORMAT(1X,'Y(X) =',D25.16,' * X ',A1,D25.16,///)
I don't think the entire line is necessarily needed, but I wanted to give the whole line for context. The part that says D25.16 has thrown me off, since I can't seem to find any information about it anywhere. I originally thought it was formatting a double-precision number with 25 digits to the left of the decimal point and 16 digits to the right, but I can't find anything confirming that and don't know whether I'm right or wrong. I was just hoping someone could give some insight into what it does.
The D edit descriptor is closely related to the E edit descriptor (which may be easier to find reading material about), but is distinct from the F edit descriptor mentioned in a comment.
E and D specify that the real number will be presented with an exponent. For output using E a number may be written as, say,
+0.1234e+12
For D25.16 you are correct that the number of digits after the decimal point is 16 (well, the fractional part), but it is the overall width of the field that is 25, not the number of digits to the left. [To the left is either a 0 or nothing.] The field width has contributions from the (obligatory) sign, (optional) leading 0, the (potentially optional) exponent marker, the (obligatory) exponent sign, and (at least) two exponent digits.
If there are three exponent digits then the exponent marker is missing (leading to such things as .1234+110).
There are differences between D and E:
D doesn't allow specification of the exponent width (cf., E15.5E5);
D allows (but doesn't oblige) use of D as the exponent marker (instead of E).
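As a rough illustration only (this is not the Java translation itself, just a C-family analogue): C's %E conversion always puts one non-zero digit before the decimal point, where Fortran's D/E descriptors write a leading 0, so the strings are analogous rather than identical.
#include <cstdio>

int main() {
    double y = 0.1234e+12;
    // Field width 25, 16 digits after the decimal point, exponent form --
    // roughly what Fortran's D25.16 (or E25.16) produces, except that C
    // prints 1.2340000000000000E+11 where Fortran prints 0.1234...E+12 (or D+12).
    std::printf("%25.16E\n", y);
    return 0;
}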
Related
I'm having a problem when some numbers get too small in my program: I write them to a file, and the exponential format changes:
for example, numbers > 1e-100:
0.3979111076224349D-98
smaller numbers:
0.2306878464709676-101 (The D disappears)
And since it is read by another program, those numbers are not read properly.
Currently I'm using the format 3D25.16
A possible solution would be forcing 3E25.15E3
The problem is that I lose 1 digit for any number
I want to avoid losing a digit, and I want to avoid losing performance with tests before printing.
Is there any other solution? The ideal solution for me would be a format that prints the exponential with 2 digits in the exponent and changes to 3 digits when the number is < 1e-100.
Another good solution would be a format option that turns very small numbers into zero.
Another question: when changing from 3D25.16 to 3E25.15E3, do I lose precision by changing D to E? (3D25.15E3 is not accepted.)
Thank you
As everyone said in the comments (and also you in the question), you can just use the E descriptor instead of D, so that you are allowed to specify the number of digits in the exponent part.
"Currently I'm using the format 3D25.16. A possible solution would be forcing 3E25.15E3. The problem is that I lose 1 digit for any number."
Well, why not just increase the width of the output and keep all 16 fractional digits, with 3E26.16E3? A field written under E26.16E3 needs at most 24 characters (1 for the sign, 2 for the leading "0.", 16 fractional digits, 1 for "E", 1 for the exponent sign and 3 exponent digits), so a width of 26 is comfortably enough.
Other doubt is: when changing from 3D25.16 to 3E25.15E3 do I lose precision by changing D to E?
With 3E25.15E3: yes, you possibly do (1 digit). With 3E26.16E3: no, you don't.
I have a Single (which I believe is the equivalent of a C++ float) in VBA in an Excel workbook module. The value I originally assigned (876.34497) is rounded off to 876.345 in the Immediate Window, in the Watch window, and in the hover tooltip when I set a breakpoint in the VBA. However, if I pass this Single to a C++ DLL, C++ reports it as the original value 876.34497.
So, is it actually stored in memory as the original value? Is this some limitation of the debugger? I'm unsure what is going on here, and it makes it difficult to verify that what I'm passing is what I'm getting on the C++ side.
I tried:
?CStr(test)
876.345
?CDbl(test)
876.344970703125
?CSng(test)
876.345
VBA isn't very straightforward, so at some level it must be stored as 876.34497 in memory. Otherwise, I don't think CDbl would be correct like it is.
VBA variables of type "single" are stored as "32-bit hardware implementation of IEEE 754[-]1985 [sic]." [see: https://msdn.microsoft.com/en-us/library/ee177324.aspx].
What this means in English is, "single" precision numbers are converted to binary and then rounded to fit in a 4-byte (32-bit) sequence. The exact process is very well described in Wikipedia under http://en.wikipedia.org/wiki/Single-precision_floating-point_format. The upshot is that all single-precision numbers are expressed as
(1) a 23-bit "fraction" which, together with an implied leading 1, forms a significand between 1 and 2, *times*
(2) an 8-bit exponent which (for normalized numbers) represents a multiplier between 2^(-126) and 2^127, *times*
(3) one more bit that makes the result positive or negative.
The process of converting numbers to binary and back causes two types of rounding errors:
(1) Significant Digits -- as you have noticed, there is a limit on significant digits. A 23-bit integer can only take 8,388,608 distinct values. Stated another way, no number can be expressed with greater than +/- 0.000012% precision. Reaching back to high school science, you may recall that that is another way of saying you cannot count on more than six significant digits (well, decimal digits, at least ... you do have 23 significant binary digits, 24 counting the implied leading 1). So any representation of a number with more than six significant digits will get rounded off. However, it won't get rounded off to the nearest decimal digit ... it will get rounded off to the nearest binary digit. This often causes some unexpected results (like yours).
(2) Binary conversion -- The other type of error is even more pernicious. There are some numbers with far fewer than six (decimal) digits that will get rounded off. For example, 1/5 in decimal is 0.2000000. It never gets "rounded off." But the same number in binary is 0.00110011001100110011.... repeating forever. (That sequence is equivalent to 1/8 + 1/16 + 1/16*(1/8+1/16) + 1/256*(1/8+1/16) ... ) If you use any finite number of binary digits to represent 0.20 and then convert it back to decimal, you will NEVER get exactly 0.20. For example, if you used eight bits, you would have 0.00110011 in binary, which is:
0.12500000
0.06250000
0.00781250
+ 0.00390625
------------
0.19921875
No matter how many binary digits you use, you will never get exactly 0.20, because 0.20 cannot be expressed as a finite sum of powers of two.
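You can see this effect directly with a tiny C++ sketch (C++ only because it is the other language in play on your DLL side; the printed value is what an IEEE 754 single-precision float actually stores for 0.2):
#include <cstdio>

int main() {
    float fifth = 0.2f;
    // Printed with plenty of digits, the stored value is visibly not 0.2
    // (output shown for an IEEE 754 single-precision float):
    std::printf("%.20f\n", fifth);   // 0.20000000298023223877
    return 0;
}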
That in a nutshell explains what's going on. When you assign 876.34497 to "test," it gets converted internally to:
0 10001000 10110110001011000010011
+    136        5,969,427
Which is (+1) * 2^(136-127) * (1 + 5,969,427/(2^23)) = 876.344970703125
Excel is automatically truncating the display of your single-precision number to show only six significant digits, because it knows that the seventh digit might be wrong. I can't tell you what the number is exactly because my excel doesn't display enough significant digits! But you get the point.
When you coerce the value into double precision, it keeps the entire binary string and pads the fraction with additional zero bits. It now allows you to display twice as many significant figures because it is double precision, but as you can see, the original conversion from 8 decimal digits to 23 binary bits (now followed by a long string of zeros) has introduced some errors. Not really errors, if you understand what it's doing; just artifacts. After all, it's doing exactly what you told it to do ... you just didn't know what you were telling it to do!
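Since the same bit pattern is visible on the C++ side of your DLL, here is a minimal C++ sketch of the three views of that one value (876.34497 is the number from the question; the format strings are just illustrative choices):
#include <cstdio>

int main() {
    float  f = 876.34497f;   // stored as the nearest IEEE 754 single
    double d = f;            // widening to double is exact -- it only pads zero bits

    std::printf("%.6g\n",  f);   // 876.345          (6 significant digits, like the VBA display)
    std::printf("%.9g\n",  f);   // 876.344971       (enough digits to round-trip a float)
    std::printf("%.12f\n", d);   // 876.344970703125 (the exact value that is stored)
    return 0;
}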
I want to perform some calculations and I want the result correct up to some decimal places, say 12.
So I wrote a sample:
#include <cstdio>
#define PI 3.1415926535897932384626433832795028841971693993751

int main() {
    double d, k, h;
    k = 999999/(2*PI);
    h = 999999;
    d = PI*k*k*h;
    printf("%.12f\n", d);
    return 0;
}
But it gives the output:
79577232813771760.000000000000
I even used setprecision(), but I get the same answer, just in exponential form.
cout<<setprecision(12)<<d<<endl;
prints
7.95772328138e+16
Used long double also, but in vain.
Now is there any way other than storing the integer part and the fractional part separately in long long int types?
If so, what can be done to get the answer precisely?
A double has only about 16 decimal digits of precision. Everything after the decimal point would be nonsense. (In fact, the last digit or two left of the point may not agree with an infinite-precision calculation.)
The precision of long double is not standardized; it may be that on your system it is the same as double, or no more precise. That would slightly surprise me, but it wouldn't violate anything.
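If you want to check what your own system gives you, a minimal sketch (the printed values are implementation-dependent, as the comments say):
#include <iostream>
#include <limits>

int main() {
    // digits10 is the number of decimal digits the type can hold without loss.
    // 15 for double is typical; long double may report 15, 18 or 33 depending
    // on the platform.
    std::cout << "double:      " << std::numeric_limits<double>::digits10      << " digits\n";
    std::cout << "long double: " << std::numeric_limits<long double>::digits10 << " digits\n";
    return 0;
}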
You need to read up on double-precision concepts again, more carefully.
A double gets its increased precision by using 64 bits.
Stuff before the decimal point is more significant than what comes after it.
So, when you have a large integer part, the lower-order digits get dropped -- this is what the other answers here describe as rounding off.
Update:
To increase precision, you'll need to use some library or change your language.
Check this other question: Best coding language for dealing with large numbers (50000+ digits)
Yet, I'll ask you to re-check your intent once more.
Do you really need 12 decimal places for numbers that have really high values (over 10 digits in the integer part, as in your example)? Maybe you won't really have large integer parts (in which case such code should work fine). But if you are tracking a value like 10000000000.123456789, I am really interested in exactly which application you are working on (astronomy?). If the integer part of your values is somewhere under 10000, you should be fine here.
Update2:
IF you must demonstrate that a specific formula works accurately within constrained error limits, the way to go is to rearrange the processing of your formula so that the least error is introduced.
For example, if you want to compute, say, (x * y) / z,
it would be prudent to try something like max(x,y)/z * min(x,y)
rather than the original form, which may overflow after (x * y), losing precision if the product does not fit in the roughly 16 significant digits of a double (see the sketch after the worked example below).
If you had just 2-digit precision:

               2-digit      regular precision
42 * 7         290          294
(42 * 7)/2     290/2        294/2
Result ==>     145          147

But ==>        42/2 = 21
               21 * 7 = 147
This is probably the intent of your contest.
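A small sketch of the same reordering idea with floats (the values are made up purely to force the overflow; on an IEEE 754 machine the out-of-range product becomes +inf):
#include <cstdio>

int main() {
    // Illustrative values only, chosen so that x * y overflows a float's range
    // even though the final result is comfortably representable.
    float x = 3.0e30f, y = 3.0e20f, z = 1.0e25f;

    float xy        = x * y;        // about 9e50: beyond float's range, becomes +inf
    float naive     = xy / z;       // inf / 1e25 is still inf
    float reordered = (x / z) * y;  // 3e5, then 9e25: every intermediate stays in range

    std::printf("naive:     %g\n", naive);      // inf
    std::printf("reordered: %g\n", reordered);  // about 9e25
    return 0;
}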
The double-precision binary format used by most computers can only hold about 16 digits, after that you'll get rounding. See http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Floating-point values have a limited number of digits. Just because your "PI" value has far more digits than a double will support doesn't alter the way the hardware works.
A typical (IEEE 754) double carries approximately 15-16 significant decimal digits, whether that's 0.12345678901235, 1234567.8901235, 12345678901235 or 12345678901235000000000, or some other variation.
In other words, yes, if you could carry out your calculation EXACTLY, you'd get lots of decimal places, because pi never ends. On a computer, you get about 15-16 digits no matter what input values you use - all that changes is where in that sequence the decimal point sits. To get more, you need "big number support", such as the GNU Multiple Precision (GMP) library.
You're looking for std::fixed. That tells the ostream not to use exponential form.
cout << setprecision(12) << std::fixed << d << endl;
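Putting it together with the code from the question, a minimal complete sketch (the shortened PI literal is deliberate; a double cannot hold the extra digits anyway). It should print the same 79577232813771760.000000000000 as the printf in the question:
#include <iostream>
#include <iomanip>

int main() {
    const double PI = 3.141592653589793; // a double can't hold more digits than this anyway
    double k = 999999 / (2 * PI);
    double h = 999999;
    double d = PI * k * k * h;

    std::cout << std::setprecision(12) << std::fixed << d << std::endl;
    // The digits after the decimal point carry no information, because the
    // integer part already uses up the ~16 digits a double has.
    return 0;
}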
When I debug my software in VS C++ by stepping the code I notice that some float calculations show up as a number with a trailing dot, i.e.:
1232432.
One operation that led up to this result is this:
float result = pow(10, a * 0.1f) / b;
where a is a large negative number around -50 to -100 and b is most often around 1. I read some articles about precision problems with floating point. My question is just whether the trailing dot is Visual Studio's way of telling me that the precision is very low on this number, i.e. in the variable result. If not, what does it mean?
This came up at work today, and I remember that there was a problem for larger numbers, so this did not occur every time (and by "this" I mean the trailing dot). But I do remember that it happened when there were seven digits in the number. Here they write that the precision of floats is seven digits:
C++ Float Division and Precision
Could this be the issue, with Visual Studio telling me about it by putting a dot at the end?
I THINK I FOUND IT! It says "The mantissa is specified as a sequence of digits followed by a period". What does "mantissa" mean? Can this be different on a PC and when running the code on a DSP? The thing is that I get different results, and the only thing that looks strange to me is this period, since I don't know what it means.
http://msdn.microsoft.com/en-us/library/tfh6f0w2(v=vs.71).aspx
If you're referring to the "sig figs" convention where "4.0" means 4±0.1 and "4.00" means 4±0.01, then no, there's no such concept in float or double. Numbers are always* stored with 24 or 53 significant bits (7.22 or 15.95 decimal digits) regardless of how many are actually "significant".
The trailing dot is just a decimal point without any digits after it (which is a legal C literal). It either means that
The value is 1232432.0 and they trimmed the unnecessary trailing zero, OR
Everything is being rounded to 7 significant digits (in which case the true value might also be 1232431.5, 1232431.625, 1232431.75, 1232431.875, 1232432.125, 1232432.25, 1232432.375, or 1232432.5.)
The real question is, why are you using float? double is the "normal" floating-point type in C(++), and float a memory-saving optimization.
* Pedants will be quick to point out denormals, x87 80-bit intermediate values, etc.
The precision is not variable; that is simply how VS is formatting it for display. The precision (or lack thereof) is always constant for a given floating-point number.
The MSDN page you linked to talks about the syntax of a floating-point literal in source code. It doesn't define how the number will be displayed by whatever tool you're using. If you print a floating-point number using either printf or std::cout << ..., the language standard specifies how it will be printed.
If you print it in the debugger (which seems to be what you're doing), it will be formatted in whatever way the developers of the debugger decided on.
There are a number of different ways that a given floating-point number can be displayed: 1.0, 1., 10.0E-001, and .1e+1 all mean exactly the same thing. A trailing . does not typically tell you anything about precision. My guess is that the developers of the debugger just used 1232432. rather than 1232432.0 to save space.
If you're seeing the trailing . for some values, and a decimal number with no . at all for others, that sounds like an odd glitch (possibly a bug) in the debugger.
If you're wondering what the actual precision is, for IEEE 32-bit float (the format most computers use these days), the next representable numbers before and after 1232432.0 are 1232431.875 and 1232432.125. (You'll get much better precision using double rather than float.)
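If you want to see those neighbouring values for yourself, a minimal sketch (std::nextafter simply asks for the next representable value in a given direction):
#include <cstdio>
#include <cmath>

int main() {
    float f = 1232432.0f;
    // The nearest representable IEEE 754 single-precision values on either side:
    std::printf("below: %.3f\n", std::nextafter(f, 0.0f));       // 1232431.875
    std::printf("value: %.3f\n", f);                             // 1232432.000
    std::printf("above: %.3f\n", std::nextafter(f, 2000000.0f)); // 1232432.125
    return 0;
}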
I'm learning about the representation of floating-point IEEE 754 numbers, and my textbook says:
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52).
I don't get what "implicit" means here... what's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, regardless of their sign?
Yes, all normalised numbers (other than the zeroes) have that bit set to one (a), so they make it implicit to prevent wasting space storing it.
In other words, they don't store that bit at all, and the space it would have taken is reused to increase the precision of your numbers.
Keep in mind that this is the first bit of the fraction, not the first bit of the binary pattern. The first bit of the binary pattern is the sign, followed by a few bits of exponent, followed by the fraction itself.
For example, a single precision number is (sign, exponent, fraction):
<1> <--8---> <---------23----------> <- bit widths
s eeeeeeee fffffffffffffffffffffff
If you look at the way the number is calculated, it's:
(-1)^sign x 1.fraction x 2^(exponent - bias)
So the significand used for calculating that value is 1.fffff...fff (in binary).
(a) There is actually a class of numbers (the denormalised ones and the zeroes) for which that property does not hold true. These numbers all have a biased exponent of zero but the vast majority of numbers follow the rule.
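A small C++ sketch of the same idea, pulling the three fields apart and putting the implied 1 back in (this simple reconstruction covers normalised numbers only; zeroes and denormals need the special handling mentioned in footnote (a), and the example value 6.5 is arbitrary):
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cmath>

int main() {
    float f = 6.5f;   // an arbitrary example: 1.625 * 2^2

    // Reinterpret the float's 32 bits as an unsigned integer.
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);

    unsigned sign     = bits >> 31;            // 1 bit
    unsigned exponent = (bits >> 23) & 0xFF;   // 8 bits, stored with a bias of 127
    unsigned fraction = bits & 0x7FFFFF;       // the 23 stored fraction bits

    // Rebuild the value; the leading 1 of the significand is implicit.
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + fraction / 8388608.0)          // 8388608 = 2^23
                 * std::pow(2.0, (int)exponent - 127);

    std::printf("sign=%u exponent=%u fraction=%u -> %g\n",
                sign, exponent, fraction, value);        // 0 129 5242880 -> 6.5
    return 0;
}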
Here is what they are saying. The first non-zero bit is always going to be 1. So there is no need for the binary representation to include that bit, since you know what it is. So they don't. They tell you where that first 1 is, and then they give the bits after it. So there is a 1 that is not explicitly in the binary representation, whose location is implicit from the fact that they told you where it was.
It may also be helpful to note that we are dealing in binary representations of a number. The reason that the first digit of a normalized binary number (that is, no leading zeroes) has to be 1 is that 1 is the only non-zero value available to us in this representation. So, the same would not be true for, say, base-three representations.