I have to convert some code from Fortran and I don't know what this statement means:
var1 = 10.D00
Can someone explain what it means?
It's just 10.0 in scientific notation with double precision (that's what the D stands for).
See: http://www.fortran.com/F77_std/rjcnf0001-sh-4.html#sh-4.2.1:
4.5.1 Double Precision Exponent.
The form of a double precision exponent is the letter D followed by an optionally signed integer constant. A double precision exponent denotes a power of ten. Note that the form and interpretation of a double precision exponent are identical to those of a real exponent, except that the letter D is used instead of the letter E.
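For a port to C++, the constant itself can simply be written as 10.0 (or 10.0e0), since an unsuffixed floating literal is already a double there. If such constants arrive as text instead, for example in a data file, a minimal sketch of a converter might look like this. The name parseFortranDouble is a hypothetical helper, not a standard function, and the sketch assumes the input is a well-formed Fortran real constant.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: converts a Fortran real constant such as "10.D00"
// to a C++ double by rewriting the D exponent marker as E.
double parseFortranDouble(std::string text) {
    for (char& c : text) {
        if (c == 'D' || c == 'd') {
            c = 'E';  // strtod understands E exponents, not D
        }
    }
    return std::strtod(text.c_str(), nullptr);
}
```

Calling parseFortranDouble("10.D00") should give the same value as the C++ literal 10.0.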
Is there a standard function that returns the max() or min() of two float values?
I have written the fixed point implementation for this min() and max() function from q0s32 to q32s0 type (33 types).
I want to test the precision loss of my functions against std::min() and std::max(), but the results from the std functions are not good.
I tried it this way, but the result is not what I expected.
Code :
float num1 = 4.5000000054f;
float num2 = 4.5000000057f;
float resf = std::max(num1,num2);
printf("Result is :%20.15f\n",resf);
printf("num1 :%20.15f and num2 :%20.15f\n",num1,num2);
Output:
Result is : 4.500000000000000
num1 : 4.500000000000000 and num2 : 4.500000000000000
Most implementations of C++ use the IEEE 754 standard for floating point arithmetic. Here is some useful information regarding this issue:
In IEEE 754, float is a 32-bit single precision floating point number (1 bit for the sign, 8 bits for the exponent, and 23 bits for the value), i.e. float has about 7 decimal digits of precision.
In IEEE 754, double is a 64-bit double precision floating point number (1 bit for the sign, 11 bits for the exponent, and 52 bits for the value), i.e. double has about 15 decimal digits of precision.
You need to use double instead to get the desired results.
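A small sketch of the difference, assuming IEEE 754 float and double as described above. The helper names sameAsFloat and maxAsDouble are made up for illustration.

```cpp
#include <algorithm>

// Returns true when a and b collapse to the same value once stored as float.
bool sameAsFloat(double a, double b) {
    return static_cast<float>(a) == static_cast<float>(b);
}

// std::max on doubles keeps the two inputs apart when they differ as doubles.
double maxAsDouble(double a, double b) {
    return std::max(a, b);
}
```

With the values from the question, sameAsFloat(4.5000000054, 4.5000000057) is true, because both round to 4.5 as float, while maxAsDouble picks out 4.5000000057.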
Is there an algorithm in C++ that will allow me to, given a floating-point value V of type T (e.g. double or float), returns the closest value to V in a given direction (up or down) that can be represented exactly in less than or equal to a specified number of decimal places D ?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @patricia-shanahan I'll try describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format) , I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
<< std::setprecision(maxD)
<< F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;
public class Test {
public static void main(String args[]) {
testIt(2, 0.000001);
testIt(10, 0.000001);
testIt(6, 670000.08267799998);
}
private static void testIt(int d, double in) {
System.out.print(in + " to " + d + " digits");
System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
System.out.println(" Down: "
+ new BigDecimal(roundDownExact(d, in)).toString());
}
public static double roundUpExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.ceil(roundee);
return roundee / factor;
}
public static double roundDownExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.floor(roundee);
return roundee / factor;
}
}
In general, decimal fractions are not exactly representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜). The converse always holds: every binary fraction is exactly representable as a decimal fraction, because 2 is a factor of 10, whereas 10 is not a factor of 2 or of any power of two. But if a decimal fraction is not an integer multiple of some negative power of 2, its binary representation will be an infinitely long repeating sequence, like the representation of ⅓ in decimal (.333...).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with at most that many significant decimal digits can be converted to a double (for example, with scanf) and then converted back to the same decimal representation (for example, with printf). To go in the opposite direction without losing information (start with a double, convert it to decimal, and then convert it back), you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles.)
One way to provide something close to the question would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating point number if that floating point number is the floating point number which comes closest to the value of the decimal number. One way to find that floating point number would be to use scanf or strtod to convert the decimal number to a floating point number, and then try the floating point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
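The round-trip claims above can be checked with a short sketch. It assumes IEEE-754 64-bit doubles and a standard library whose printf and strtod round correctly; survivesRoundTrip and the two constants are illustrative names, with std::numeric_limits standing in for the DBL_DIG and DBL_DECIMAL_DIG macros.

```cpp
#include <cstdio>
#include <cstdlib>
#include <limits>

// Formats x with the given number of significant decimal digits, parses the
// text back, and reports whether the original double was recovered exactly.
bool survivesRoundTrip(double x, int significantDigits) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*g", significantDigits, x);
    return std::strtod(buffer, nullptr) == x;
}

// Digits needed for double -> text -> double (17 for IEEE-754 64-bit doubles).
const int kRoundTripDigits = std::numeric_limits<double>::max_digits10;
// Digits guaranteed for text -> double -> text (15 for IEEE-754 64-bit doubles).
const int kSafeDecimalDigits = std::numeric_limits<double>::digits10;
```

For example, survivesRoundTrip(1.0 / 3.0, kSafeDecimalDigits) is expected to be false, while the same call with kRoundTripDigits is expected to be true.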
Total re-write.
Based on OP's new requirement and using a power of 2 as suggested by @Patricia Shanahan, a simple C solution:
// needs <math.h>; the (int) cast only matters if D is an unsigned type
double roundedV = ldexp(round(ldexp(V, D)), -(int)D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)), -(int)D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)), -(int)D); // at or just less
The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
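Wrapped into the signature from the question, the same idea might look like the sketch below. Note that it rounds to a multiple of 2^-D (a binary fraction), not to D decimal places, and it assumes a finite V and a small D so that the scaling by a power of two is exact.

```cpp
#include <cmath>

// Sketch of the F() signature from the question, rounding V to a multiple of
// 2^-D (a binary fraction), NOT to D decimal places.
// Assumes a finite V and a small D (e.g. D <= 13) so that the scaling is exact.
double F(double V, unsigned int D, bool roundUpIfTrueElseDown) {
    const int exponent = static_cast<int>(D);       // avoid negating an unsigned value
    const double scaled = std::ldexp(V, exponent);  // V * 2^D
    const double rounded = roundUpIfTrueElseDown ? std::ceil(scaled)
                                                 : std::floor(scaled);
    return std::ldexp(rounded, -exponent);          // rounded / 2^D
}
```

For V = 670000.08267799998 and D = 6 this reproduces the values from the Java sketch: 670000.09375 when rounding up and 670000.078125 when rounding down.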
In C++ integers must be represented in binary, but floating point types can have a decimal representation.
If FLT_RADIX from <float.h> (<cfloat> in C++) is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …
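Checking the radix takes only a few lines. This is a minimal sketch; doubleIsDecimal and floatingRadix are illustrative names, and on mainstream implementations the radix is expected to be 2.

```cpp
#include <cfloat>
#include <limits>

// True only on an implementation whose double uses a decimal radix.
bool doubleIsDecimal() {
    return std::numeric_limits<double>::radix == 10;
}

// FLT_RADIX from <cfloat> reports the radix as a macro.
int floatingRadix() {
    return FLT_RADIX;
}
```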
In Visual C++ 2010, I tried this
double d= DBL_MAX;
double dblmaxintpart;
modf(DBL_MAX, &dblmaxintpart);
In the debugger window I put
d == dblmaxintpart
which gave true as result.
Can I assume that DBL_MAX is equal to its integer part as an always valid assertion?
Yes, the integer part of a double which represents an integer will always be the double itself, even at DBL_MAX. In fact, any double greater than 2^52 will have itself as an integer part, because doubles of that size don't have enough mantissal bits to represent a fraction.
For similar reasons, not all integers above 2^53 are representable as doubles (though when converted to doubles, they will still be integers).
Finally, the fractional part of any double with magnitude less than 1 will be exactly itself, and the fractional and integer parts of any double, when added, will produce exactly the original double.
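These properties can be checked outside the debugger with a short sketch, assuming IEEE 754 doubles. The helper names integerPart and partsReconstruct are made up for illustration.

```cpp
#include <cfloat>
#include <cmath>

// Integer part of x as computed by modf.
double integerPart(double x) {
    double whole = 0.0;
    std::modf(x, &whole);
    return whole;
}

// True when the integer and fractional parts add back to exactly x.
bool partsReconstruct(double x) {
    double whole = 0.0;
    const double fraction = std::modf(x, &whole);
    return whole + fraction == x;
}

// True when DBL_MAX is its own integer part.
bool maxIsWhole() {
    return integerPart(DBL_MAX) == DBL_MAX;
}
```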
In Fortran 90 (using gfortran on Mac OS X) if I assign a value to a double-precision variable without explicitly tacking on a kind, the precision doesn't "take." What I mean is, if I run the following program:
program sample_dp
implicit none
integer, parameter :: sp = kind(1.0)
integer, parameter :: dp = kind(1.0d0)
real(sp) :: a = 0.
real(dp) :: b = 0., c = 0., d = 0.0_dp, e = 0_dp
! assign values
a = 0.12345678901234567890
b = 0.12345678901234567890
c = DBLE(0.12345678901234567890)
d = 0.12345678901234567890_dp
write(*,101) a, b, c, d
101 format(1x, 'Single precision: ', T27, F17.15, / &
1x, 'Double precision: ', T27, F17.15, / &
1x, 'Double precision (DBLE): ', T27, F17.15, / &
1x, 'Double precision (_dp): ', T27, F17.15)
end program
I get the result:
Single precision: 0.123456791043282
Double precision: 0.123456791043282
Double precision (DBLE): 0.123456791043282
Double precision (_dp): 0.123456789012346
The single precision result starts rounding off at the 8th decimal as expected, but only the double precision variable I assigned explicitly with _dp keeps all 16 digits of precision. This seems odd, as I would expect (I'm relatively new to Fortran) that a double precision variable would automatically be double-precision. Is there a better way to assign double precision variables, or do I have to explicitly type them as above?
A real which isn't marked as double precision will be assumed to be single precision. Just because sometime later you assign it to a double precision variable, or convert it to double precision, that doesn't mean that the value will 'magically' be double precision. It doesn't look ahead to see how the value will be used.
There are several questions linking here so it is good to state some details more explicitly with examples, especially for beginners.
As stated by MRAB in his correct answer, an expression is always evaluated without any context, so
0.12345678901234567890
is a default (single) precision floating point literal, no matter where it appears. The same holds for floating point numbers in the exponential form
0.12345678901234567890E0
it is also a default precision number.
If one wants to use a double precision constant, one can use D instead of E in the above form. Even if such a double precision constant is assigned to a default precision variable, it is first treated as a double precision number and then converted to default precision.
The way you are using in your question (employing the kind notation and several kind constants) is more general and more modern, but the principle is the same.
0.12345678901234567890_sp
is a number of kind sp and
0.12345678901234567890_dp
is a number of kind dp, and it does not matter where they appear.
As your example shows, it is not only about assignment. In the line
c = DBLE(0.12345678901234567890)
first the number 0.12345678901234567890 is default precision. Then it is converted to double precision by DBLE, but that is done after some of the digits are already lost. Then this new double precision number is assigned to c.
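The same effect can be reproduced in C++, which is used here only as an analogy (it is not Fortran code): the f suffix plays the role of default precision, and an unsuffixed literal plays the role of the D or _dp forms. The function names are made up for illustration.

```cpp
#include <cmath>

// The f suffix makes this a float literal; digits beyond float precision are
// dropped before the widening conversion to double.
double fromSingleLiteral() {
    return 0.12345678901234567890f;
}

// An unsuffixed literal is already a double in C++.
double fromDoubleLiteral() {
    return 0.12345678901234567890;
}

// Size of the error introduced by going through single precision first.
double literalError() {
    return std::fabs(fromSingleLiteral() - fromDoubleLiteral());
}
```

The two functions return different values, even though both return type double, because the precision is fixed by the literal and not by the destination.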
Possible Duplicates:
Incorrect floating point math?
Float compile-time calculation not happening?
Strange stuff going on today, I'm about to lose it...
#include <iomanip>
#include <iostream>
using namespace std;
int main()
{
cout << setprecision(14);
cout << (1/9+1/9+4/9) << endl;
}
This code outputs 0 on MSVC 9.0 x64 and x86 and on GCC 4.4 x64 and x86 (default options and strict math...). And as far as I remember, 1/9+1/9+4/9 = 6/9 = 2/3 != 0
1/9 is zero because 1 and 9 are both integers, so the division is integer division, which truncates toward zero. The same applies to 4/9.
If you want to express floating-point division through arithmetic literals, you have to either use floating-point literals 1.0/9 + 1.0/9 + 4.0/9 (or 1/9. + 1/9. + 4/9. or 1.f/9 + 1.f/9 + 4.f/9) or explicitly cast one operand to the desired floating-point type (double) 1/9 + (double) 1/9 + (double) 4/9.
P.S. Finally my chance to answer this question :)
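Side by side, the two variants look like this. It is a minimal sketch; integerSum and floatingSum are illustrative names.

```cpp
// All operands are int, so each quotient truncates to 0 before the addition.
int integerSum() {
    return 1 / 9 + 1 / 9 + 4 / 9;
}

// One double operand per division is enough to convert the other to double.
double floatingSum() {
    return 1.0 / 9 + 1.0 / 9 + 4.0 / 9;
}
```

integerSum() returns 0, while floatingSum() returns a value close to 2/3.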
Use a decimal point in your calculations to force floating point math, optionally along with one of these suffixes on your numbers: f, l, F, L. A number without a decimal point, an exponent, or one of those suffixes is not considered a floating point literal.
C++03 2.13.3-1 on Floating literals:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.
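The typing rules in that passage can be observed through overload resolution. This is a small sketch; literalKind is a made-up helper that just reports which overload a literal selects.

```cpp
#include <string>

// Overloads that report which type a literal has.
std::string literalKind(int)         { return "int"; }
std::string literalKind(float)       { return "float"; }
std::string literalKind(double)      { return "double"; }
std::string literalKind(long double) { return "long double"; }
```

Here literalKind(1) selects the int overload, literalKind(1.) and literalKind(1e0) select double, literalKind(1.f) selects float, and literalKind(1.0L) selects long double.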
They are all integers. So 1/9 is 0. 4/9 is also 0. And 0 + 0 + 0 = 0. So the result is 0. If you want fractions, cast your fractions to floats.
1/9(=0)+1/9(=0)+4/9(=0) = 0
Well, in C++ (and many other languages), 1/9+1/9+4/9 is zero, because it is integer arithmetic.
You probably want to write 1/9.0+1/9.0+4/9.0
Unless you explicitly write a decimal point, the numbers C++ uses are integers, so 1/9 = 4/9 = 0 and 0 + 0 + 0 = 0.
You should simply add the decimal point: 1.0, etc.
By the C rules of types, you're doing all integer math there. 1/9 and 4/9 are both truncated to 0 (as integers). If you wrote 1.0/9.0 etc, it would use double precision math and do what you want.
You might make it a habit to use more parentheses. They cost little time, make clear what you intend, and ensure you get what you wanted. Well mostly... ;)