The minimum and maximum floating point exponent of a real number - fortran

How can I get the minimum and maximum exponent for 32- and 64-bit real numbers? I am doing some work to avoid underflows and overflows and would need to know those numbers.
I would also need the base for floating point numbers.
Is it possible in Fortran to get the equivalent of i1mach?

For non-zero reals, the numeric model is s * b^e * \sum_{k=1}^{p} f_k * b^{-k}.
To get the value of the base b use radix(). The minimum and maximum values of the exponent e can be found by applying exponent() to tiny() and huge(); the intrinsics minexponent() and maxexponent() return them directly.
use, intrinsic :: iso_fortran_env, only : real32, real64
print 1, "real32", RADIX(1._real32), EXPONENT(TINY(1._real32)), EXPONENT(HUGE(1._real32))
print 1, "real64", RADIX(1._real64), EXPONENT(TINY(1._real64)), EXPONENT(HUGE(1._real64))
1 FORMAT (A," has radix ", I0, " with exponent range ", I0, " to ", I0, ".")
end

The function range() returns the decimal exponent range for the kind. The intrinsic function huge() returns the largest representable number of a given numeric kind; from that you can also recover the exponent by employing a logarithm. See also selected_real_kind().
The base for gfortran and other common compilers is 2, but you can test it using radix(). There may be base-10 kinds in the future.
In the same manual I linked you will find other useful intrinsics like tiny(), precision(), epsilon(), and spacing(); you can follow the links in "See also:".

Related

What Are the Maximum Number of Base-10 Digits in the Integral Part of a Floating Point Number

I want to know if there is something in the standard, like a #define or something in numeric_limits, which would tell me the maximum number of base-10 digits in the integral part of a floating-point type.
For example, suppose I have some floating-point type whose largest value is 1234.567; I'd like something defined in the standard that would tell me 4 for that type.
Is there an alternative to me doing this?
template <typename T>
constexpr auto integral_digits10 = static_cast<int>(log10(numeric_limits<T>::max())) + 1;
As Nathan Oliver points out in the comments, C++ provides std::numeric_limits<T>::digits10.
the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
The explanation for this is given by Rick Regan here. In summary, if your binary floating-point format can store b bits in the significand, then you are guaranteed to be able to round-trip up to d decimal digits, where d is the largest integer such that
10^d < 2^(b-1)
In the case of IEEE 754 binary64 (the standard double in C++ on most systems nowadays), b = 53 and 2^(b-1) = 4,503,599,627,370,496, so the format is only guaranteed to be able to represent d = 15 digits.
This result holds across all the digits, whereas you ask only about the integral part. However, we can easily find a counterexample there as well by choosing x = 2^b + 1, which is the smallest integer not representable by the format: for binary64 this is 9,007,199,254,740,993, which also happens to have 16 digits, and so will need to be rounded.
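Both facts are easy to check; a minimal sketch, assuming IEEE 754 binary64 doubles:
#include <cstdio>
#include <limits>

int main()
{
    // 15 guaranteed round-trip decimal digits for binary64:
    std::printf("digits10 = %d\n", std::numeric_limits<double>::digits10);

    // 2^53 + 1 = 9007199254740993 is the smallest integer a double cannot
    // hold; the literal rounds to the neighboring even significand:
    double d = 9007199254740993.0;
    std::printf("%.0f\n", d);  // prints 9007199254740992
}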
The value that you are looking for is max_exponent10, which:
Is the largest positive number n such that 10^n is a representable finite value of the floating-point type
Because of this relationship:
log10(x) = n
10^n = x
your calculation, log10(numeric_limits<T>::max()), is finding n the way the first equation does. The definition of max_exponent10 says the same thing from the other direction: 10^(n + 1) would be larger than numeric_limits<T>::max(), but 10^n is less than or equal to numeric_limits<T>::max(). So numeric_limits<T>::max_exponent10 is what you're looking for.
Note that you will still need the + 1, as in your example, to account for the 1's place (because log10(1) = 0). So the number of base-10 digits required to represent numeric_limits<T>::max() will be:
numeric_limits<T>::max_exponent10 + 1
If you feel like validating that by hand you can check here:
http://coliru.stacked-crooked.com/a/443e4d434cbcb2f6
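For reference, a minimal sketch of that; the printed value of 309 assumes IEEE 754 binary64:
#include <cmath>
#include <iostream>
#include <limits>

template <typename T>
constexpr int integral_digits10 = std::numeric_limits<T>::max_exponent10 + 1;

int main()
{
    // 309 for binary64: numeric_limits<double>::max() is about 1.8e308
    std::cout << integral_digits10<double> << '\n';
    // the log10 formulation from the question agrees at runtime:
    std::cout << static_cast<int>(std::log10(std::numeric_limits<double>::max())) + 1 << '\n';
}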

How to shift a floating-point value to the nearest one that can be represented exactly in a specific number of decimal places?

Is there an algorithm in C++ that will allow me to, given a floating-point value V of type T (e.g. double or float), return the closest value to V in a given direction (up or down) that can be represented exactly in less than or equal to a specified number of decimal places D?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @Patricia Shanahan I'll try describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format), I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
          << std::setprecision(maxD)
          << F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;

public class Test {
    public static void main(String args[]) {
        testIt(2, 0.000001);
        testIt(10, 0.000001);
        testIt(6, 670000.08267799998);
    }

    private static void testIt(int d, double in) {
        System.out.print(in + " to " + d + " digits");
        System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
        System.out.println(" Down: "
                + new BigDecimal(roundDownExact(d, in)).toString());
    }

    public static double roundUpExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.ceil(roundee);
        return roundee / factor;
    }

    public static double roundDownExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.floor(roundee);
        return roundee / factor;
    }
}
In general, decimal fractions are not precisely representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜), because those decimal fractions are also binary fractions; conversely, all binary fractions are precisely representable as decimal fractions. (That's because 2 is a factor of 10, but 10 is not a factor of 2, or of any power of two.) If a number's fractional part cannot be written with a power-of-two denominator, its binary representation will be an infinitely long cyclic sequence, like the representation of ⅓ in decimal (.333...).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
One way to provide something close to what the question asks would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating-point number if that floating-point number is the one which comes closest to the value of the decimal number. One way to find that floating-point number is to use scanf or strtod to convert the decimal number to a floating-point number, and then try the floating-point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
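A minimal sketch of that sprintf/strtod round trip, shown for round-to-nearest only (the digit-string increment needed for directed rounding is omitted):
#include <cstdio>
#include <cstdlib>

int main()
{
    double v = 670000.08267799998;
    char buf[64];

    // Round to 6 decimal places in text form, then convert back. This
    // trusts the library's conversions, as cautioned above; for directed
    // rounding you would increment/decrement the digit string here.
    std::snprintf(buf, sizeof buf, "%.6f", v);
    double rounded = std::strtod(buf, nullptr);

    std::printf("%s -> %.17g\n", buf, rounded);
}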
Total re-write.
Based on OP's new requirement, and using power-of-2 scaling as suggested by @Patricia Shanahan, a simple C solution:
double roundedV = ldexp(round(ldexp(V, D)),-D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D); // at or just less
The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
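Wrapped into a complete program (C++ here), this reproduces the key output of the Java sketch above, assuming IEEE 754 binary64:
#include <cmath>
#include <cstdio>

int main()
{
    double V = 670000.08267799998;
    int D = 6;

    // Directed rounding to a multiple of 2^-D, exactly as in the one-liners above.
    double up   = std::ldexp(std::ceil (std::ldexp(V, D)), -D);
    double down = std::ldexp(std::floor(std::ldexp(V, D)), -D);

    std::printf("up:   %.17g\n", up);    // 670000.09375
    std::printf("down: %.17g\n", down);  // 670000.078125
}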
In C++ integers must be represented in binary, but floating-point types can have a decimal representation.
If FLT_RADIX from <float.h> is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …
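Checking the radix is a one-liner; on virtually every mainstream implementation this prints 2:
#include <cfloat>
#include <cstdio>

int main()
{
    std::printf("FLT_RADIX = %d\n", FLT_RADIX);
}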

Exact decimal datatype for C++?

PHP has a decimal type, which doesn't have the "inaccuracy" of floats and doubles, so that 2.5 + 2.5 = 5 and not 4.999999999978325 or something like that.
So I wonder if there is such a data type implementation for C or C++?
The Boost.Multiprecision library has a decimal based floating point template class called cpp_dec_float, for which you can specify any precision you want.
#include <iostream>
#include <iomanip>
#include <boost/multiprecision/cpp_dec_float.hpp>

int main()
{
    namespace mp = boost::multiprecision;
    // here I'm using a predefined type that stores 100 digits,
    // but you can create custom types very easily with any level
    // of precision you want.
    typedef mp::cpp_dec_float_100 decimal;

    decimal tiny("0.0000000000000000000000000000000000000000000001");
    decimal huge("100000000000000000000000000000000000000000000000");
    decimal a = tiny;

    while (a != huge)
    {
        std::cout.precision(100);
        std::cout << std::fixed << a << '\n';
        a *= 10;
    }
}
Yes:
There are arbitrary precision libraries for C++.
A good example is The GNU Multiple Precision arithmetic library.
If you are looking for a data type supporting money / currency, then try this:
https://github.com/vpiotr/decimal_for_cpp
(it's a header-only solution)
Precision will always be finite. On any computer, in any number representation, there will always be numbers that can be represented accurately and other numbers that can't.
Computers use a base-2 system. Numbers such as 0.5 (2^-1), 0.125 (2^-3), and 0.375 (2^-2 + 2^-3) will be represented accurately (0.1, 0.001, 0.011 for the above cases).
In a base-3 system those numbers could not be represented accurately (half would be 0.111111...), but other numbers would be (e.g. 2/3 would be 0.2).
Even in the human base-10 system there are numbers which can't be represented accurately, for example 1/3.
You can use a rational-number representation and all the above will be accurate (1/2, 1/3, 3/8, etc.), but there will always be irrational numbers too. You are also practically limited by the sizes of the integers in this representation.
For every non-representable number you can extend the representation to include it explicitly (e.g. compare rational numbers with a representation a/b + c/d*sqrt(2)), but there will always be more numbers which still cannot be represented accurately. There is a mathematical proof that says so.
So - let me ask you this: what exactly do you need? Maybe precise computation on decimal-based numbers, e.g. in some monetary calculation?
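To make the rational-representation idea concrete, a toy sketch (the Rational type and helpers here are illustrative only; a real library such as Boost.Rational handles signs, overflow, and comparisons properly):
#include <cstdio>
#include <numeric>  // std::gcd (C++17)

struct Rational {
    long long num, den;
};

Rational reduced(Rational r)
{
    long long g = std::gcd(r.num, r.den);
    return {r.num / g, r.den / g};
}

Rational add(Rational a, Rational b)
{
    // exact: put over a common denominator, then reduce
    return reduced({a.num * b.den + b.num * a.den, a.den * b.den});
}

int main()
{
    Rational r = add({1, 3}, {1, 6});          // 1/3 + 1/6
    std::printf("%lld/%lld\n", r.num, r.den);  // prints 1/2, exactly
}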
What you're asking is anti-physics.
What PHP (and C++ as well) does is hide the inaccuracy by rounding the result at the time it is printed out, reducing the number of significant digits:
double x = 2.5;
x += 2.5;
std::cout << x << std::endl;
simply prints x with 6 significant digits (while x itself carries more internally), so it is rounded to 5, cutting away the imprecision.
An alternative is not using floating point at all, and implementing data types that do plain integer "scaled" arithmetic: 25/10 + 25/10 = 50/10.
Note, however, that this reduces the upper limit representable by each integer type; the gain in precision (and exactness) means overflow is reached sooner.
Rational arithmetic is also possible (each number represented by a "numerator" and a "denominator"), with no precision loss from division (which, in fact, is never actually performed unless it is exact), but again with values that grow as the number of operations grows (the less "rational" a number is, the bigger its numerator and denominator become), with a greater risk of overflow.
In other words, the fact that a finite number of bits is used (no matter how they are organized) will always result in a loss you have to pay, either on the side of small numbers or on the side of big ones.
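A minimal sketch of the scaled-integer alternative mentioned above, storing money as integer hundredths so that 2.50 + 2.50 is exact integer math:
#include <cstdint>
#include <cstdio>

int main()
{
    std::int64_t a = 250;       // 2.50 stored as 250 hundredths
    std::int64_t b = 250;       // 2.50
    std::int64_t sum = a + b;   // 500, i.e. exactly 5.00

    std::printf("%lld.%02lld\n",
                static_cast<long long>(sum / 100),
                static_cast<long long>(sum % 100));
}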
I presume you are talking about the Binary Calculator (BCMath) in PHP. No, there isn't one in the C runtime or the STL, but you can write your own if you are so inclined.
Here is a C++ version of BCMath compiled using Facebook's HipHop for PHP:
http://fossies.org/dox/facebook-hiphop-php-cf9b612/dir_2abbe3fda61b755422f6c6bae0a5444a.html
Being a higher-level language, PHP just cuts off what you call "inaccuracy", but it's certainly there. In C/C++ you can achieve a similar effect by casting the result to an integer type.

C++ function that does base-10 significand + exponent calculation from double

I need to represent numbers using the following structure. The purpose of this structure is to avoid losing precision.
struct PreciseNumber
{
    long significand;
    int exponent;
};
Using this structure, the actual double value can be represented as value = significand * 10^exponent.
Now I need to write a utility function which can convert a double into a PreciseNumber.
Can you please let me know how to extract the exponent and significand from a double?
The prelude is somewhat flawed.
Firstly, barring any restrictions on storage space, conversion from a double to a base-10 significand-exponent form won't alter the precision in any way. To understand why, consider the following: any terminating binary fraction (like the one that forms the mantissa on a typical IEEE 754 float) can be written as a sum of negative powers of two. Each negative power of two is a terminating decimal fraction itself, and hence it follows that their sum must be terminating in decimal as well.
However, the converse isn't necessarily true. For instance, 0.3 in base 10 is equivalent to the non-terminating 0.01 0011 0011 0011 ... in base 2. Fitting this into a fixed-size mantissa blows some precision out of it (which is why 0.3 is actually stored as something that translates back to 0.29999999999999999).
By this, we may conclude that any precision intended by storing the numbers in decimal significand-exponent form either has already been lost, or simply isn't gained at all.
Of course, you might think of the apparent loss of accuracy generated by storing a decimal number as a float as loss in precision, in which case the Decimal32 and Decimal64 floating point formats may be of some interest -- check out http://en.wikipedia.org/wiki/Decimal64_floating-point_format.
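The 0.3 example can be observed directly, assuming IEEE 754 binary64 and a correctly rounded printf:
#include <cstdio>

int main()
{
    // 0.3 has no terminating binary expansion, so the stored neighbor
    // shows through when printed at 17 significant digits:
    std::printf("%.17g\n", 0.3);  // 0.29999999999999999
}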
This is a very difficult problem. You might want to see how much code it takes to implement a double-to-string conversion (for printf, e.g.). You might steal the code from GNU's implementation (glibc).
You cannot convert an "imprecise" double into a "precise" decimal number, because the required "precision" simply isn't there to begin with (otherwise why would you even want to convert?).
This is what happens if you try something like it in Java:
BigDecimal x = new BigDecimal(0.1);
System.out.println(x);
The output of the program is:
0.1000000000000000055511151231257827021181583404541015625
Well, you're at less precision than a typical double. Your significand is a long, giving you a range from about -2 billion to +2 billion (assuming a 32-bit long), which is more than 9 but fewer than 10 digits of precision.
Here's an untested starting point for some simple math on PreciseNumbers (written with shortened member names, s for the significand and e for the exponent):
PreciseNumber Multiply(PreciseNumber lhs, PreciseNumber rhs)
{
    PreciseNumber ret;
    ret.s = lhs.s;
    ret.e = lhs.e;
    ret.s *= rhs.s;
    ret.e += rhs.e;  // exponents add when multiplying
    return ret;
}

PreciseNumber Add(PreciseNumber lhs, PreciseNumber rhs)
{
    PreciseNumber ret;
    ret.s = lhs.s;
    ret.e = lhs.e;
    ret.s += (rhs.s * pow(10, rhs.e - lhs.e));  // align rhs to lhs's exponent
    return ret;
}
I didn't take care of any renormalization, but in both cases there are places where you have to worry about overflows, underflows, and loss of precision. Just because you're doing it yourself rather than letting the computer take care of it in a double doesn't mean the same pitfalls aren't there. The only way to not lose precision is to keep track of all of the digits.
Here's a very rough algorithm. I'll try to fill in some details later.
Take the log10 of the number; its integer part x is the exponent. Divide the double by 10^x if x is positive, or multiply it by 10^-x if x is negative, so that it is normalized to the range [1, 10).
Start with a significand of zero. Repeat the following 15 times, since a double carries about 15 significant digits:
Multiply the previous significand by 10.
Take the integer portion of the double, add it to the significand, and subtract it from the double.
Subtract 1 from the exponent.
Multiply the double by 10.
When finished, take the remaining double value and use it for rounding: if it's >= 5, add one to the significand.
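A minimal C++ sketch of that loop (edge cases such as zero, negatives, infinities, and NaN are unhandled, and the loop's own floating-point error can disturb the trailing digits; also note the exponent needs one final correction, because the loop decrements it 15 times while only 14 decimal shifts separate the first and last digit):
#include <cmath>
#include <cstdio>

struct PreciseNumber {
    long long significand;  // widened from long so 15 digits always fit
    int exponent;
};

PreciseNumber toPrecise(double v)
{
    int e = static_cast<int>(std::floor(std::log10(v)));  // position of leading digit
    v /= std::pow(10.0, e);                                // normalize into [1, 10)
    long long sig = 0;
    for (int i = 0; i < 15; ++i) {   // a double carries about 15 decimal digits
        int digit = static_cast<int>(v);
        sig = sig * 10 + digit;
        v = (v - digit) * 10.0;
        --e;
    }
    if (v >= 5.0) ++sig;             // round using the leftover fraction
    return {sig, e + 1};             // +1 corrects the loop's extra decrement
}

int main()
{
    PreciseNumber p = toPrecise(0.1);
    std::printf("%lld * 10^%d\n", p.significand, p.exponent);
}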

double type digits in C++

IEEE 754 (64-bit) floating point is supposed to correctly represent 15 significant digits, although the internal representation maintains 17 digits. Is there a way to force the 16th and 17th digits to zero?
Ref:
http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx :
[...]
Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences:
[...]
Example numbers:
d1 = 97842111437.390091
d2 = 97842111437.390076
d1 and d2 differ in the 16th and 17th significant digits, which are not supposed to be significant. I'm looking for ways to force them to zero, i.e.
d1 = 97842111437.390000
d2 = 97842111437.390000
No. Counter-example: the two closest floating-point numbers to the rational
1.11111111111118
(which has 15 decimal digits) are
1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625
In other words, there is no floating-point number that starts with 1.1111111111111800.
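Those two neighbors can be printed directly, assuming IEEE 754 binary64 and a correctly rounded printf:
#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.11111111111118;  // parses to the lower of the two neighbors
    std::printf("%.52f\n", x);
    std::printf("%.52f\n", std::nextafter(x, 2.0));  // the next double up
}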
This question is a little malformed. The hardware stores the numbers in binary, not decimal, so in the general case you can't do precise math in base 10. Some decimal numbers (0.1 is one of them!) do not even have a non-repeating representation in binary. If you have precision requirements like this, where you care about the number being of known precision to exactly 15 decimal digits, you will need to pick another representation for your numbers.
No, but I wonder if this is relevant to any of your issues (GCC specific):
GCC Documentation
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
You should be able to directly modify the bits in your number by creating a union with a field for the floating-point number and an integral type of the same size. Then you can access the bits you want and set them however you want. Here is an example where I whack the sign bit; you can choose any field you want, of course.
#include <stdio.h>

union double_int {
    double fp;
    unsigned long long integer;
};

int main(int argc, const char *argv[])
{
    double my_double = 1325.34634;
    union double_int *my_union = (union double_int *)&my_double;

    /* print original numbers */
    printf("Float %f\n", my_double);
    printf("Integer %llx\n", my_union->integer);

    /* whack the sign bit to 1 */
    my_union->integer |= 1ULL << 63;

    /* print modified numbers */
    printf("Negative float %f\n", my_double);
    printf("Negative integer %llx\n", my_union->integer);

    return 0;
}
Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
If you're concerned about comparing numbers with ==, you really can't do that with floating-point numbers. Instead you want to see whether the numbers are close enough (say, within an epsilon() of each other).
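One common shape for such a comparison, as a sketch (the tolerance factor of 8 is an arbitrary choice here; the right value is application-specific):
#include <cmath>
#include <cstdio>
#include <limits>

bool nearly_equal(double a, double b,
                  double rel = 8 * std::numeric_limits<double>::epsilon())
{
    // relative comparison: |a - b| scaled by the larger magnitude
    return std::fabs(a - b) <= rel * std::fmax(std::fabs(a), std::fabs(b));
}

int main()
{
    double d1 = 97842111437.390091;  // the two values from the question
    double d2 = 97842111437.390076;
    std::printf("%s\n", nearly_equal(d1, d2) ? "close enough" : "different");
}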
Playing with the bits of the number directly isn't a great idea.