How to loop over exact representations of floating point numbers? - c++

I am trying to loop exactly from one floating point number to the next. Say I need to loop from std::numeric_limits<float>::epsilon() to 1, which are both exactly representable IEEE 754 numbers. My code is:
using nld = std::numeric_limits<float>;
auto h = nld::epsilon();
for (; h < 1; h = std::nextafter(h, 1)) {
    std::cerr << "h: " << h << std::endl;
}
which loops indefinitely because h is exactly representable, so nextafter keeps returning it. I also know that adding machine epsilon to h in a loop will not cut it: floating point numbers are not equally spaced. How do I loop over the exact representations of IEEE 754 numbers?
The unequal-spacing problem presents itself here:
using nld = std::numeric_limits<float>;
auto h = nld::epsilon();
for (; h < 4; h += nld::epsilon()) {
    if (h == h + nld::epsilon()) {
        std::cerr << "h: " << h << std::endl;
    }
}
which keeps printing 2 for me: once h reaches 2, the spacing between adjacent floats is twice the machine epsilon, so h += nld::epsilon() rounds back to 2 and h never advances.

Per the comments:
The approach with nextafter is exactly what you should be doing. However, it has some complications that may lead to unexpected results.
Quoting cppreference std::nextafter:
float nextafter( float from, float to ); (1) (since C++11)
double nextafter( double from, double to ); (2) (since C++11)
long double nextafter( long double from, long double to ); (3) (since C++11)
Promoted nextafter( Arithmetic from, Arithmetic to ); (4) (since C++11)
...
4) A set of overloads or a function template for all combinations of arguments of arithmetic type not covered by (1-3). If any argument has integral type, it is cast to double. If any argument is long double, then the return type Promoted is also long double, otherwise the return type is always double.
Since your to is 1, of type int, you get overload version 4, with a return type of double. Now, it's entirely possible that given a float f, (float)nextafter((double)f, 1) is exactly equal to the original f: it's rather likely that the next representable number in type double cannot be represented in float, and that the conversion back to float rounds down.
The only overload that returns float is the one where to has type float. To use that overload, use 1.0f instead of 1.
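With that fix applied, here is a minimal sketch of the corrected loop (note it prints roughly 190 million lines, since that is how many floats lie in [epsilon, 1)):
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    using nld = std::numeric_limits<float>;
    // Both arguments are float, so overload (1) is selected and each
    // iteration advances by exactly one representable value.
    for (float h = nld::epsilon(); h < 1.0f; h = std::nextafter(h, 1.0f)) {
        std::cerr << "h: " << h << '\n';
    }
}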

Treating them as integers will work for normal positive floats;
negative floats will step in the wrong direction, and denormals and zero may be a special case.
eg: for positive normal floats:
float nextfloat(float in)
{
    union { float f; uint32_t i; } a;
    a.f = in;
    a.i++;
    return a.f;
}
This relies on the floats having the same endianness and size as the integers; here I pair float and uint32_t, but you could do the same for double and uint64_t. Note that reading a union member other than the one last written is undefined behavior in C++ (though allowed in C), so testing its operation should probably be part of the build process.
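If the union punning is a concern, a memcpy-based variant is well-defined in both C and C++ (std::bit_cast does the same job in C++20); a sketch under the same positive-normal-float assumption:
#include <cstdint>
#include <cstring>

// Next representable float above a positive, finite input.
// Same caveats as above: negative values, zero/denormals,
// infinities and NaNs would need separate handling.
float nextfloat(float in)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t i;
    std::memcpy(&i, &in, sizeof i); // well-defined type punning
    ++i;
    std::memcpy(&in, &i, sizeof in);
    return in;
}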

Related

When I use the pow(base, exponent) function in C++ and the exponent is a fraction, I always get 1 as output [duplicate]

I was writing this code:
public static void main(String[] args) {
    double g = 1 / 3;
    System.out.printf("%.2f", g);
}
The result is 0. Why is this, and how do I solve this problem?
The two operands (1 and 3) are integers, therefore integer arithmetic (division here) is used. Declaring the result variable as double just causes an implicit conversion to occur after division.
Integer division of course returns the true result of division rounded towards zero. The result of 0.333... is thus rounded down to 0 here. (Note that the processor doesn't actually do any rounding, but you can think of it that way still.)
Also, note that if both operands are given as floating-point literals, 3.0 and 1.0, or even just the first, then floating-point arithmetic is used, giving you 0.333....
1/3 uses integer division as both sides are integers.
You need at least one of them to be float or double.
If you are entering the values in the source code like your question, you can do 1.0/3 ; the 1.0 is a double.
If you get the values from elsewhere you can use (double) to turn the int into a double.
int x = ...;
int y = ...;
double value = ((double) x) / y;
Write the constants explicitly as doubles:
double g = 1.0/3.0;
This happens because Java uses the integer division operation for 1 and 3 since you entered them as integer constants.
Because you are doing integer division.
As #Noldorin says, if both operands are integers, then integer division is used.
The result 0.333... can't be represented as an integer, therefore only the integer part (0) is assigned to the result.
If either operand is a double / float, then floating-point arithmetic takes place. But you'll have the same problem if you store the result back into an int (note that in Java the cast is required for this to compile):
int n = (int) (1.0 / 3.0);
The easiest solution is to just do this
double g = (double) 1 / 3;
What this does, since you didn't write 1.0 / 3.0, is manually convert the first operand to data type double before the division, so Java no longer performs integer division. This is what is called a cast operator.
Here we cast only one operand, and this is enough to avoid integer division (rounding towards zero)
The result is 0. Why is this, and how do I solve this problem?
TL;DR
You can solve it by doing:
double g = 1.0/3.0;
or
double g = 1.0/3;
or
double g = 1/3.0;
or
double g = (double) 1 / 3;
The last of these options is required when you are using variables e.g. int a = 1, b = 3; double g = (double) a / b;.
A more complete answer
double g = 1 / 3;
This results in 0 because:
first, the dividend is smaller than the divisor;
both operands are of type int, therefore the result is also an int (JLS 5.6.2), which naturally cannot represent a floating-point value such as 0.333333...;
"Integer division rounds toward 0." (JLS 15.17.2)
Why do double g = 1.0/3.0; and double g = ((double) 1) / 3; work?
From Chapter 5. Conversions and Promotions one can read:
One conversion context is the operand of a numeric operator such as +
or *. The conversion process for such operands is called numeric
promotion. Promotion is special in that, in the case of binary
operators, the conversion chosen for one operand may depend in part on
the type of the other operand expression.
and 5.6.2. Binary Numeric Promotion
When an operator applies binary numeric promotion to a pair of
operands, each of which must denote a value that is convertible to a
numeric type, the following rules apply, in order:
If any operand is of a reference type, it is subjected to unboxing
conversion (§5.1.8).
Widening primitive conversion (§5.1.2) is applied to convert either or
both operands as specified by the following rules:
If either operand is of type double, the other is converted to double.
Otherwise, if either operand is of type float, the other is converted
to float.
Otherwise, if either operand is of type long, the other is converted
to long.
Otherwise, both operands are converted to type int.
You should use
double g=1.0/3;
or
double g=1/3.0;
Integer division returns an integer.
Make the 1 a float and float division will be used:
public static void main(String d[]) {
    double g = 1f / 3;
    System.out.printf("%.2f", g);
}
The conversion in Java is quite simple but needs some understanding. As explained in the JLS for integer operations:
If an integer operator other than a shift operator has at least one operand of type long, then the operation is carried out using 64-bit precision, and the result of the numerical operator is of type long. If the other operand is not long, it is first widened (§5.1.5) to type long by numeric promotion (§5.6).
And an example is always the best way to translate the JLS ;)
int + long -> long
int(1) + long(2) + int(3) -> long(1+2) + long(3)
Otherwise, the operation is carried out using 32-bit precision, and the result of the numerical operator is of type int. If either operand is not an int, it is first widened to type int by numeric promotion.
short + int -> int + int -> int
A small example using Eclipse to show that even an addition of two shorts will not be that easy:
short s = 1;
s = s + s; // <- compile error
// possible loss of precision
//   required: short
//   found:    int
This requires a cast, with a possible loss of precision.
The same is true for the floating point operators
If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (§5.1.5) to type double by numeric promotion (§5.6).
So the float is promoted to double.
And mixing integer and floating-point values results in floating-point values, as said:
If at least one of the operands to a binary operator is of floating-point type, then the operation is a floating-point operation, even if the other is integral.
This is true for binary operators but not for "Assignment Operators" like +=
A simple working example is enough to prove this
int i = 1;
i += 1.5f;
The reason is that an implicit cast is done here; this will be executed like:
i = (int) (i + 1.5f)
i = (int) 2.5f
i = 2
1 and 3 are integer constants, so Java does an integer division, whose result is 0. If you want to write double constants, you have to write 1.0 and 3.0.
I did this.
double g = 1.0/3.0;
System.out.printf("%gf", g);
Use .0 when doing double calculations, or else Java will assume you are using integers. If a calculation uses any double values, then the output will be a double value. If they are all integers, then the output will be an integer.
Because it treats 1 and 3 as integers, the result is truncated down to 0 so that it is an integer.
To get the result you are looking for, explicitly tell java that the numbers are doubles like so:
double g = 1.0/3.0;
(1/3) means integer division; that's why you cannot get a decimal value from this division. To solve this problem, use:
public static void main(String[] args) {
    double g = 1.0 / 3;
    System.out.printf("%.2f", g);
}
public static void main(String[] args) {
    double g = 1 / 3;
    System.out.printf("%.2f", g);
}
Since both 1 and 3 are ints, the result is not rounded but truncated, so you lose the fraction and keep only the whole part.
To avoid this, write at least one of the numbers 1 or 3 in decimal form, 1.0 and/or 3.0.
My code was:
System.out.println("enter weight: ");
int weight = myObj.nextInt();
System.out.println("enter height: ");
int height = myObj.nextInt();
double BMI = weight / (height * height);
System.out.println("BMI is: " + BMI);
If the user enters weight (numerator) = 5 and height = 7,
BMI is 0: the denominator is greater than the numerator, and integer division drops the fractional part (5 / (7 * 7) = 5/49, which truncates to 0).
Solution:
Option 1:
double BMI = (double) weight / ((double) height * (double) height);
Option 2:
double BMI = (double) weight / (height * height);
I noticed that this is somehow not mentioned in the many replies, but you can also do 1.0 * 1 / 3 to get floating-point division. This is more useful when you have variables that you can't just add .0 after, e.g.:
import java.io.*;
public class Main {
    public static void main(String[] args) {
        int x = 10;
        int y = 15;
        System.out.println(1.0 * x / y);
    }
}
Do "double g=1.0/3.0;" instead.
Many others have failed to point out the real issue:
An operation on integers alone yields an integer result.
This necessarily means that a fractional quotient is truncated (the decimal part is lopped off).
What is casting (typecasting / type conversion) you ask?
It varies with the implementation of the language, but Wikipedia has a fairly comprehensive view, and it talks about coercion as well, which is a pivotal piece of information in answering your question.
http://en.wikipedia.org/wiki/Type_conversion
Try this out:
public static void main(String[] args) {
    double a = 1.0;
    double b = 3.0;
    double g = a / b;
    System.out.printf("" + g);
}

What does if( c == (int) c) mean?

This is code for Pythagorean triplets. Can somebody explain the working of the if statement below?
#include <math.h>
#include <stdio.h>

int main()
{
    int a, b;
    float c;
    // calculate the other side using the Pythagorean theorem:
    //   a*a + b*b = c*c
    //   c = sqrt(a*a + b*b)
    // maximum length should be equal to 30
    for (a = 1; a <= 30; a++)
    {
        for (b = 1; b <= 30; b++)
        {
            c = sqrt(a*a + b*b);
            if (c == (int)c)
            {
                printf("(%d, %d, %d)\n", a, b, (int)c);
            }
        }
    }
}
It is checking whether the value is integral. c is a float, and the cast drops any fractional part. The original value of c is then checked against the result of that cast. If they match, the original c was an integral value. (Note that even with this cast there are plenty of integral values that will fail this test: any value of c larger than std::numeric_limits<int>::max().)
This if statement is not guaranteed to work, because floating-point representations (float, double, etc.) in the processor are different from integer representations (int, char, etc.). In the general case, this kind of comparison is error-prone.
Floating-point numbers are often impossible to represent exactly as an integer, especially when they result from some kind of calculation (see the wiki and this question), whereas integers are always represented exactly.
In your case, one way to make the if statement correct is to write it like this:
if (fabs(c - roundf(c)) < eps) {
    ...
}
where eps is the required accuracy. For example,
float eps = 1e-5f; // note: float has only about 7 significant digits, so a much smaller tolerance (e.g. 1e-12) would behave like exact equality
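Putting it together with the triplet code above, a minimal compilable sketch; the tolerance here is my own choice and should be scaled to float's precision and to the magnitude of c:
#include <cmath>
#include <cstdio>

int main() {
    for (int a = 1; a <= 30; ++a) {
        for (int b = 1; b <= 30; ++b) {
            float c = std::sqrt((float)(a * a + b * b));
            // Assumed tolerance: float carries about 7 significant digits,
            // so use a relative bound rather than a tiny absolute one.
            float eps = 1e-5f * c;
            if (std::fabs(c - std::round(c)) < eps) {
                std::printf("(%d, %d, %d)\n", a, b, (int)std::round(c));
            }
        }
    }
}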

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions: see how the floating-point math behaves. This is not a 100% correct way to analyze such a question, but it gives a general idea.
Let's generate random numbers. Calculate v0 = pow(a/b, n) and v1 = pow(b/a, -n) in float precision. Also calculate ref = pow(a/b, n) in double precision, and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref can be considered the best possible value; this is true for IEEE-754 most of the time). Then sum the differences v0-ref and v1-ref, where each difference is measured as "the number of floating-point numbers between v and ref".
Note that the results may depend on the ranges of a, b and n (and on the quality of the random generator; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the float and double implementations of division and pow are of the same kind and have the same characteristics.
For my computer, the summed differences are 2604828 and 2603684, which means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// distance between two floats, measured in representable values
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f * rand() / RAND_MAX;
        float b = 1.0f * rand() / RAND_MAX;
        float n = 4.0f * rand() / RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a / b, n);
        float v1 = std::pow(b / a, -n);
        float ref = std::pow((double)a / b, n); // double-precision reference
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: the error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger the |log2(x)|, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and the same z, and similar errors in z. It is a question of how x and y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
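As a rough illustration of that select-values point, one can print both forms with the example numbers above (a sketch; whether and where the digits differ depends on the platform's pow):
#include <cmath>
#include <cstdio>

int main() {
    // 7777777/4 divides exactly (the divisor is a power of two),
    // while 4/7777777 must round, so the first form starts from an
    // exact quotient. The exponent value is arbitrary.
    double p = 1.51231;
    std::printf("%.20f\n", std::pow(7777777.0 / 4.0, -p));
    std::printf("%.20f\n", std::pow(4.0 / 7777777.0, p));
}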
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))^x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))^(−x)•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x•(1+e0)^x•(1+e1) and (a/b)^x•(1+e2)^(−x)•(1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^(−x). The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^(−x). The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^(−x), (1−u/2)^(−x)]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/2^2)^x produces ((1−u^2/2^2)/(1−u/2))^x − ((1−u^2/2^2)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/2^2 < 1, so (1−u^2/2^2)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluating rounding errors like in your case, it might be useful to use a multi-precision library, such as Boost.Multiprecision. Then you can compare results for various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}

int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;
    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;
    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;
    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot draw any generic conclusions here.
Also, note that your input numbers are not exactly representable in the IEEE 754 double-precision floating-point type (none of them are). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.

machine epsilon - long double in c++

I wanted to calculate the machine epsilon, the smallest possible number e that gives 1 + e > 1, using different C++ data types: float, double and long double.
Here's my code:
#include <cstdio>
template<typename T>
T machineeps() {
    T epsilon = 1;
    T expression;
    do {
        epsilon = epsilon / 2;
        expression = 1 + epsilon;
    } while (expression > 1);
    return epsilon;
}

int main() {
    auto epsf = machineeps<float>();
    auto epsd = machineeps<double>();
    auto epsld = machineeps<long double>();
    std::printf("epsilon float: %22.17e\nepsilon double: %22.17e\nepsilon long double: %Le\n", epsf, epsd, epsld);
    return 0;
}
But I get this strange output:
epsilon float: 5.96046447753906250e-008
epsilon double: 1.11022302462515650e-016
epsilon long double: -0.000000e+000
The values for float and double are what I was expecting, but I cannot explain the long double behavior.
Can somebody tell me what went wrong?
I cannot reproduce your results. I get:
epsilon long double: 5.421011e-20
Anyway, logically, the code should be something like:
template<typename T>
T machineeps() {
    T epsilon = 1, prev;
    T expression;
    do {
        prev = epsilon;
        epsilon = epsilon / 2;
        expression = 1 + epsilon;
    } while (expression > 1);
    return prev; // <-- `1 + prev` yields a result different from one
}
On my platform it produces values similar to std::numeric_limits<T>::epsilon():
epsilon float: 1.19209289550781250e-07
epsilon double: 2.22044604925031308e-16
epsilon long double: 1.084202e-19
(note the different order of magnitude)
There are several things going on here.
First, floating-point math is often done at the maximum available precision, regardless of the actual declared type of the floating-point variable. So, for example, arithmetic on floats is usually done with 80 bits of precision on Intel hardware (Java originally banned this, requiring all floating-point math to be done at the exact precision of the type; this killed floating-point performance, and they quickly abandoned that rule). Storing the result of a floating-point calculation is supposed to truncate the value to the appropriate type, but by default most compilers ignore this. You can tell your compiler not to allow that; the switch for that depends on the compiler. As is, you can’t rely on the result that’s being calculated here.
Second, the loop in the code terminates when the value of 1 + epsilon is not greater than 1, so the returned value will be less than the true value of epsilon.
Third, coupled with the second point, some floating-point implementations don't do subnormal values; once the exponent becomes smaller than the smallest that can be represented, the value is 0. That may be what you're seeing here with the long double value. IEEE floating-point handles underflow less abruptly: once you hit that minimum exponent, smaller values gradually lose precision. There are quite a few values between the smallest normalized value and 0.
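To see those ranges concretely, a small sketch printing the relevant limits from std::numeric_limits:
#include <cstdio>
#include <limits>

int main() {
    using ld = std::numeric_limits<long double>;
    // epsilon() is the spacing just above 1; min() is the smallest
    // normalized value; denorm_min() is the smallest subnormal.
    std::printf("epsilon:     %Lg\n", ld::epsilon());
    std::printf("min normal:  %Lg\n", ld::min());
    std::printf("denorm min:  %Lg\n", ld::denorm_min());
}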

Loss of precision?

I have a function like:
double calc( double x );
Do I lose precision with any of these expressions:
double y1 = calc( 1.0 / 3 );
double y2 = calc( 1.0 / 3.0 );
Are these more accurate:
double y3 = calc( static_cast<double>(1) / 3 );
double y4 = calc( static_cast<double>(1) / static_cast<double>(3) );
EDIT
Sorry, I had the wrong numbers here.
But what I meant was: is 1.0 interpreted as float or as double, and is that always the case, or does it depend on some compiler flags? If it is a float, then 1.0/3 would also be a float and only afterwards be converted into a double. If that were the case, it would cause a loss of precision, wouldn't it?
EDIT 2
I have tested this with g++, and as it turns out, if the program is compiled with the -fsingle-precision-constant flag, you do lose precision.
#include <iostream>
#include <limits>
#include <typeinfo>
long double calc(long double val)
{
    return val;
}

int main() {
    std::cout.precision(std::numeric_limits<long double>::max_digits10);
    std::cout << calc(1.0/3.0) << std::endl;
    std::cout << calc(static_cast<float>(1)/3) << std::endl;
    std::cout << calc(static_cast<double>(1)/3) << std::endl;
    std::cout << calc(static_cast<long double>(1)/3) << std::endl;
    std::cout << typeid(1.0).name() << std::endl;
    return 0;
}
The results are,
0.333333343267440795898
0.333333343267440795898
0.33333333333333331483
0.333333333333333333342
f
So, I decided to use static_cast< long double >(1) / 3 to be on the safe side.
None of the alternatives you show gives any loss of precision [at least not if the compiler does what the standard says it should do]. For every binary operator where one operand is a double, the other side automatically gets promoted to double [and in general, when the two operands have different types, the narrower one is promoted to the wider one].
In particular, integer values [below the mantissa's number of bits] are always represented precisely.
[Obviously, we have no idea what calc does with your input; that may be the source of any and all kinds of errors. But I'm presuming you are actually asking whether 3.0/8.0 will always be 0.375 in the cases you have suggested; of course 3/8 will result in zero, since that is integer division on both sides.]
edit in response to the original question being edited:
If the code says 1. or 1.0 or 0.3 or .3, it is a double. If you write 0.5f, it is a float. As per the rules above, 1.0/3 will be computed as double(1.0)/double(3.0).
It is technically possible for a compiler to support only one floating-point type, with three different ways of writing it: the C and C++ standards have no requirement for double to have more bits than float.
No. The constant expression for y1 is implicitly converted into a double. The constant expression for y2 is already a double.
What you are doing in y3 and y4 is defining a constant integer value and casting it to a double, when you could simply define a double-precision floating-point constant, as you have already done.
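A minimal sketch to confirm this (variable names mirror the question's y1…y4; all four expressions perform the same double division, so every comparison prints true):
#include <iostream>

int main() {
    double y1 = 1.0 / 3;                  // int operand promoted to double
    double y2 = 1.0 / 3.0;
    double y3 = static_cast<double>(1) / 3;
    double y4 = static_cast<double>(1) / static_cast<double>(3);
    std::cout << std::boolalpha
              << (y1 == y2) << ' ' << (y2 == y3) << ' ' << (y3 == y4) << '\n';
}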