C++ double operator+ [duplicate] - c++

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Incorrect floating point math?
Float compile-time calculation not happening?
Strange stuff going on today, I'm about to lose it...
#include <iomanip>
#include <iostream>
using namespace std;
int main()
{
cout << setprecision(14);
cout << (1/9+1/9+4/9) << endl;
}
This code outputs 0 on MSVC 9.0 x64 and x86 and on GCC 4.4 x64 and x86 (default options and strict math...). And as far as I remember, 1/9+1/9+4/9 = 6/9 = 2/3 != 0

1/9 is zero, because 1 and 9 are integers and divided by integer division. The same applies to 4/9.
If you want to express floating-point division through arithmetic literals, you have to either use floating-point literals 1.0/9 + 1.0/9 + 4.0/9 (or 1/9. + 1/9. + 4/9. or 1.f/9 + 1.f/9 + 4.f/9) or explicitly cast one operand to the desired floating-point type (double) 1/9 + (double) 1/9 + (double) 4/9.
P.S. Finally my chance to answer this question :)
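As a minimal sketch of the difference described above (the function names are made up for illustration), the three spellings can be compared side by side:

```cpp
#include <cassert>

// Integer operands: each quotient is truncated to 0 before the additions happen.
int integer_sum() { return 1 / 9 + 1 / 9 + 4 / 9; }

// One floating-point operand per division is enough to get floating-point division.
double literal_sum() { return 1.0 / 9 + 1.0 / 9 + 4.0 / 9; }

// The cast applies to the numerator only: (double) 1 / 9 means ((double) 1) / 9.
double cast_sum() {
    return static_cast<double>(1) / 9 + static_cast<double>(1) / 9 +
           static_cast<double>(4) / 9;
}

void check_sums() {
    assert(integer_sum() == 0);
    assert(literal_sum() > 0.666 && literal_sum() < 0.667);
    assert(cast_sum() > 0.666 && cast_sum() < 0.667);
}
```

Calling check_sums() from main should pass silently; printing literal_sum() with setprecision(14) should show a value close to 2/3.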

Use a decimal point in your calculations to force floating point math optionally along with one of these suffixes: f l F L on your numbers. A number alone without a decimal point and without one of those suffixes is not considered a floating point literal.
C++03 2.13.3-1 on Floating literals:
A floating literal consists of an
integer part, a decimal point, a
fraction part, an e or E, an
optionally signed integer exponent,
and an optional type suffix. The
integer and fraction parts both
consist of a sequence of decimal (base
ten) digits. Either the integer part
or the fraction part (not both) can be
omitted; either the decimal point or
the letter e (or E) and the exponent
(not both) can be omitted. The integer
part, the optional decimal point and
the optional fraction part form the
significant part of the floating
literal. The exponent, if present,
indicates the power of 10 by which the
significant part is to be scaled. If
the scaled value is in the range of
representable values for its type, the
result is the scaled value if
representable, else the larger or
smaller representable value nearest
the scaled value, chosen in an
implementation-defined manner. The
type of a floating literal is double
unless explicitly specified by a
suffix. The suffixes f and F specify
float, the suffixes l and L specify
long double. If the scaled value is
not in the range of representable
values for its type, the program is
ill-formed.
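The rule quoted above can be checked at compile time. The following is only a sketch using std::is_same from <type_traits>; if any of the checks were wrong, the file would not compile:

```cpp
#include <type_traits>

// Literal types: no decimal point, exponent or suffix means int.
static_assert(std::is_same<decltype(1), int>::value, "plain 1 is an int");
static_assert(std::is_same<decltype(1.0), double>::value, "1.0 is a double");
static_assert(std::is_same<decltype(1.), double>::value, "fraction part may be omitted");
static_assert(std::is_same<decltype(1e0), double>::value, "an exponent alone is enough");
static_assert(std::is_same<decltype(1.0f), float>::value, "f suffix gives float");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L suffix gives long double");

// The type of a division follows from the types of its operands.
static_assert(std::is_same<decltype(1 / 9), int>::value, "int / int is int");
static_assert(std::is_same<decltype(1 / 9.0), double>::value, "int / double is double");
```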

They are all integers. So 1/9 is 0. 4/9 is also 0. And 0 + 0 + 0 = 0. So the result is 0. If you want fractions, cast your fractions to floats.

1/9(=0)+1/9(=0)+4/9(=0) = 0

well, in C++ (and many other languages), 1/9+1/9+4/9 is zero, because it is integer arithmetic.
You probably want to write 1/9.0+1/9.0+4/9.0

Unless you specifically specify the decimal, the numbers C++ uses are integers, so 1/9 = 4/9 = 0 and 0 + 0 + 0 = 0.
You should simply add the decimal 1.0 etc...

By the C rules of types, you're doing all integer math there. 1/9 and 4/9 are both truncated to 0 (as integers). If you wrote 1.0/9.0 etc, it would use double precision math and do what you want.

You might make it a habit to use more parentheses. They cost little time, make clear what you intend, and ensure you get what you wanted. Well mostly... ;)

Related

Given an `int A` Is there a strong guarantee that `A == (int) (double) A`?

I need a strong guarantee that int x = (int) std::round(y) will always give the correct results (y is finite and "humanly", e.g. -50000 to 50000).
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
To manage this, I reinvented the wheel with this ugly function:
template<std::integral S = int, std::floating_point T>
S roundi(T x)
{
S r = (S) x;
T r2 = std::fmod(x, 1);
if (r2 >= 0.5) return r + 1;
if (r2 <= -0.5) return r - 1;
return r;
}
But is this necessary? Or does casting from double to int use the last mantissa bit for rounding?
Assuming int is 32 bits wide and double is 64 bits wide (and assuming IEEE 754), all values of int are exactly representable in a double.
That means std::round(4.1) returns exactly 4. Nothing more nothing less. And casting that number to int is always 4 exactly.
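A small sketch of that claim, assuming the rounded value fits in int (round_to_int is a name made up for this example; std::lround is the standard function that rounds and converts in one step):

```cpp
#include <cassert>
#include <cmath>

// Rounds to the nearest integer, then converts; the caller must ensure the result fits in int.
int round_to_int(double y) { return static_cast<int>(std::round(y)); }

void check_round() {
    assert(std::round(4.1) == 4.0);          // exactly 4.0, so the cast cannot yield 3
    assert(round_to_int(4.1) == 4);
    assert(round_to_int(-4.6) == -5);
    assert(round_to_int(2.5) == 3);          // std::round rounds halfway cases away from zero
    assert(std::lround(49999.5) == 50000L);  // rounds and converts to long in one call
}
```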
std::round(4.1) can give 4.000000000001 or 3.99999999999. In later case, casting to int gives 3 right?
No, it cannot. The result of std::round is always an integer, exactly, with no rounding error.
I need strong guarantee that int x = (int) std::round(y) will give always the correct results (y is finite and "humanly" e.g. -50000 to
50000).
C++ inherits its floating-point model from C, and, per C 2018 5.2.4.2.2 12, double is capable of representing at least ten-digit integers, so [−50,000, +50,000] is well within its range. It is even within the range of float, which is capable of representing six-digit integers. This requirement extends back to C 1990.
Given an int A Is there a strong guarantee that A == (int) (double) A?
No, the C++ standard does not impose an upper limit on the width of int nor a relationship between the precision of int (number of bits it uses for the value, excluding the sign bit) and the precision of double (number of bits or other digits in its significand), so a C++ implementation may have an int with more precision than double.
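Since the guarantee depends on the implementation, one option is to let the compiler check it. This is a sketch only; int_fits_in_double and round_trips are names invented for the example, and the check assumes a binary floating-point format:

```cpp
#include <limits>

// digits is the number of value bits of int (31 for a 32-bit int) and the
// number of significand bits of double (53 for IEEE 754 binary64).
constexpr bool int_fits_in_double =
    std::numeric_limits<double>::radix == 2 &&
    std::numeric_limits<int>::digits <= std::numeric_limits<double>::digits;

// True when a survives the conversion to double and back unchanged.
constexpr bool round_trips(int a) {
    return a == static_cast<int>(static_cast<double>(a));
}
```

A line such as static_assert(int_fits_in_double, "int must fit in double"); would then turn the assumption into a build-time requirement.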
std::round(4.1) can give 4.000000000001 or 3.99999999999. In later case, casting to int gives 3 right?
That's true. 4.1 can be seen as 4.0 (which, being an integer, has an exact floating-point representation) plus 0.1, which is exactly 1/10. The problem you will have arises if you try to round a number like that to a fixed number of places after the decimal point (that is, to an integer multiple of 0.1, 0.01, 0.001, etc.).
If you are using decimal floating point (which C compilers normally don't), then you are lucky, as 0.1 is 10^(-1), which has an exact representation in the machine. But as a binary floating-point number, 0.1 has the infinite expansion 0.000110011001100110011001100...b, and depending on where you cut the number off you will get one value or another, but you will never get the exact decimal value (with a finite number of digits).
But that is not how round() behaves. Conceptually, for a positive number, it first adds 0.5 (which is exactly representable as a binary floating-point number; for a value like 4.1 any rounding error in the sum is far too small to matter) and then cuts off the fractional part (which is an exact operation), meaning that you always get an exact integer result (which is perfectly representable as a floating-point number, if the original number was). The rounding is equivalent to this set of operations:
(int)(4.1 + 0.5);
so you will get the integer part of 4.6 after adding the 0.5 (or of something like 4.60000000000000003 or 4.59999999999999998; either way it is truncated to 4.0, which is also exactly representable in binary floating point), so you will never get a wrong answer when rounding to an integer. You could only get a wrong result for something close to 4.5 (which might round to 4.0 instead of the correct 5.0), but .5 happens to be exactly 0.1b in binary, and so it is not affected.
Beware, though, that rounding to multiples of a negative power of ten (0.1, 0.01, ...) is not guaranteed to be exact, as none of those numbers is exactly representable in binary floating point. All of them have an infinite expansion in binary and, because the expansion is cut off at some point, each is stored as a value slightly above or below the true one (whichever is closer), so the rounded result is in general only an approximation.

What Are the Maximum Number of Base-10 Digits in the Integral Part of a Floating Point Number

I want to know if there is something in the standard, like a #define or something in numeric_limits which would tell me the maximum number of base-10 digits in the integral part of a floating point type.
For example, if I have some floating point type the largest value of which is: 1234.567. I'd like something defined in the standard that would tell me 4 for that type.
Is there an option to me doing this?
template <typename T>
constexpr auto integral_digits10 = static_cast<int>(log10(numeric_limits<T>::max())) + 1;
As Nathan Oliver points out in the comments, C++ provides std::numeric_limits<T>::digits10.
the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
Rick Regan explains the reasoning here. In summary, if your binary floating point format can store b bits in the significand, then you are guaranteed to be able to round-trip up to d decimal digits, where d is the largest integer such that
10^d < 2^(b-1)
In the case of an IEEE754 binary64 (the standard double in C++ on most systems nowadays), b = 53 and 2^(b-1) = 4,503,599,627,370,496, so the format is only guaranteed to be able to represent d = 15 digits.
However, this result holds for all digits, whereas you only ask about the integral part. Even so, we can easily find a counterexample by choosing x = 2^b + 1, which is the smallest positive integer not representable by the format: for binary64 this is 9,007,199,254,740,993, which also happens to have 16 digits, and so will need to be rounded.
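The counterexample can be reproduced with a short sketch, assuming double is IEEE 754 binary64 (two_pow_53 and through_double are names chosen for the example):

```cpp
#include <cstdint>

constexpr std::int64_t two_pow_53 = std::int64_t{1} << 53;  // 9,007,199,254,740,992

// Converts to double and back; the result differs from the input
// whenever the input is not exactly representable as a double.
constexpr std::int64_t through_double(std::int64_t x) {
    return static_cast<std::int64_t>(static_cast<double>(x));
}
```

Under that assumption, through_double(two_pow_53) returns its argument unchanged, while through_double(two_pow_53 + 1) returns a neighbouring even value instead.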
The value that you are looking for is max_exponent10 which:
Is the largest positive number n such that 10^n is a representable finite value of the floating-point type
Because of this relationship:
log10(x) = n
10^n = x
Your calculation finds n the way the first equation does:
log10(numeric_limits<T>::max())
By the definition of max_exponent10, 10^(n + 1) would be larger than numeric_limits<T>::max(), but 10^n is less than or equal to numeric_limits<T>::max(). So numeric_limits<T>::max_exponent10 is what you're looking for.
Note that you will still need the + 1 as in your example, to account for the 1's place (because log10(1) = 0). So the number of base-10 digits required to represent numeric_limits<T>::max() will be:
numeric_limits<T>::max_exponent10 + 1
If you feel like validating that by hand you can check here:
http://coliru.stacked-crooked.com/a/443e4d434cbcb2f6
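For reference, the same idea written as a variable template in the style of the question (C++14 or later; a sketch, not a standard facility):

```cpp
#include <limits>

// Number of base-10 digits in the integral part of the largest finite value of T.
template <typename T>
constexpr int integral_digits10 = std::numeric_limits<T>::max_exponent10 + 1;
```

On an implementation with IEEE 754 types this should give 39 for float (maximum about 3.4e38) and 309 for double (maximum about 1.8e308). Unlike the log10 version, it needs no <cmath> call and is usable in constant expressions.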

How is ++ defined on a large floating point [duplicate]

This question already has an answer here:
maximum value in float
(1 answer)
Closed 7 years ago.
So I've been looking at IEEE754 floating point double. (My C++ compiler uses that type for a double).
Consider this snippet:
// 9007199254740992 is the 53rd power of 2.
// 590295810358705700000 is the 69th power of 2.
for (double f = 9007199254740992; f <= 590295810358705700000; ++f){
/* what is f?*/
}
Presumably f increments in even steps up to the 54th power of 2, due to rounding up?
Then after that, nothing happens due to rounding down?
Is that correct? Is it even well-defined?
++f is essentially the same as f = f + 1, ignoring the fact that ++f is an expression that yields a value.
Now, for floating point values, the issue of representability comes into play. It may be that f + 1 is not representable. In which case, f + 1 will evaluate to the nearest representable value to the true value of f + 1. In case there are two equally near candidates for nearest representable value, round to even is used.
This is covered in the Operations section of What Every Computer Scientist Should Know About Floating-Point Arithmetic:
The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even).
So, in your example, for sufficiently large values of f, you will find that f == f + 1.
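A sketch of what that means for the values in the question, assuming IEEE 754 binary64 with the default round-to-nearest-even mode and arithmetic evaluated in double precision (next_by_increment is a name made up for the example):

```cpp
#include <cassert>

// Returns f + 1 rounded to double, which is the value ++f would store.
double next_by_increment(double f) { return f + 1; }

void check_increment() {
    const double two_pow_53 = 9007199254740992.0;
    // Below 2^53 every integer is representable, so the increment is exact.
    assert(next_by_increment(two_pow_53 - 1) == two_pow_53);
    // 2^53 + 1 is a tie between 2^53 and 2^53 + 2; round-to-even picks 2^53.
    assert(next_by_increment(two_pow_53) == two_pow_53);
    // 2^53 + 3 is a tie between 2^53 + 2 and 2^53 + 4; round-to-even picks 2^53 + 4.
    assert(next_by_increment(two_pow_53 + 2) == two_pow_53 + 4);
}
```

Under these assumptions the loop in the question does not advance in even steps: it starts at exactly 2^53 and stays there.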
Yes, because of rounding this loop will never end. I hope the reason is clear to you (since you are familiar with https://en.wikipedia.org/wiki/IEEE_floating_point), but let me describe it briefly for the impatient audience.
We can think of floating point as a special representation of numbers imposed by the compiler/FPU/standard. As a simple example, consider:
20000
2e4
0.2e5
All three forms represent the same number. The last two are in "scientific" form, but which is best? In this analogy, the last one, because we can save space by omitting the leading 0 and writing just .2e5. This decimal analogy is very close to the binary representation, which has room for a mantissa (.2) and an exponent (5).
Now let's do the same for 20000.00000000001
0.2000000000000001e5
As we can see, the mantissa grows, and at some point it will no longer fit in the fixed amount of memory. Instead of raising an exception we sacrifice precision, which (just as an example) gives us 0.2e5.
For bigger numbers (as in the question) we lose precision too.
9007199254740992 may be represented as 0.9e16, and when 1 is added nothing happens.
So f = f + 1 creates an infinite loop.
Since f++ has the same effect on f as f = f + 1, as pointed out in the comments, and as I tested myself, f == f+1 (!!) for a large f that depends on the platform. An explanation is here (for small numbers, but the principle is the same) http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BinMath/addFloat.html
Here's how to add floating point numbers.
First, convert the two representations to scientific notation. Thus,
we explicitly represent the hidden 1. In order to add, we need the
exponents of the two numbers to be the same. We do this by rewriting
Y. This will result in Y being not normalized, but value is equivalent
to the normalized Y. Add x - y to Y's exponent. Shift the radix point
of the mantissa (signficand) Y left by x - y to compensate for the
change in exponent. Add the two mantissas of X and the adjusted Y
together. If the sum in the previous step does not have a single bit
of value 1, left of the radix point, then adjust the radix point and
exponent until it does. Convert back to the one byte floating point
representation.
In the process of converting the number to the same exponent, due to precision, 1 is rounded to 0, and hence f == f + 1.
According to IEEE754, after the sum the number is rounded to match the double format, and due to the rounding operation, f==f+1.
I don't know if there are problems where looping over large floating point values by increment of 1 is a meaningful solution, but people may be stumbling on this question looking for a workaround for their neverending loop. Therefore, even though the question only asks how the addition is defined by the standard, I'll propose a workaround.
Indeed, for large values of f, f + 1 == f is true, and a loop that relies on that increment will never terminate.
Suppose it is acceptable for f to be incremented by the smallest number e greater than 1 for which f + e > f is representable in the floating-point type. In that case, the following workaround, in which the loop always terminates, could be OK:
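#include <algorithm>  // std::max
#include <cmath>      // std::nextafter
#include <limits>     // std::numeric_limits
// use template, or overloads for different floating-point types
// Adds l and r; if rounding swallows the sum, steps to the next representable value instead.
template<class T>
T add_s(T l, T r) {
    T result = l + r;
    T greater = std::max(l, r);
    if (result == greater)
        return std::nextafter(greater, std::numeric_limits<T>::max());
    return result;
}
// ...
for (double f = /*...*/; f < /*...*/; f = add_s(f, 1.0))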
That said, adding tiny floats to huge floats will result in an uncontrollable accumulation of errors. If that's not OK for you, then you need arbitrary precision math, not floating point.

How to shift a floating-point value to the nearest one that can be represented exactly in a specific number of decimal places?

Is there an algorithm in C++ that will allow me to, given a floating-point value V of type T (e.g. double or float), returns the closest value to V in a given direction (up or down) that can be represented exactly in less than or equal to a specified number of decimal places D ?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @patricia-shanahan I'll try describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format) , I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
<< std::setprecision(maxD)
<< F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;
public class Test {
public static void main(String args[]) {
testIt(2, 0.000001);
testIt(10, 0.000001);
testIt(6, 670000.08267799998);
}
private static void testIt(int d, double in) {
System.out.print(in + " to " + d + " digits");
System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
System.out.println(" Down: "
+ new BigDecimal(roundDownExact(d, in)).toString());
}
public static double roundUpExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.ceil(roundee);
return roundee / factor;
}
public static double roundDownExact(int d, double in) {
double factor = Math.pow(2, d);
double roundee = factor * in;
roundee = Math.floor(roundee);
return roundee / factor;
}
}
In general, decimal fractions are not precisely representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜), because all binary fractions are precisely representable as decimal fractions. (That's because 2 is a factor of 10, but 10 is not a factor of 2, or any power of two.) But if a number is not a multiple of some power of 2, its binary representation will be an infinitely-long cyclic sequence, like the representation of ⅓ in decimal (.333....).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
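In C++ the same two constants are available as std::numeric_limits<double>::digits10 and std::numeric_limits<double>::max_digits10. The sketch below illustrates the double-to-text-to-double direction; through_decimal_text is a name made up for the example, and the check assumes the library's printf and strtod conversions are correctly rounded:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>

// Prints v with max_digits10 (17 for binary64) significant digits and parses the text back.
double through_decimal_text(double v) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", std::numeric_limits<double>::max_digits10, v);
    return std::strtod(buf, nullptr);
}

void check_round_trip() {
    const double v = 670000.08267799998;
    assert(through_decimal_text(v) == v);      // 17 digits identify the double uniquely
    assert(through_decimal_text(0.1) == 0.1);
}
```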
One way to provide something close to the question would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating point number if that floating point number is the floating point number which comes closest to the value of the decimal number. One way to find that floating point number would be to use scanf or strtod to convert the decimal number to a floating point number, and then try the floating point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
Total re-write.
Based on OP's new requirement and using power-of-2 as suggested by @Patricia Shanahan, simple C solution:
double roundedV = ldexp(round(ldexp(V, D)),-D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D); // at or just less
The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
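The same three lines can be wrapped into C++ helpers. This is a sketch with invented names (round_up_pow2, round_down_pow2); note that the grid is multiples of 2^-D, not of 10^-D, so the results match the Java output quoted earlier rather than 670000.082678:

```cpp
#include <cassert>
#include <cmath>

// Smallest multiple of 2^-d that is greater than or equal to v.
double round_up_pow2(double v, int d) {
    return std::ldexp(std::ceil(std::ldexp(v, d)), -d);
}

// Largest multiple of 2^-d that is less than or equal to v.
double round_down_pow2(double v, int d) {
    return std::ldexp(std::floor(std::ldexp(v, d)), -d);
}

void check_pow2_rounding() {
    const double v = 670000.08267799998;
    assert(round_up_pow2(v, 6) == 670000.09375);     // 42880006 / 64
    assert(round_down_pow2(v, 6) == 670000.078125);  // 42880005 / 64
}
```

Scaling by a power of two with ldexp only changes the exponent, so for ordinary (non-subnormal, non-overflowing) values the only rounding happens in ceil or floor.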
In C++ integers must be represented in binary, but floating point types can have a decimal representation.
If FLT_RADIX from <float.h> (<cfloat> in C++) is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …

C++ integer floor function

I want to implement greatest integer function. [The "greatest integer function" is a quite standard name for what is also known as the floor function.]
int x = 5/3;
My question is with greater numbers could there be a loss of precision as 5/3 would produce a double?
EDIT: Greatest integer function is integer less than or equal to X.
Example:
4.5 = 4
4 = 4
3.2 = 3
3 = 3
What I want to know is 5/3 going to produce a double? Because if so I will have loss of precision when converting to int.
Hope this makes sense.
You will lose the fractional portion of the quotient. So yes, with greater numbers you will have more relative precision, such as compared with 5000/3000.
However, 5 / 3 will return an integer, not a double. To force it to divide as double, typecast the dividend as static_cast<double>(5) / 3.
Integer division gives integer results, so 5 / 3 is 1 and 5 % 3 is 2 (the remainder operator). However, this doesn't necessarily hold with negative numbers. In the original C++ standard, -5 / 3 could be either -1 (rounding towards zero) or -2 (the floor), but -1 was recommended. In the latest C++0x draft (which is almost certainly very close to the final standard), it is -1, so finding the floor with negative numbers is more involved.
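A short sketch of those rules as they apply in C++11 and later, together with one possible way to get the floor without going through floating point (floor_div is a hypothetical helper written for this example, not a standard function):

```cpp
#include <cassert>
#include <cmath>

// Floor of num / den for integers: truncate, then step down by one when
// there is a remainder and the operands have opposite signs.
constexpr int floor_div(int num, int den) {
    return num / den - ((num % den != 0) && ((num < 0) != (den < 0)) ? 1 : 0);
}

void check_division() {
    assert(5 / 3 == 1 && 5 % 3 == 2);
    assert(-5 / 3 == -1 && -5 % 3 == -2);   // truncation toward zero
    assert(std::floor(-5.0 / 3) == -2.0);   // the floor is one lower
    assert(floor_div(-5, 3) == -2);
    assert(floor_div(5, 3) == 1);
}
```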
5/3 will always produce 1 (an integer), if you do 5.0/3 or 5/3.0 the result will be a double.
As far as I know, there is no predefined function for this purpose.
It might be necessary to use such a function, if for some reason floating-point calculations are out of question (e.g. int64_t has a higher precision than double can represent without error)
We could define this function as follows:
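#include <cstdlib>  // std::ldiv, std::ldiv_t
inline long
floordiv (long num, long den)
{
    // the XOR is positive when the signs match: truncation is already the floor
    if (0 < (num^den))
        return num/den;
    else
    {
        // signs differ (or num == den): step down by one if there is a remainder
        std::ldiv_t res = std::ldiv(num,den);
        return (res.rem)? res.quot-1
                        : res.quot;
    }
}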
The idea is to use the normal integer division, but adjust for negative results to match the behaviour of the double floor(double) function. The point is to truncate always towards the next lower integer, irrespective of the position of the zero point. This can be very important if the intention is to create even sized intervals.
Timing measurements show that this function here only creates a small overhead compared with the built-in / operator, but of course the floating point based floor function is significantly faster....
Since in C and C++, as others have said, / applied to integers is integer division, it will return an int. For non-negative operands this is the floor of the exact quotient (C and C++ truncate), so, basically, 5/3 is exactly what you want.
It may get a little weird with negatives: since C++11 (and C99), -5/3 is -1 (truncation toward zero), whereas the floor is -2, which may or may not be what you want...