Rounding a float upward to an integer, how reliable is that? - c++

I've seen static_cast<int>(std::ceil(floatValue)); before.
My question though, is can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil() into rounding upwards when it logically shouldn't. Not only that, but once rounded up, I worry it may be possible for a small "error" in representation to cause the number to be slightly less than a whole number, causing the cast to int to truncate it.
Is this worry unfounded? I remember a while back, an example in python where printing a specific whole number would cause it to print something very slightly less (like x.999, though I can't remember the exact number)
The reason I need to make sure, is I'm writing a texture buffer. The common case is whole numbers as floating point, but it'll occasionally get between values that need to be rounded to the nearest integer width and height that contains them. It increments in steps of power of 2, so the cost of rounding up needlessly can cause what should've only took a 256x256 texture to need a 512x512 texture.

If floatValue is exact, then there is no problem with rounding in your code. The only possible problem is overflow (if the result doesn't fit inside an int). Of course with such large values, the float will typically not have enough precision to distinguish adjacent integers anyway.
However, the danger usually lies in floatValue itself not being exact. For example, if it is the result of some computation whose exact answer is a whole number, it may end up a tiny amount greater than a whole number due to floating point rounding errors in the computation.
So whether you have a problem depends on how you got floatValue.

can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil()
Yes, some large numbers are impossible to represent exactly as floating-point numbers. In the zone where this happens, all floating-point numbers are integers. The error is not minuscule: the error in representing an integer by a floating-point, if error there is, is at least one. And, obviously, in the zone where some integers cannot be represented as floating-point and where all floating-point numbers are integers, ceil(f) == f.
The zone in question is |f| > 224 (16*1024*1024) for IEEE 754 single-precision and |f| > 253 for IEEE 754 double-precision.
A problem you are more likely to come across does not come from the impossibility of representing integers in floating-point format but from the cumulative effects of rounding errors. If your compiler offers IEEE 754 (the floating-point standard implemented exactly by the SSE2 instructions of modern and not so modern Intel processors) semantics, then any +, -, *, / and sqrt operation that results in a number exactly representable as floating-point is guaranteed to produce that result, but if several of the operations you apply do not have exactly representable results, the floating-point computation may drift away from the mathematical computation, even when the final result is an integer and is exactly representable. Then you may end up with a floating-point result slightly above the target integer and cause ceil() to return something other than you would have obtained with exact mathematical computations.
There are ways to be confident that some floating-point operations are exact (because the result is always representable). For instance (double)float1 * (double)float2, where float1 and float2 are two single-precision variables, is always exact, because the mathematical result of the multiplication of two single-precision numbers is always representable as a double. By doing the computation the “right” way, it is possible to minimize or eliminate the error in the end result.

The range is 0.0 to ~1024.0
All integers in this range can be represented exactly as float, so you'll be fine.
You'll only start having issues once you stray beyond the 24 bits of mantissa afforded by float.

Related

I'm trying to round a float to two decimal points but it's incorrect. How to fix this rounding error in C++?

I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal points. But I can't do it when the third decimal point is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, it means some usual round(), rint(), etc. after multiplication by 1,000,000. With this step, you are getting exact decimal number. After this, you are able to make the required final rounding to what you need.
In your example, this will round 1,004,999.99... to 1,005,000. Then, divide by 10000 and round again.
(Notice that there are suggestions to make this rounding in yet specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest this rounding is done in the way that exact fractional part 0.5 shall be rounded away from zero unless least significant result bit becomes 0 or 5, in that case it is rounded toward zero. But, if you have no such rounding available, a general away-from-zero is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point at all and use fixed-point arithmetic (emulated with integers, if needed). This is better because you the methods I've described for rounding are inevitably containing conversion to integers (and back), so, it's cheaper to use such integers directly. You will get inexact operation checking as well (by cost of explicit integer overflow).
If you can use a library like boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.

Printing exact values for floats

How does a program (MySQL is an example) store a float like 0.9 and then return it to me as 0.9? Why does it not return 0.899...?
The issue I am currently experiencing is retrieving floating point values from MySQL with C++ and then reprinting those values.
There are software libraries, like Gnu MP that implement arbitrary precision arithmetic, that calculate floating point numbers to specified precision. Using Gnu MP you can, for example, add 0.3 to 0.6, and get exactly 0.9. No more, no less.
Database servers do pretty much the same thing.
For normal, run of the mill applications, native floating point arithmetic is fast, and it's good enough. But database servers typically have plenty of spare CPU cycles. Their limiting factors will not be available CPU, but things like available I/O bandwidth. They'll have plenty of CPU cycles to burn on executing complicated arbitrary precision arithmetic calculations.
There are a number of algorithms for rounding floating point numbers in a way that will result in the same internal representation when read back in. For an overview of the subject, with links to papers with full details of the algorithms, see
Printing Floating-Point Numbers
What's happening, in a nutshell, is that the function which converts the floating-point approximation of 0.9 to decimal text is actually coming up with a value like 0.90000....0123 or 0.89999....9573. This gets rounded to 0.90000...0. And then these trailing zeros are trimmed off so you get a tidy looking 0.9.
Although floating-point numbers are inexact, and often do not use base 10 internally, they can in fact precisely save and recover a decimal representation. For instance, an IEEE 754 64 bit representation has enough precision to preserve 15 decimal digits. This is often mapped to the C language type double, and that language has the constant DBL_DIG, which will be 15 when double is this IEEE type.
If a decimal number with 15 digits or less is converted to double, it can be coverted back to exactly that number. The conversion routine just has to round it off at 15 digits; of course if the conversion routine uses, say, 40 digits, there will be messy trailing digits representing the error between the floating-point value and the original number. The more digits you print, the more accurately rendered is that error.
There is also the opposite problem: given a floating-point object, can it be printed into decimal such that the resulting decimal can be scanned back to reproduce that object? For an IEEE 64 bit double, the number of decimal digits required for that is 17.

When working with floating point numbers, is it good to normalize between 0 and 1?

I have numbers in the range of let's say 1e10 and 1e11. Is it better to normalize those numbers to [0;1] before making any calculations and/or comparisons for the sake of accuracy? I wonder because I heard that between 0 and 1 there are as many representable numbers than from 1 to infinity. However I can't find a source for that.
You can't increase the precision of an existing floating point number. There is no "hidden" precision that can be recovered through normalization, on the contrary, normalization is more likely to reduce the precision of a number due to rounding error. That said, there are some mathematical operations that may produce a more precise result if the inputs are normalized in some way first, but that depends specifically on the operations you are performing.
Floating point numbers are stored in memory in, well, floating point, that is scientific notation. That is 1.23456789e10, 1.23456789e-10 and 1.23456789 will all hold the same number of significant digits.
It is true that, mathematically, there are infinite numbers between 0 and 1, (that would be Aleph-1?), but that is irrelevant to the discussion, because a floating point variable can only hold so many different values. For example a 4-byte floating point variable has 32 bits, so it is impossible to make more than 2^32 different floating point values.

controlling overflow and loss in precision while multiplying doubles

ques:
I have a large number of floating point numbers (~10,000 numbers) , each having 6 digits after decimal. Now, the multiplication of all these numbers would yield about 60,000 digits. But the double range is for 15 digits only. The output product has to have 6 digits of precision after decimal.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product to reach into subnormal territory (and in this case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.