I am writing code for emulating floating point addition in C++ using integer addition and shifts for some homework. I have googled the topic and I am able to add floating point numbers by aligning exponents and then adding. The problem is that I could not find the appropriate algorithm for rounding off the result. Right now I am using truncation. It shows errors of something like 0.000x magnitude, but when I try to use this adder for complex calculations like FFTs, it shows enormous errors.
So what I am looking for now is the exact algorithm that my machine uses for rounding off floating point results. It would be great if someone could post a link for the purpose.
Thanks in advance.
Most commonly, if the bits to be rounded away represent a value less than half that of the smallest bit to be retained, they are rounded downward, the same as truncation. If they represent more than half, they are rounded upward, thus adding one in the position of the smallest retained bit. If they are exactly half, they are rounded downward if the smallest retained bit is zero and upward if the bit is one. This is called “round-to-nearest, ties to even.”
This presumes you have all the bits you are rounding away, that none have been lost yet in the course of doing arithmetic. If you cannot keep all the bits, there are techniques for keeping track of enough information about them to do the correct rounding, such as maintaining three bits called guard, round, and sticky bits.
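Applied to the adder in the question, the rule is easy to express on the integer significand you already have. Below is a minimal sketch (the function and parameter names are mine, not from any reference implementation), assuming all of the bits to be rounded away are still present in the integer, i.e. the case described in the first paragraph:

#include <cstdint>

// Drop `shift` low bits of an aligned significand and apply
// round-to-nearest, ties-to-even. Assumes 0 < shift < 64; a carry out of
// the top bit may bump the exponent, which the caller has to handle.
uint64_t round_nearest_even(uint64_t sig, unsigned shift)
{
    uint64_t discarded = sig & ((uint64_t(1) << shift) - 1); // bits being dropped
    uint64_t halfway   = uint64_t(1) << (shift - 1);         // weight of the highest dropped bit
    uint64_t result    = sig >> shift;                        // plain truncation

    if (discarded > halfway ||                                // more than half: round up
        (discarded == halfway && (result & 1) != 0))          // exactly half: up only if odd
        ++result;
    return result;
}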
Related
I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal places, but I can't do it when the third decimal digit is 5, because the value is not stored exactly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, apply a usual round(), rint(), etc. after multiplying by 1,000,000. With this step you get an exact decimal number. After that, you can perform the required final rounding.
In your example, this rounds 1,004,999.99... to 1,005,000. Then divide by 10,000 (giving 100.5), round again (to 101), and divide by 100 to get 1.01.
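A minimal sketch of that two-step rounding in C++ (the six-digit and two-digit scales are taken from the discussion above; everything else is illustrative):

#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.005;                        // actually stored as 1.00499999...
    double micro = std::round(x * 1e6);      // step 1: recover the exact decimal, 1005000
    double cents = std::round(micro / 1e4);  // step 2: round to two decimals in integer form, 101
    std::printf("%.2f\n", cents / 100.0);    // prints 1.01
}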
(Notice that there are suggestions to perform this rounding in a specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest that an exact fractional part of 0.5 be rounded away from zero unless the least significant digit of the result would become 0 or 5, in which case it is rounded toward zero. But if no such rounding is available, a general away-from-zero rounding is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point altogether and use fixed-point arithmetic (emulated with integers, if needed). This is better because the methods I've described inevitably involve conversion to integers (and back), so it's cheaper to use such integers directly. You also get inexact-operation checking, at the cost of having to check for integer overflow explicitly.
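As an illustration of the fixed-point idea (a sketch only; the Money type and the cent-based scale are assumptions for the example, not a library API):

#include <cstdint>
#include <cstdio>

// Amounts are stored as an integer number of cents; no floating point involved.
struct Money { std::int64_t cents; };

Money add(Money a, Money b) { return { a.cents + b.cents }; }

int main()
{
    Money price = { 1005 };          // 10.05
    Money tax   = { 84 };            // 0.84
    Money total = add(price, tax);
    std::printf("%lld.%02lld\n",     // prints 10.89
                (long long)(total.cents / 100),
                (long long)(total.cents % 100));
}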
If you can, use a library like Boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.
Question:
I have a large number of floating point numbers (~10,000 numbers), each having 6 digits after the decimal point. Now, the product of all these numbers would have about 60,000 digits, but a double holds only about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them back to decimal. But this also appears cumbersome and may not yield the correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching subnormal territory (and in that case, multiplying by a power of two would be recommended, to avoid loss of accuracy in the scaling itself). There is no indication in your question that this happens (no example data set, no code), so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. To first order, you can assume that relative inaccuracies add up, so that multiplying 10,000 values produces a result that is about 9,999 machine epsilons away from the mathematical result in relative terms (*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
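A sketch of that approach, assuming long double is wider than double on your platform (the function name is illustrative):

#include <vector>

// Multiply all values at a wider precision, then round once at the end.
double product_wide(const std::vector<double>& values)
{
    long double acc = 1.0L;          // 80-bit extended format on typical x86 toolchains
    for (double v : values)
        acc *= v;                    // each step keeps extra guard digits
    return static_cast<double>(acc); // single final rounding back to double
}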
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 - 1, where ε is the machine epsilon. Also, in reality, relative errors often cancel each other out, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
I've seen static_cast<int>(std::ceil(floatValue)); before.
My question, though, is: can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the minuscule "error" will trick ceil() into rounding upwards when it logically shouldn't. Not only that, but once rounded up, I worry it may be possible for a small "error" in representation to cause the number to be slightly less than a whole number, causing the cast to int to truncate it.
Is this worry unfounded? I remember a while back, an example in Python where printing a specific whole number would cause it to print something very slightly less (like x.999, though I can't remember the exact number).
The reason I need to make sure is that I'm writing a texture buffer. The common case is whole numbers stored as floating point, but it'll occasionally get in-between values that need to be rounded up to the nearest integer width and height that contains them. Sizes increment in powers of 2, so rounding up needlessly can turn what should only have taken a 256x256 texture into a 512x512 texture.
If floatValue is exact, then there is no problem with rounding in your code. The only possible problem is overflow (if the result doesn't fit inside an int). Of course with such large values, the float will typically not have enough precision to distinguish adjacent integers anyway.
However, the danger usually lies in floatValue itself not being exact. For example, if it is the result of some computation whose exact answer is a whole number, it may end up a tiny amount greater than a whole number due to floating point rounding errors in the computation.
So whether you have a problem depends on how you got floatValue.
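For example (illustrative only, and assuming plain IEEE 754 single precision with no extended-precision accumulation by the compiler), a value that is mathematically a whole number can come out a hair above it, and ceil() then jumps to the next integer:

#include <cmath>
#include <cstdio>

int main()
{
    float f = 0.0f;
    for (int i = 0; i < 10; ++i)
        f += 0.1f;                              // mathematically 1.0, actually ~1.0000001
    std::printf("%d\n", (int)std::ceil(f));     // prints 2, not 1
}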
can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the minuscule "error" will trick ceil()
Yes, some large numbers are impossible to represent exactly as floating-point numbers. In the zone where this happens, all floating-point numbers are integers. The error is not minuscule: the error in representing an integer by a floating-point, if error there is, is at least one. And, obviously, in the zone where some integers cannot be represented as floating-point and where all floating-point numbers are integers, ceil(f) == f.
The zone in question is |f| > 2^24 (16*1024*1024) for IEEE 754 single precision and |f| > 2^53 for IEEE 754 double precision.
A problem you are more likely to come across does not come from the impossibility of representing integers in floating-point format but from the cumulative effects of rounding errors. If your compiler offers IEEE 754 semantics (the floating-point standard implemented exactly by the SSE2 instructions of modern and not-so-modern Intel processors), then any +, -, *, / and sqrt operation that results in a number exactly representable as floating-point is guaranteed to produce that result. But if several of the operations you apply do not have exactly representable results, the floating-point computation may drift away from the mathematical computation, even when the final result is an integer and is exactly representable. Then you may end up with a floating-point result slightly above the target integer, and ceil() returns something other than what you would have obtained with exact mathematical computations.
There are ways to be confident that some floating-point operations are exact (because the result is always representable). For instance (double)float1 * (double)float2, where float1 and float2 are two single-precision variables, is always exact, because the mathematical result of the multiplication of two single-precision numbers is always representable as a double. By doing the computation the “right” way, it is possible to minimize or eliminate the error in the end result.
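A small illustration of that last point (the specific values are arbitrary): the double product of two floats is exact because two 24-bit significands multiply into at most 48 bits, which fits in double's 53 bits, whereas the float product has to be rounded.

#include <cstdio>

int main()
{
    float a = 1.1f, b = 2.3f;
    double exact  = (double)a * (double)b;   // exact product of the two float values
    float  narrow = a * b;                   // the same product rounded to float
    std::printf("%.17g\n%.17g\n", exact, (double)narrow);  // the two printed values differ in the low digits
}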
The range is 0.0 to ~1024.0
All integers in this range can be represented exactly as float, so you'll be fine.
You'll only start having issues once you stray beyond the 24 bits of mantissa afforded by float.
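A quick sanity check along those lines (purely illustrative):

#include <cassert>
#include <cmath>

int main()
{
    // Every integer in the 0..1024 range round-trips through float unchanged,
    // so ceil() followed by the cast gives back exactly the same integer.
    for (int i = 0; i <= 1024; ++i)
        assert(static_cast<int>(std::ceil(static_cast<float>(i))) == i);
}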
Is it the case that:
Representable floating point values are densest in the real number line near zero?
Representable floating point values grow sparser (exponentially?) as the number line moves away from zero?
If the above two are true, does that mean there is less precision farther from zero?
Overall question: Does precision in some way refer to or depend on the density of numbers you can represent (accurately)?
The term precision usually refers to the number of significant digits (bits) in the represented value. So precision varies with the number of bits (or digits) in the mantissa of representation. Distance from the origin has no role.
What you say is true about the density of floats on the real line, but in this case the right term is accuracy, not precision. FP numbers of small magnitude are far more accurate than larger ones. This contrasts with integers, which have uniform accuracy over their range.
I highly recommend the paper What Every Computer Scientist Should Know About Floating Point Arithmetic, which covers this and much more.
Floating point numbers are basically stored in scientific notation. As long as they are normalized, they consistently have the same number of significant figures, no matter where you are on the number line.
If you consider density linearly, then the floating point numbers get exponentially more dense as you get closer to 0.
As you get extremely close to 0 and the exponent reaches its lowest point, the floating point numbers become denormalized. At that point the leading bit is no longer implicit, so they actually carry fewer significant figures; what you gain is that precision tapers off gradually instead of stopping abruptly at the underflow threshold.
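A quick way to see both effects is to print the gap to the next representable float at a few magnitudes (the sample values are arbitrary):

#include <cmath>
#include <cstdio>

int main()
{
    for (float x : { 1.0f, 1024.0f, 1048576.0f, 1e-40f })   // the last value is subnormal
        std::printf("gap after %g is %g\n", x, std::nextafter(x, INFINITY) - x);
}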
Representable floating point values are densest in the real number line near zero?
In a full implementation of IEEE 754 floating point yes.
However in systems that do not support subnormals, there is a gap around zero which is substantially larger than the difference between the smallest nonzero value and the second smallest nonzero value.
Representable floating point values grow sparser (exponentially?) as the number line moves away from zero?
Yes, each time the value passes a power of 2, the gap between adjacent values doubles.
If the above two are true, does that mean there is less precision farther from zero?
That depends on how exactly you define "precision", one can talk about precision in either a relative sense ("significant figures") or an absolute sense ("decimal places").
Which is more appropriate depends on what exactly the numbers are used for. Loss of precision when moving away from zero tends to become a real concern if floating point numbers are used for things like coordinates or timestamps.
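Loss of absolute precision with a large float timestamp, for instance, can make small increments disappear entirely (illustrative values):

#include <cstdio>

int main()
{
    float t  = 20000000.0f;          // e.g. seconds since start, about 231 days
    float dt = 0.5f;                 // half-second step
    std::printf("%.1f\n", t + dt);   // prints 20000000.0: the step is below one ULP at this magnitude
}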
Answers:
Representable floating point values are densest in the real number line near zero?
Yes
Representable floating point values grow sparser (exponentially? No, the density decreases hyperbolically, roughly like 1/x) as the number line moves away from zero?
Yes
If the above two are true, does that mean there is less precision farther from zero?
Yes
Overall question: Does precision in some way refer to or depend on the density of numbers you can represent (accurately)?
See https://stackoverflow.com/a/24179424
I also recommend What Every Computer Scientist Should Know About Floating Point Arithmetic
I have a floating point addition that is somewhat likely to go wrong because the values have different magnitudes, so quite a few significant digits are shifted out (possibly even all of them). In the scope of the entire calculation precision is not that relevant; what matters is that the result is greater than or equal to what the result would be with arbitrary precision (I'm keeping track of the end of a range here, and extending it by at least a certain amount).
So I'd need an addition that rounds up when bringing the summands to the same exponent (i.e. if one digit shifted out of a summand was set, the addition should take place with nextval(denormalized_summand, +infinity)).
Is there an easy way to perform this addition (manually denormalizing the smaller summand and using nextval on it springs to mind, but I doubt that would be efficient)?
You can set the FPU rounding mode to "upward" and then just add normally.
This is how it's done in GNU environments:
#include <fenv.h>
fesetround(FE_UPWARD);  // all subsequent floating-point results round toward +infinity
If you have a Microsoft compiler, the equivalent code is:
#include <float.h>
_set_controlfp(_RC_UP, _MCW_RC);  // _MCW_RC selects the rounding-control bits, _RC_UP sets them to round up
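Whichever compiler you use, it's worth saving and restoring the previous mode around the sensitive addition. A sketch using the standard <cfenv> interface (the same functions as the GNU snippet above; note that support for the FENV_ACCESS pragma varies between compilers):

#include <cfenv>

#pragma STDC FENV_ACCESS ON   // ask the compiler not to assume the default rounding mode

// Add two doubles with the result rounded toward +infinity, then restore the mode.
double add_rounding_up(double a, double b)
{
    const int old_mode = std::fegetround();
    std::fesetround(FE_UPWARD);   // subsequent results round toward +infinity
    double sum = a + b;           // the addition whose result must not be too small
    std::fesetround(old_mode);    // put the environment back the way it was
    return sum;
}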