controlling overflow and loss in precision while multiplying doubles


Question:
I have a large number of floating-point numbers (~10,000), each with 6 digits after the decimal point. Multiplying all of them would yield about 60,000 digits, but a double holds only about 15 significant digits. The output product must have 6 digits of precision after the decimal point.
My approach:
I thought of multiplying these numbers by 10^6, then multiplying them together, and later dividing by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield the correct result.
Is there an easier alternative?

"I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12."
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial products from reaching subnormal territory (and in this case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the underflow and overflow problems and leaves you with a relative accuracy on the end result such that, once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
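The wider-accumulator idea above can be sketched roughly as follows. This is only an illustration: product_wide is a hypothetical name, and it assumes long double really is wider than double on your platform (true for the 80-bit x87 format with GCC/Clang on x86, not true with MSVC, where long double is the same as double).

```cpp
#include <vector>

// Hypothetical helper: multiply all values in a long double accumulator and
// round to double only once, at the very end.
// Assumption: long double has more range and precision than double here.
double product_wide(const std::vector<double>& values) {
    long double acc = 1.0L;
    for (double v : values) {
        acc *= v;  // intermediate products stay in the wider format
    }
    return static_cast<double>(acc);
}
```

Called as product_wide(numbers), this keeps intermediate rounding errors and intermediate underflow/overflow in the wider format; only the final conversion rounds to double.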

Related

I'm trying to round a float to two decimal points but it's incorrect. How to fix this rounding error in C++?

I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal points. But I can't do it when the third decimal point is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.

When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some usual rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, it means applying a usual round(), rint(), etc. after multiplication by 1,000,000. With this step, you get an exact decimal number. After this, you can make the required final rounding to what you need.
In your example, this will round 1,004,999.99... to 1,005,000. Then, divide by 10000 and round again.
(Notice that there are suggestions to do this rounding in a more specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest that an exact fractional part of 0.5 be rounded away from zero unless the least significant result digit becomes 0 or 5, in which case it is rounded toward zero. But, if you have no such rounding available, a general away-from-zero rounding is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point altogether and use fixed-point arithmetic (emulated with integers, if needed). This is better because the methods I've described for rounding inevitably involve conversion to integers (and back), so it's cheaper to use such integers directly. You will get inexact operation checking as well (at the cost of explicit integer overflow).
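A minimal sketch of the two-step rounding described above, assuming the input carries at most 6 meaningful fractional digits and fits comfortably in a long long after scaling. The name round_to_hundredths is made up for illustration; it returns the result as an integer count of hundredths.

```cpp
#include <cmath>

// Hypothetical helper: round x to two decimals, returned as an integer
// number of hundredths (e.g. 1.005 -> 101).
// Assumption: x has at most 6 meaningful fractional digits.
long long round_to_hundredths(double x) {
    // Step 1: snap to an exact integer number of millionths.
    // 1.005 is stored as 1.00499999..., but 1004999.99... rounds to 1005000.
    long long micro = std::llround(x * 1e6);
    // Step 2: round the exact integer to hundredths, halves away from zero.
    // Integer division truncates toward zero, so the bias carries the sign.
    long long half = (micro >= 0) ? 5000 : -5000;
    return (micro + half) / 10000;
}
```

Keeping the result as an integer avoids reintroducing a binary approximation; format it as cents / 100 and cents % 100 when printing.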

If you can, use a library like Boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.

Representation of a Gradual underflow program

I was reading about the gradual underflow concept and how important it is in the music industry (Gradual overflow Application in Music).
I understand the problem of buffer overflow well, but I don't know how to represent an underflow.
Can you please give me an example (a program, preferably in C or C++) of how a computer handles gradual underflow?

Gradual underflow is related to what IEEE 754 calls "subnormal" numbers.
Consider using scientific notation in which you have (say) 10 digits of precision and exponents that can range from -99 through 99.
Under normal circumstances, you treat everything as scientific notation, so if you want to represent 1000, you represent it as 1e3 -- that is, 1 * 10^3.
Now, consider a number like 1.234e-102. The smallest exponent you can represent is -99. So, if you do your job the simplest possible way, you simply decide that, since it has an exponent smaller than that, it's just 0. That would be "fast underflow".
In IEEE 754 (and related standards) you can store that as (essentially) 0.001234 * 10^-99. In doing so, you may lose some precision compared to a normal number that has an exponent in the -99...99 range. On the other hand, you lose less than if you just rounded it to zero because it has an exponent smaller than -99. In fact, in this case it started with 4 significant digits, and as represented it retains all 4 significant digits.
On a computer, the numbers are represented in binary, so the numbers of significant digits and/or maximum range of exponents aren't round numbers when converted to decimal, but the same basic idea applies--when we have a number that's too small to represent in the normal format, we can still store it with the smallest exponent that can be represented, but also includes some leading zeros.
This does lead to one difficulty: numbers are normally stored in what's called normalized form. The "significand" part is normalized by shifting it left until the first digit is a 1 (keep in mind that since it's binary it can only be 0 or 1). Since we know it's a 1, we cheat a little: we don't normally store that 1 in the number as it's stored. So, a double precision floating point number normally has 53 bits of precision, but only actually stores 52 bits of significand.
With a subnormal number, that's no longer the case. That's not terribly difficult to deal with, but it still introduces a special case--and one that's only rarely used, so CPU designers (and such) rarely try to optimize for it. As a result, the exact same code can suddenly run a lot slower when executing on data that contains subnormals.
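As a rough illustration in C++, assuming IEEE 754 double precision with gradual underflow enabled (no flush-to-zero mode, no -ffast-math), the sketch below repeatedly halves the smallest normal double and counts how many subnormal values it passes through before reaching zero. The function name is invented for this example.

```cpp
#include <cmath>
#include <limits>

// Hypothetical demo: start at the smallest *normal* double and keep halving.
// With gradual underflow the value does not jump straight to zero; it walks
// through the subnormal range, losing one bit of precision per step.
int count_subnormal_halvings() {
    double x = std::numeric_limits<double>::min();  // smallest normal, 2^-1022
    int count = 0;
    while (true) {
        x /= 2.0;
        if (x == 0.0) break;  // finally underflowed to zero
        if (std::fpclassify(x) == FP_SUBNORMAL) ++count;
    }
    return count;  // 52 for IEEE 754 binary64: one step per stored significand bit
}
```

With "fast underflow" the first halving would already give 0; with gradual underflow there are 52 intermediate steps, each with one bit less precision than the previous one.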

Rounding a float upward to an integer, how reliable is that?

I've seen static_cast<int>(std::ceil(floatValue)); before.
My question though, is can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil() into rounding upwards when it logically shouldn't. Not only that, but once rounded up, I worry it may be possible for a small "error" in representation to cause the number to be slightly less than a whole number, causing the cast to int to truncate it.
Is this worry unfounded? I remember a while back, an example in python where printing a specific whole number would cause it to print something very slightly less (like x.999, though I can't remember the exact number)
The reason I need to make sure, is I'm writing a texture buffer. The common case is whole numbers as floating point, but it'll occasionally get between values that need to be rounded to the nearest integer width and height that contains them. It increments in steps of power of 2, so the cost of rounding up needlessly can cause what should've only took a 256x256 texture to need a 512x512 texture.

If floatValue is exact, then there is no problem with rounding in your code. The only possible problem is overflow (if the result doesn't fit inside an int). Of course with such large values, the float will typically not have enough precision to distinguish adjacent integers anyway.
However, the danger usually lies in floatValue itself not being exact. For example, if it is the result of some computation whose exact answer is a whole number, it may end up a tiny amount greater than a whole number due to floating point rounding errors in the computation.
So whether you have a problem depends on how you got floatValue.

"can I absolutely count on this not "needlessly" rounding up? I've read that some whole numbers can't be perfectly represented in floating point, so my worry is that the miniscule "error" will trick ceil()"
Yes, some large numbers are impossible to represent exactly as floating-point numbers. In the zone where this happens, all floating-point numbers are integers. The error is not minuscule: the error in representing an integer by a floating-point, if error there is, is at least one. And, obviously, in the zone where some integers cannot be represented as floating-point and where all floating-point numbers are integers, ceil(f) == f.
The zone in question is |f| > 2^24 (16*1024*1024) for IEEE 754 single-precision and |f| > 2^53 for IEEE 754 double-precision.
A problem you are more likely to come across does not come from the impossibility of representing integers in floating-point format but from the cumulative effects of rounding errors. If your compiler offers IEEE 754 (the floating-point standard implemented exactly by the SSE2 instructions of modern and not so modern Intel processors) semantics, then any +, -, *, / and sqrt operation that results in a number exactly representable as floating-point is guaranteed to produce that result, but if several of the operations you apply do not have exactly representable results, the floating-point computation may drift away from the mathematical computation, even when the final result is an integer and is exactly representable. Then you may end up with a floating-point result slightly above the target integer and cause ceil() to return something other than you would have obtained with exact mathematical computations.
There are ways to be confident that some floating-point operations are exact (because the result is always representable). For instance (double)float1 * (double)float2, where float1 and float2 are two single-precision variables, is always exact, because the mathematical result of the multiplication of two single-precision numbers is always representable as a double. By doing the computation the “right” way, it is possible to minimize or eliminate the error in the end result.
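Both points can be sketched in a few lines, assuming IEEE 754 semantics with plain double evaluation (for example SSE2, not x87 extended intermediates). The function names are made up for illustration.

```cpp
#include <cmath>

// Exact: a 24-bit significand times a 24-bit significand needs at most
// 48 bits, which fits in the 53-bit significand of a double.
double exact_product(float a, float b) {
    return static_cast<double>(a) * static_cast<double>(b);
}

// Drift: mathematically 0.1 * 3 * 10 is 3, but 0.1 * 3 rounds to
// 0.30000000000000004, and multiplying by 10 lands just above 3.
double drifted_three() {
    double x = 0.1 * 3.0;
    return x * 10.0;
}
```

Here std::ceil(drifted_three()) gives 4 rather than 3, which is exactly the "needless" rounding up the question worries about; it comes from the computation that produced the value, not from ceil() itself.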

The range is 0.0 to ~1024.0
All integers in this range can be represented exactly as float, so you'll be fine.
You'll only start having issues once you stray beyond the 24 bits of mantissa afforded by float.
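A quick way to check this for a particular integer is a round trip through float. This is only a sketch; float_holds_integer is a hypothetical helper and assumes an IEEE 754 single-precision float and an argument well inside the range of long long.

```cpp
// Hypothetical helper: true if n survives a round trip through float,
// i.e. float can represent it exactly.
bool float_holds_integer(long long n) {
    float f = static_cast<float>(n);
    return static_cast<long long>(f) == n;
}
```

Every integer up to 2^24 = 16777216 passes; 16777217 is the first positive integer that does not.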

How to correctly normalize a floating point value in C++?

Maybe I don't understand the IEEE754 standard that much, but given a set of floating point values that are float or double, for example :
56.543f 3238.124124f 121.3f ...
you are able to convert them in values ranging from 0 to 1, so you normalize them, by taking an appropriate common factor while considering what is the maximum value and the minimum value in the set.
Now my point is that in this transformation I need a much higher precision for the set of destination that ranges from 0 to 1 if compared to the level of precision that I need in the first one, especially if the values in the first set are covering a wide range of numerical values ( really big and really small values ).
How the float or the double ( or the IEEE 754 standard if you want ) type can handle this situation while providing more precision for the second set of values knowing that I will basically not need an integer part ?
Or it doesn't handle this at all and I need fixed point math with a totally different type ?

Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation to the top of the significand. Each value is carried with the same number of binary digits of precision relative to its own magnitude.
When you compress your set of floating point values to the range 0..1, the only precision loss you will get will be due to the rounding that occurs in the various steps of the process.
If you're merely compressing by scaling, you will lose only a small amount of precision near the LSBs of the mantissa (around 1 or 2 ulp, where ulp means "units of the last place").
If you also need to shift your data, then things get trickier. If your data is all positive, then subtracting off the smallest number will not damage anything. But, if your data is a mixture of positive and negative data, then some of your values near zero may suffer a loss in precision.
If you do all the arithmetic at double precision, you'll carry 53 bits of precision through the calculation. If your precision needs fit within that (which likely they do), then you'll be fine. Otherwise, the exact numerical performance will depend on the distribution of your data.
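A minimal sketch of min/max normalization done entirely in double, along the lines suggested above. The function name and the choice to map a constant data set to 0.0 are assumptions made for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: map float inputs to [0, 1], doing the subtraction and
// division in double so 53 bits of precision are carried throughout.
std::vector<double> normalize(const std::vector<float>& values) {
    std::vector<double> out;
    if (values.empty()) return out;
    auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    const double span = hi - lo;
    out.reserve(values.size());
    for (float v : values) {
        // Assumption: if all inputs are equal, map everything to 0.0.
        out.push_back(span == 0.0 ? 0.0 : (static_cast<double>(v) - lo) / span);
    }
    return out;
}
```

The smallest input maps to exactly 0.0 and the largest to exactly 1.0; values in between pick up at most a few rounding errors from the subtraction and the division.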

Single and double IEEE floats have a format where the exponent and fraction parts have fixed bit-width. So this is not possible (i.e. you will always have unused bits if you only store values between 0 and 1). (See: http://en.wikipedia.org/wiki/Single-precision_floating-point_format)
Are you sure the 52-bit wide fraction part of a double is not precise enough?
Edit: If you use the whole range of the floating format, you will lose precision when normalizing the values. The roundings can be off and enough small values will become 0. Unless you know that this is a problem, don't worry. Otherwise you have to look up some other solution as mentioned in other answers.

Having binary floating point values (with an implicit leading one) expressed as
(1+fraction) * 2^exponent where fraction < 1
A division a/b is:
a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b))
Hence division/multiplication has essentially no loss of precision.
A subtraction a-b is:
a-b = (1+fraction(a)) * 2^exponent(a) - (1+fraction(b)) * 2^exponent(b)
Hence a subtraction/addition might have a loss of precision (big - tiny == big) !
Clamping a value x in a range [min, max] to [0, 1]
(x - min) / (max - min)
will have precision issues if any subtraction has a loss of precision.
Answering your question:
Nothing does this for you: choose a suitable representation (floating point, fraction, multi precision ...) for your algorithms and expected data.
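The "big - tiny == big" effect is easy to observe directly. A small sketch, assuming IEEE 754 double precision; is_absorbed is a made-up name.

```cpp
// Hypothetical helper: true if subtracting tiny from big changes nothing,
// i.e. tiny is smaller than half the spacing of doubles around big.
bool is_absorbed(double big, double tiny) {
    return (big - tiny) == big;
}
```

For example, doubles near 1e17 are 16 apart, so is_absorbed(1e17, 1.0) is true, while is_absorbed(1000.0, 1.0) is false. In (x - min) / (max - min) the same thing happens whenever min is tiny compared to x.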

If you have a selection of doubles and you normalize them to between 0.0 and 1.0, there are a number of sources of precision loss. They are all, however, much smaller than you suspect.
First, you will lose some precision in the arithmetic operations required to normalize them as rounding occurs. This is relatively small -- a bit or so per operation -- and usually relatively random.
Second, the exponent component will no longer be using the positive exponent possibility.
Third, as all the values are positive, the sign bit will also be wasted.
Fourth, if the input space does not include +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.
But, for the most part, you'll waste about 3 bits of information in a 64 bit double in your normalization, one of which being the kind of thing that is nearly unavoidable when you deal with finite-bit-width values.
Any 64 bit fixed point representation of the values from 0 to 1 will have far less "range" than doubles. A double can represent something on the order of 10^-300, while a 64 bit fixed point representation that includes 1.0 can only go as low as 10^-19 or so. (The 64 bit fixed point representation can represent 1 - 10^-19 as being distinct from 1, while the double cannot, but the 64 bit fixed point value can not represent anything smaller than 2^-64, while doubles can).
Some of the numbers above are approximate, and may depend on rounding/exact format.

For higher precision you can try http://www.boost.org/doc/libs/1_55_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html.
Note also, that for the numerical critical operations +,- there are special algorithms that minimize the numerical error introduced by the algorithm:
http://en.wikipedia.org/wiki/Kahan_summation_algorithm
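For reference, a compact sketch of Kahan (compensated) summation next to a naive sum. It assumes the compiler does not reassociate floating-point operations (so no -ffast-math), since that would optimize the compensation away.

```cpp
#include <vector>

// Plain left-to-right summation, for comparison.
double naive_sum(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum;
}

// Kahan summation: c carries the low-order bits lost by each addition
// and feeds them back into the next one.
double kahan_sum(const std::vector<double>& values) {
    double sum = 0.0;
    double c = 0.0;  // running compensation
    for (double v : values) {
        double y = v - c;    // apply the correction from the previous step
        double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;   // recover (the negative of) what was lost
        sum = t;
    }
    return sum;
}
```

Summing 1.0 followed by ten copies of 1e-16 shows the difference: the naive sum stays at exactly 1.0 because each small term is absorbed, while the compensated sum ends up close to 1.000000000000001.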

machine precision and max and min value of a double-precision type

(1) I have met several cases where epsilon is added to a non-negative variable to guarantee a nonzero value. So I wonder why not add the minimum value that the data type can represent instead of epsilon? What are the different problems that these two can solve?
(2) Also I notice that the inverse of the maximum value of a double precision type is bigger than its min value, and inverse of its min value is inf, way bigger than its max value. Is it useful to compute the reciprocals of its max and min values?
(3) For a very small positive number of double type, to compute its reciprocal, how small it is when its reciprocal starts to not make sense? Is it better to put an upper bound on the reciprocal? How much is the bound?
Thanks and regards

Epsilon
Epsilon is the gap between 1.0 and the next larger representable value, so it is roughly the smallest value that can be added to 1.0 and produce a result that's distinguishable from 1.0. As Poita_ implied, this is useful for dealing with rounding errors. The situation is pretty simple: a normal floating point number has precision that remains fixed, regardless of the magnitude of the number. To put that slightly differently, it always computes to the same number of significant digits. For example, a typical implementation of double will have around 15 to 16 significant digits (Epsilon is about 2.2e-16 for an IEEE 754 double; the examples below round that to ~1e-15). If you're working with a number in the range 1e-200, the smallest change it can represent will be around 1e-215. If you're working with a number in the range 1e+200, the smallest change it can represent will be around 1e+185.
Meaningful use of Epsilon normally requires scaling it to the range of the numbers you're working with, and using that to define a range you're willing to accept as probably due to rounding errors, so if two numbers fall within that range, you assume they're probably really equal. For example, with Epsilon of 1e-15, you might decide to treat numbers that fall within 1e-14 of each other as equal (i.e. one significant digit has been lost to rounding).
The smallest number that can be represented will normally be dramatically smaller than that. With that same typical double, it's usually going to be around 1e-308. This would be equivalent to Epsilon if you were using fixed point numbers instead of floating point numbers. For example, at one time quite a few people used fixed-point for various graphics. A typical version was a 16-bit integer broken into something like 10 bits before the binary point and six bits after it. Such a number can represent numbers from roughly 0 to 1024, with about two (decimal) digits after the decimal point. Alternatively, you can treat it as signed, running from (roughly) -512 to +512, again with around two digits after the decimal point.
In this case, the scaling factor is fixed, so the smallest difference that can be represented between two numbers is also fixed -- i.e. the difference between 1024 and the next larger number is exactly the same as the difference between 0 and the next larger number.
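A minimal sketch of a comparison that scales Epsilon to the magnitude of the operands, as described above. The name nearly_equal and the default tolerance factor of 4 are arbitrary choices for illustration, not a recommendation; the right factor depends on how much rounding your computation accumulates.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: treat a and b as equal if they differ by no more than
// `factor` epsilons, scaled to the larger magnitude of the two.
// Note: a purely relative test like this is very strict when one operand is
// exactly zero; such cases usually need an extra absolute tolerance.
bool nearly_equal(double a, double b, double factor = 4.0) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    const double tolerance = std::numeric_limits<double>::epsilon() * scale * factor;
    return std::fabs(a - b) <= tolerance;
}
```

With this, nearly_equal(0.1 + 0.2, 0.3) is true even though 0.1 + 0.2 == 0.3 is false, while nearly_equal(1.0, 1.0001) is still false.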
Reciprocals
I'm not sure exactly why you're concerned with computing reciprocals of extremely large or extremely small numbers. IEEE floating point uses denormals, which means numbers close to the limits of the range lose precision. Basically, a number is divided into an exponent and a significand. The exponent contains the magnitude of the number, and the significand contains the significant digits. Each is represented with a specified number of bits. In the usual case, numbers are normalized, which means they're vaguely similar to the scientific notation we all learned in school. In scientific notation, you always adjust the significand and exponent so there's exactly one place before the decimal point, so (for example) 140 becomes 1.4e2, 20030 becomes 2.003e4, and so on.
Think of this as the "normalized" form of a floating point number. Assume, however, that you're limited to an exponent having 2 digits, so it can only run from -99 to +99. Also assume that you can have a maximum of 15 significant digits. Within those limitations, you could produce a number like 0.00001002e-99. This lets you represent a number smaller than 1e-99, at the expense of losing some precision -- instead of 15 digits of precision, you've used 5 digits of your significand to represent magnitude, so you're left with only 10 digits that are really significant.
Except that it's in binary instead of decimal, IEEE floating point works roughly that way.
As you approach the end of the range, the numbers have less and less precision, until (at the very end of the range) you have only one bit of precision left.
If you take that number that has only one bit of precision, and take its reciprocal you get an extremely large number -- but since you only started with one bit of precision, the result can only have one bit of precision as well. Although slightly better than no result at all, it's still pretty close to meaningless. You've reached the limit of what the number of bits can represent; about the only way to cure the problem is to use more bits.
There's not really any one point at which a reciprocal (or other computation) "stops making sense". It's not really a hard line where one result makes sense, and another doesn't. Rather, it's a slope, where one result might have 15 digits of precision, another 10 and a third only 1. What "makes sense" or not is mostly how you interpret that result. To get meaningful results, you need a fair idea of how many digits in your final result are really meaningful.
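The two ends of the range can be probed with a couple of one-liners. This is a sketch assuming IEEE 754 double precision with subnormals enabled; the helper names are invented.

```cpp
#include <cmath>
#include <limits>

// True if 1/x is too large to represent and becomes infinity.
bool reciprocal_overflows(double x) {
    return std::isinf(1.0 / x);
}

// True if 1/x lands in the reduced-precision subnormal range.
bool reciprocal_is_subnormal(double x) {
    return std::fpclassify(1.0 / x) == FP_SUBNORMAL;
}
```

The reciprocal of the smallest subnormal (std::numeric_limits<double>::denorm_min()) overflows to inf, the reciprocal of the smallest normal value (min()) is still finite, and the reciprocal of the largest value (max()) is itself a subnormal with reduced precision.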

You need to understand how floating point numbers are represented in the CPU. In the data type, 1 bit is reserved for the sign, i.e. whether it is a positive or negative number, (yes you can have positive and negative 0 in floating point numbers,) then a number of bits is reserved for the significand (or mantissa,) these are the significant digits in the floating point number and finally a number of bits is reserved for the exponent. The value of the floating point number now is:
(-1)^sign * significand * 2^exponent
This means the smallest number is a very small value, namely the smallest significand with the lowest exponent. The rounding error however is much larger and depends on the magnitude of the number, namely the smallest number with a given exponent. The epsilon is the difference between 1.0 and the next representable larger value. That's why epsilon is used in code that is robust for rounding errors, and really you should scale the epsilon with the magnitude of the numbers you work with if you do it right. The smallest representable value is not really of any significant use normally.
You're seeing the difference between the normalized and denormalized minimum. The problem is that due to the way the significand is used it is possible to make a larger negative exponent than a positive one, say the bit pattern of the significand is all zeros except the last bit, which is one, then the exponent is effectively lowered by the number of bits in the significand. For the maximum you cannot do this, even if you set the significand to all ones, the effective exponent will still only be the exponent that is given. i.e. think of the difference between 0.000001e-10 and 9.999999e+10, the first is much smaller than the second is big. The first is actually 1e-16 while the second is approx 1e+11.
It depends on the precision of the floating point number, of course. In the case of double precision, the difference between the maximum and the next smaller value is already huge (along the lines of 10^292), so your rounding errors will be very big. If the value is too small, you will simply get inf instead, as you already saw. Really, there is no strict answer; it depends entirely on the precision of the numbers you need. Given that the rounding error is approximately epsilon*magnitude, the reciprocal of epsilon (1/epsilon) already has a rounding error of around 1.0. If you need numbers to be accurate to 1e-3, then even epsilon would be too small to divide by.
See these wikipedia pages on IEEE754 and Machine epsilon for some background info.
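The spacing figures quoted above can be measured with std::nextafter. A small sketch, assuming IEEE 754 double precision and finite positive inputs; the function names are made up.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next larger representable double.
double spacing_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Gap between x and the next smaller representable double.
double spacing_below(double x) {
    return x - std::nextafter(x, -std::numeric_limits<double>::infinity());
}
```

spacing_above(1.0) is epsilon by definition, spacing_above(1.0 / epsilon) is exactly 1.0, and spacing_below(std::numeric_limits<double>::max()) is 2^971, roughly 2e292.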

Epsilons are added to test equality between two values that should be equal, but aren't because of rounding errors. While you could use the smallest positive value for epsilon, it wouldn't be optimal, because it's simply too small. The rounding errors caused by floating point arithmetic almost always exceed that smallest value, so a larger epsilon is needed. How large depends on your desired accuracy.
I don't understand the question. Are the reciprocals useful for what? I can't think of any reason why they would be useful.
In general, dividing by very small values is a bad idea as it will cause very large rounding errors. I'm not sure what you mean by adding an upper bound. Just avoid dividing by small values wherever possible.
