I was reading about the gradual underflow concept and how it is op important in the music industry Gradual overflow Application in Music
I well understand the problem of an overflow buffer, but this i don't know how to represent an underflow.
Can you please give me an example(a program preferably in c or c++) as in how a computer handles gradual underflow?
Gradual underflow is related to what IEEE 754 calls "subnormal" numbers.
Consider using scientific notation in which you have (say) 10 digits of precision and exponents that can range from -99 through 99.
Under normal circumstances, you treat everything as scientific notation, so if you want to represent 1000, you represent it as 1e3 -- that is, 1 * 103.
Now, consider a number like 1.234e-102. The smallest exponent you can represent is -99. So, if you do your job the simplest possible way, you simply that since it has an exponent smaller than that, it's just 0. That would be "fast underflow".
In IEEE 754 (and related standards) you can store that as (essentially) 0.001234 * 10-99. In doing so, you may lose some precision compared to a normal number that has an exponent in the -99...99 range. On the other hand, you lose less than if you just rounded it to zero because it has an exponent smaller than -99. In fact, in this case it started with 4 significant digits, and as represented it retains all 4 significant digits.
On a computer, the numbers are represented in binary, so the numbers of significant digits and/or maximum range of exponents aren't round numbers when converted to decimal, but the same basic idea applies--when we have a number that's too small to represent in the normal format, we can still store it with the smallest exponent that can be represented, but also includes some leading zeros.
This does lead to one difficulty: numbers are normally stored in what's called normalized form. The "significand" part is normalized by shifting it left until the first digit is a 1 (keep in mind that since it's binary it can only be 0 or 1). Since we know it's a 1, we cheat a little: we don't normally store that 1 in the number as it's stored. So, a double precision floating point number normally has 53 bits of precision, but only actually stores 52 bits of significand.
With a subnormal number, that's no longer the case. That's not terribly difficult to deal with, but it still introduces a special case--and one that's only rarely used, so CPU designers (and such) rarely try to optimize for it. As a result, the exact same code can suddenly run a lot slower when executing on data that contains subnormals.
Related
ques:
I have a large number of floating point numbers (~10,000 numbers) , each having 6 digits after decimal. Now, the multiplication of all these numbers would yield about 60,000 digits. But the double range is for 15 digits only. The output product has to have 6 digits of precision after decimal.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product to reach into subnormal territory (and in this case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
c++ pow(2,1000) is normaly to big for double, but it's working. why?
So I've been learning C++ for couple weeks but the datatypes are still confusing me.
One small minor thing first: the code that 0xbadc0de posted in the other thread is not working for me.
First of all pow(2,1000) gives me this more than once instance of overloaded function "pow" matches the argument list.
I fixed it by changing pow(2,1000) -> pow(2.0,1000)
Seems fine, i run it and get this:
http://i.stack.imgur.com/bbRat.png
Instead of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
it is missing a lot of the values, what might be cause that?
But now for the real problem.
I'm wondering how can 302 digits long number fit a double (8 bytes)?
0xFFFFFFFFFFFFFFFF = 18446744073709551616 so how can the number be larger than that?
I think it has something to do with the floating point number encoding stuff.
Also what is the largest number that can possibly be stored in 8 bytes if it's not 0xFFFFFFFFFFFFFFFF?
Eight bytes contain 64 bits of information, so you can store 2^64 ~ 10^20 unique items using those bits. Those items can easily be interpreted as the integers from 0 to 2^64 - 1. So you cannot store 302 decimal digits in 8 bytes; most numbers between 0 and 10^303 - 1 cannot be so represented.
Floating point numbers can hold approximations to numbers with 302 decimal digits; this is because they store the mantissa and exponent separately. Numbers in this representation store a certain number of significant digits (15-16 for doubles, if I recall correctly) and an exponent (which can go into the hundreds, of memory serves). However, if a decimal is X bytes long, then it can only distinguish between 2^(8X) different values... unlikely enough for exactly representing integers with 302 decimal digits.
To represent such numbers, you must use many more bits: about 1000, actually, or 125 bytes.
It's called 'floating point' for a reason. The datatype contains a number in the standard sense, and an exponent which says where the decimal point belongs. That's why pow(2.0, 1000) works, and it's why you see a lot of zeroes. A floating point (or double, which is just a bigger floating point) number contains a fixed number of digits of precision. All the remaining digits end up being zero. Try pow(2.0, -1000) and you'll see the same situation in reverse.
The number of decimal digits of precision in a float (32 bits) is about 7, and for a double (64 bits) it's about 16 decimal digits.
Most systems nowadays use IEEE floating point, and I just linked to a really good description of it. Also, the article on the specific standard IEEE 754-1985 gives a detailed description of the bit layouts of various sizes of floating point number.
2.0 ^ 1000 mathematically will have a decimal (non-floating) output. IEEE floating point numbers, and in your case doubles (as the pow function takes in doubles and outputs a double) have 52 bits of the 64 bit representation allocated to the mantissa. If you do the math, 2^52 = 4,503,599,627,370,496. Because a floating point number can represent positive and negative numbers, really the integer representation will be ~ 2^51 = 2,251,799,813,685,248. Notice there are 16 digits. there are 16 quality (non-zero) digits in the output you see.
Essentially the pow function is going to perform the exponentiation, but once the exponentiation moves past ~2^51, it is going to begin losing precision. Ultimately it will hold precision for the top ~16 decimal digits, but all other digits right will be un-guaranteed.
Thus it is a floating point precision / rounding problem.
If you were strictly in unsigned integer land, the number would overflow after (2^64 - 1) = 18,446,744,073,709,551,616. What overflowing means, is that you would never actually see the number go ANY HIGHER than the one provided, infact I beleive the answer would be 0 from this operation. Once the answer goes beyond 2^64, the result register would be zero, and any multiply afterwords would be 0 * 2, which would always result in 0. I would have to try it.
The exact answer (as you show) can be obtained using a standard computer using a multi-precision libary. What these do is to emulate a larger bit computer by concatenating multiple of the smaller data types, and use algorithms to convert and print on the fly. Mathematica is one example of a math engine that implements an arbitrary precision math calculation library.
Floating point types can cover a much larger range than integer types of the same size, but with less precision.
They represent a number as:
a sign bit s to indicate positive or negative;
a mantissa m, a value between 1 and 2, giving a certain number of bits of precision;
an exponent e to indicate the scale of the number.
The value itself is calculated as m * pow(2,e), negated if the sign bit is set.
A standard double has a 53-bit mantissa, which gives about 16 decimal digits of precision.
So, if you need to represent an integer with more than (say) 64 bits of precision, then neither a 64-bit integer nor a 64-bit floating-point type will work. You will need either a large integer type, with as many bits as necessary to represent the values you're using, or (depending on the problem you're solving) some other representation such as a prime factorisation. No such type is available in standard C++, so you'll need to make your own.
If you want to calculate the range of the digits that can be hold by some bytes, it should be (2^(64bits - 1bit)) to (2^(64bits - 1bit) - 1).
Because the left most digit of the variable is for representing sign (+ and -).
So the range for negative side of the number should be : (2^(64bits - 1bit))
and the range for positive side of the number should be : (2^(64bits - 1bit) - 1)
there is -1 for the positive range because of 0(to avoid reputation of counting 0 for each side).
For example if we are calculating 64bits, the range should be ==> approximately [-9.223372e+18] to [9.223372e+18]
I am just wondering if we can make rules for the form of the approximation of real numbers using floating point numbers.
For intance is a floating point number can be terminated by 1.xxx777777 (so terminated by infinite 7 by instance and eventually a random digit at the end ) ?
I believe that there is only this form of floating point number :
1. exact value.
2. value like 1.23900008721.... so where 1.239 is approximated with digits that appears as "noise" but with 0 between the exact value and this noise
3. value like 3.2599995, where 3.26 is approximated by adding 9999.. and a final digit (like 5), so approximated with a floating number just below the real number
4. value like 2.000001, where 2.0 is approximated with a floating number just above the real number
You are thinking in terms of decimal numbers, that is, numbers that can be represented as n*(10^e), with e either positive or negative. These numbers occur naturally in your thought processes for historical reasons having to do with having ten fingers.
Computer numbers are represented in binary, for technical reasons that have to do with an electrical signal being either present or absent.
When you are dealing with smallish integer numbers, it does not matter much that the computer representation does not match your own, because you are thinking of an accurate approximation of the mathematical number, and so is the computer, so by transitivity, you and the computer are thinking about the same thing.
With either very large or very small numbers, you will tend to think in terms of powers of ten, and the computer will definitely think in terms of powers of two. In these cases you can observe a difference between your intuition and what the computer does, and also, your classification is nonsense. Binary floating-point numbers are neither more dense or less dense near numbers that happen to have a compact representation as decimal numbers. They are simply represented in binary, n*(2^p), with p either positive or negative. Many real numbers have only an approximative representation in decimal, and many real numbers have only an approximative representation in binary. These numbers are not the same (binary numbers can be represented in decimal, but not always compactly. Some decimal numbers cannot be represented exactly in binary at all, for instance 0.1).
If you want to understand the computer's floating-point numbers, you must stop thinking in decimal. 1.23900008721.... is not special, and neither is 1.239. 3.2599995 is not special, and neither is 3.26. You think they are special because they are either exactly or close to compact decimal numbers. But that does not make any difference in binary floating-point.
Here are a few pieces of information that may amuse you, since you tagged your question C++:
If you print a double-precision number with the format %.16e, you get a decimal number that converts back to the original double. But it does not always represent the exact value of the original double. To see the exact value of the double in decimal, you must use %.53e. If you write 0.1 in a program, the compiler interprets this as meaning 1.000000000000000055511151231257827021181583404541015625e-01, which is a relatively compact number in binary. Your question speaks of 3.2599995 and 2.000001 as if these were floating-point numbers, but they aren't. If you write these numbers in a program, the compiler will interpret them as 3.25999950000000016103740563266910612583160400390625
and
2.00000100000000013977796697872690856456756591796875. So the pattern you are looking for is simple: the decimal representation of a floating-point number is always 17 significant digits followed by 53-17=36 “noise” digits as you call them. The noise digits are sometimes all zeroes, and the significant digits can end in a bunch of zeroes too.
Floating point is presented by bits. What this means is:
1 bit flipped after the decimal is 0.5 or 1/2
01 bits is 0.25 or 1/4
etc.
This means floating point is always approximately close but not exact if it's not an exact power of 2, when represented in terms of what the machine can handle.
Rational numbers can very accurately be represented by the machine (not precisely of course if not a power of two below the decimal point), but irrational numbers will always carry an error. In terms of this your question is not so much related to c++ as to computer architecture.
I'm learning about the representation of floating-point IEEE 754 numbers, and my textbook says:
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52).
I don't get what "implicit" means here... what's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, regardless of their sign?
Yes, all normalised numbers (other than the zeroes) have that bit set to one (a), so they make it implicit to prevent wasting space storing it.
In other words, they save that bit totally, and reuse it so that it can be used to increase the precision of your numbers.
Keep in mind that this is the first bit of the fraction, not the first bit of the binary pattern. The first bit of the binary pattern is the sign, followed by a few bits of exponent, followed by the fraction itself.
For example, a single precision number is (sign, exponent, fraction):
<1> <--8---> <---------23----------> <- bit widths
s eeeeeeee fffffffffffffffffffffff
If you look at the way the number is calculated, it's:
(-1)sign x 1.fraction x 2exponent-bias
So the fractional part used for calculating that value is 1.fffff...fff (in binary).
(a) There is actually a class of numbers (the denormalised ones and the zeroes) for which that property does not hold true. These numbers all have a biased exponent of zero but the vast majority of numbers follow the rule.
Here is what they are saying. The first non-zero bit is always going to be 1. So there is no need for the binary representation to include that bit, since you know what it is. So they don't. They tell you where that first 1 is, and then they give the bits after it. So there is a 1 that is not explicitly in the binary representation, whose location is implicit from the fact that they told you where it was.
It may also be helpful to note that we are dealing in binary representations of a number. The reason that the first digit of a normalized binary number (that is, no leading zeroes) has to be 1 is that 1 is the only non-zero value available to us in this representation. So, the same would not be true for, say, base-three representations.
(1) I have met several cases where epsilon is added to a non-negative variable to guarantee nonzero value. So I wonder why not add the minimum value that the data type can represent instead of epsilon? What are the difference problems that these two can solve?
(2) Also I notice that the inverse of the maximum value of a double precision type is bigger than its min value, and inverse of its min value is inf, way bigger than its max value. Is it useful to compute the reciprocals of its max and min values?
(3) For a very small positive number of double type, to compute its reciprocal, how small it is when its reciprocal starts to not make sense? Is it better to put an upper bound on the reciprocal? How much is the bound?
Thanks and regards
Epsilon
Epsilon is the smallest value that can be added to 1.0 and produce a result that's distinguishable from 1.0. As Poita_ implied, this is useful for dealing with rounding errors. The situation is pretty simple: a normal floating point number has precision that remains fixed, regardless of the magnitude of the number. To put that slightly differently, it always computes to the same number of significant digits. For example, a typical implementation of double will have around 15 significant digits (which translates to Epsilon = ~1e-15). If you're working with a number in the range 10e-200, the smallest change it can represent will be around 10e-215. If you're working with a number in the range 10e+200, the smallest change it can represent will be around 1e+185.
Meaningful use of Epsilon normally requires scaling it to the range of the numbers you're working with, and using that to define a range you're willing to accept as probably due to rounding errors, so if two numbers fall within that range, you assume they're probably really equal. For example, with Epsilon of 1e-15, you might decide to treat numbers that fall within 1e-14 of each other as equal (i.e. on significant digit has been lost to rounding).
The smallest number that can be represented will normally be dramatically smaller than that. With that same typical double, it's usually going to be around 1e-308. This would be equivalent to Epsilon if you were using fixed point numbers instead of floating point numbers. For example, at one time quite a few people used fixed-point for various graphics. A typical version was a 16-bit bit integer broken into a something like 10 bits before the decimal point and six bits after the decimal point. Such a number can represent numbers from roughly 0 to 1024, with about two (decimal) digits after the decimal point. Alternatively, you can treat it as signed, running from (roughly) -512 to +512, again with around two digits after the decimal point.
In this case, the scaling factor is fixed, so the smallest difference that can be represented between two numbers is also fixed -- i.e. the difference between 1024 and the next larger number is exactly the same as the difference between 0 and the next larger number.
Reciprocals
I'm not sure exactly why you're concerned with computing reciprocals of extremely large or extremely small numbers. IEEE floating point uses denormals, which means numbers close to the limits of the range lose precision. Basically, a number is divided into an exponent and a significand. The exponent contains the magnitude of the number, and the significand contains the significant digits. Each is represented with a specified number of bits. In the usual case, numbers are normalized, which means they're vaguely similar to the scientific notation we all learned in school. In scientific notation, you always adjust the significand and exponent so there's exactly one place before the decimal point, so (for example) 140 becomes 1.4e2, 20030 becomes 2.003e4, and so on.
Think of this as the "normalized" form of a floating point number. Assume, however, that you're limited t an exponent having 2 digits, so it can only run from -99 to +99. Also assume that you can have a maximum of 15 significant digits. Within those limitations, you could produce a number like 0.00001002e-99. This lets you represent a number smaller than 1e-99, at the expense of losing some precision -- instead of 15 digits of precision, you've used 5 digits of your significand to represent magnitude, so you're left with only 10 digits that are really significant.
Except that it's in binary instead of decimal, IEEE floating point works roughly that way.
As you approach the end of the range, the numbers have less and less precision, until (at the very end of the range) you have only one bit of precision left.
If you take that number that has only one bit of precision, and take its reciprocal you get an extremely large number -- but since you only started with one bit of precision, the result can only have one bit of precision as well. Although slightly better than no result at all, it's still pretty close to meaningless. You've reached the limit of what the number of bits can represent; about the only way to cure the problem is to use more bits.
There's not really any one point at which a reciprocal (or other computation) "stops making sense". It's not really a hard line where one result makes sense, and another doesn't. Rather, it's a slope, where one result might have 15 digits of precision, another 10 and a third only 1. What "makes sense" or not is mostly how you interpret that result. To get meaningful results, you need a fair idea of how many digits in your final result are really meaningful.
You need to understand how floating point numbers are represented in the CPU. In the data type, 1 bit is reserved for the sign, i.e. whether it is a positive or negative number, (yes you can have positive and negative 0 in floating point numbers,) then a number of bits is reserved for the significand (or mantissa,) these are the significant digits in the floating point number and finally a number of bits is reserved for the exponent. The value of the floating point number now is:
-1^sign * significand * 2^exponent
This means the smallest number is a very small value, namely the smalles significand with the lowest exponent. The rounding error however is much larger and depends on the magnitude of the number, namely the smallest number with a given exponent. The epsilon is the difference between 1.0 and the next representable larger value. That's why epsilon is used in code that is robust for rounding errors, and really you should scale the epsilon with the magnitude of the numbers you work with if you do it right. The smallest representable value is not really of any significant use normally.
You're seeing the difference between the normalized and denormalized minimum. The problem is that due to the way the significand is used it is possible to make a larger negative exponent than a positive one, say the bit pattern of the significand is all zeros except the last bit, which is one, then the exponent is effectively lowered by the number of bits in the significand. For the maximum you cannot do this, even if you set the significand to all ones, the effective exponent will still only be the exponent that is given. i.e. think of the difference between 0.000001e-10 and 9.999999e+10, the first is much smaller than the second is big. The first is actually 1e-16 while the second is approx 1e+11.
It depends on the precision of the floating point number of course. In the case of double precision, the difference between the maximum and the next smaller value is already huge, (along the lines of 10^292,) so your rounding errors will be very big. If the value is too small you will simply get inf instead, as you already saw. Really, there is no strict answer, it depends entirely on the precision of numbers you need. Given that the rounding error is approx epsilon*magnitude, the reciprocal of (1/epsilon) already has a rounding error of around 1.0 if you need numbers to be accurate to 1e-3 then even epsilon would be too big to divide by.
See these wikipedia pages on IEEE754 and Machine epsilon for some background info.
Epsilons are added to test equality between two values that should be equal, but aren't because of rounding errors. While you could use the smallest positive value for epsilon, it wouldn't be optimal, because it's simply too small. The rounding errors caused by floating point arithmetic almost always exceed that smallest value, so a larger epsilon is needed. How large depends on your desired accuracy.
I don't understand the question. Are the reciprocals useful for what? I can't think of any reason why they would be useful.
In general, dividing by very small values is a bad idea as it will cause very large rounding errors. I'm not sure what you mean by adding an upper bound. Just avoid dividing by small values wherever possible.