Why is this python numerical error happening? - python-2.7

been playing a bit with numerical stability and found this:
>>> sum([1e4,1e20,-1e20])
16384.0
Any ideas why is this happening?

The first two numbers can't be summed accurately because Python's floating point representation doesn't support that many significant decimal digits (it supports 16 digits and your sum requires 17 to be represented accurately). Python is approximating the answer with a single bit in the least significant part of the mantissa.
The difference between the answer you get after adding the third number and the answer you are expecting represents the error in the representation of the intermediate result. After that subtraction, that single bit in the mantissa is all that is left; when the exponent is normalized, you are left with 16384. The fact that it is a power of two tips you off to what is happening.

Related

What exactly happens while multiplying a double value by 10

I have recently been wondering about multiplying floating point numbers.
Let's assume I have a number, for example 3.1415 with a guaranteed 3-digit precision.
Now, I multiply this value by 10, and I get 31.415X, where X is a digit I cannot
define because of the limited precision.
Now, can I be sure, that the five get's carried over to the precise digits?
If a number is proven to be precise up to 3 digits I wouldn't expect this
five to always pop up there, but after studying many cases in c++ i have noticed that it always happens.
From my point of view, however, this doesn't make any sense, because floating point numbers are stored base-two, so multiplication by ten isn't really possible, it will always be mutiplication by 10.something.
I ask this question because I wanted to create a function that calculates how precise a type is. I have came up with something like this:
template <typename T>
unsigned accuracy(){
unsigned acc = 0;
T num = (T)1/(T)3;
while((unsigned)(num *= 10) == 3){
acc++;
num -= 3;
}
return acc;
}
Now, this works for any types I've used it with, but I'm still not sure that the first unprecise digit will always be carried over in an unchanged form.
I'll talk specifically about IEEE754 doubles since that what I think you're asking for.
Doubles are defined as a sign bit, an 11-bit exponent and a 52-bit mantissa, which are concatenated to form a 64-bit value:
sign|exponent|mantissa
Exponent bits are stored in a biased format, which means we store the actual exponent +1023 (for a double). The all-zeros exponent and all-ones exponent are special, so we end up being able to represent an exponent from 2^-1022 to 2^+1023
It's a common misconception that integer values can't be represented exactly by doubles, but we can actually store any integer in [0,2^53) exactly by setting the mantissa and exponent properly, in fact the range [2^52,2^53) can only store the integer values in that range. So 10 is easily stored exactly in a double.
When it comes to multiplying doubles, we effectively have two numbers of this form:
A = (-1)^sA*mA*2^(eA-1023)
B = (-1)^sB*mB*2^(eB-1023)
Where sA,mA,eA are the sign,mantissa and exponent for A (and similarly for B).
If we multiply these:
A*B = (-1)^(sA+sB)*(mA*mB)*2^((eA-1023)+(eB-1023))
We can see that we merely sum the exponents, and then multiply the mantissas. This actually isn't bad for precision! We might overflow the exponent bits (and thus get an infinity), but other than that we just have to round the intermediate mantissa result back to 52 bits, but this will at worst only change the least significant bit in the new mantissa.
Ultimately, the error you'll see will be proportional to the magnitude of the result. But, doubles have an error proportional to their magnitude anyways so this is really as safe as we can get. The way to approximate the error in your number is as |magnitude|*2^-53. In your case, since 10 is exact, the only error will come in the representation of pi. It will have an error of ~2^-51 and thus the result will as well.
As a rule of thumb, I consider doubles to have ~15 digits of decimal precision when thinking about precision concerns.
Lets assume that for single precision 3.1415 is
0x40490E56
in IEEE 754 format which is a very popular but not the only format used.
01000000010010010000111001010110
0 10000000 10010010000111001010110
so the binary portion is 1.10010010000111001010110
110010010000111001010110
1100 1001 0000 1110 0101 0110
0xC90E56 * 10 = 0x7DA8F5C
Just like in grade school with decimal you worry about the decimal(/binary) point later, you just do a multiply.
01111.10110101000111101011100
to get into IEEE 754 format it needs to be shifted to a 1.mantissa format
so that is a shift of 3
1.11110110101000111101011
but look at the three bits chopped off 100 specifically the 1 so this means depending on the rounding mode you round, in this case lets round up
1.11110110101000111101100
0111 1011 0101 0001 1110 1100
0x7BA1EC
now if I already computed the answer:
0x41FB51EC
0 10000011 11110110101000111101100
we moved the point 3 and the exponent reflects that, the mantissa matches what we computed. we did lose one of the original non-zero bits off the right, but is that too much loss?
double, extended, work the same way just more exponent and mantissa bits, more precision and range. but at the end of the day it is nothing more than what we learned in grade school as far as the math goes, the format requires 1.mantissa so you have to use your grade school math to adjust the exponent of the base to get it in that form.
Now, can I be sure, that the five get's carried over to the precise digits?
In general, no. You can only be sure about the precision of output when you know the exact representation format used by your system, and know that the correct output is exactly representable in that format.
If you want precise result for any rational input, then you cannot use finite precision.
It seems that your function attempts to calculate how accurately the floating point type can represent 1/3. This accuracy is not useful for evaluating accuracy of representing other numbers.
because floating point numbers are stored base-two
While very common, this is not universally true. Some systems use base-10 for example.

IEEE754 float point substraction precision lost

Here is the subtraction
First number
Decimal 3.0000002
Hexadecimal 0x4040001
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0001]
substract second number:
Decimal 3.000000
Hexadecimal 0x4040000
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0000]
==========================================
At this situation, the exponent is already same, we just need to substract the mantissa. We know in IEEE754, there is a hiding bit 1 in front of mantissa. Therefore, the result mantissa should be:
Mantissa_1[1100_0000_0000_0000_0000_0001] - Mantissa_2[1100_0000_0000_0000_0000_0000]
which equal to
Mantissa_Rst = [0000_0000_0000_0000_0000_0001]
But this number is not normalized, Because of the first hiding bit is not 1. Thus we shift the Mantissa_Rst right 23 times, and the exponent minuses 23 at the same time.
Then we have the result value
Hexadecimal 0x4040000
Binary: Sign[0], Exponent[0110_1000], Mantissa[000_0000_0000_0000_0000_0000].
32 bits total, no rounding needed.
Notice that in the mantissa region, there still is a hidden 1.
If my calculations were correct, then converting result to decimal number is 0.00000023841858, comparing with the real result 0.0000002, I still think that is not very precise.
So the question is, are my calculations wrong? or actually this is a real situation and happens all the time in computer?
The inaccuracy already starts with your input. 3.0000002 is a fraction with a prime factor of five in the denominator, so its "decimal" expansion in base 2 is periodic. No amount of mantissa bits will suffice to represent it exactly. The float you give actually has the value 3.0000002384185791015625 (this is exact). Yes, this happens all the time.
Don't despair, though! Base ten has the same problem (for example 1/3). It isn't a problem. Well, it is for some people, but luckily there are other number types available for their needs. Floating point numbers have many advantages, and slight rounding error is irrelevant for many applications, for example when not even your inputs are perfectly accurate measurements of what you're interested in (a lot of scientific computing and simulation). Also remember that 64-bit floats also exist. Additionally, the error is bounded: With the best possible rounding, your result will be within 0.5 units in the last place removed from the infinite-precision result. For a 32-bit float of the magnitude as your example, this is approximately 2^-25, or 3 * 10^-8. This gets worse and worse as you do additional operations that have to round, but with careful numeric analysis and the right algorithms, you can get a lot of milage out of them.
Whenever x/2 ≤ y ≤ 2x, the calculation x - y is exact which means there is no rounding error whatsoever. That is also the case in your example.
You just made the wrong assumption that you could have a floating point number that is equal to 3.0000002. You can't. The type "float" can only ever represent integers less than 2^24, multiplied by a power of two. 3.0000002 is not such a number, therefore it is rounded to the nearest floating point number, which is closer to 3.00000023841858. Subtracting 3 calculates the difference exactly and gives a result close to 0.00000023841858.

How can 8 bytes hold 302 decimal digits? (Euler challenge 16)

c++ pow(2,1000) is normaly to big for double, but it's working. why?
So I've been learning C++ for couple weeks but the datatypes are still confusing me.
One small minor thing first: the code that 0xbadc0de posted in the other thread is not working for me.
First of all pow(2,1000) gives me this more than once instance of overloaded function "pow" matches the argument list.
I fixed it by changing pow(2,1000) -> pow(2.0,1000)
Seems fine, i run it and get this:
http://i.stack.imgur.com/bbRat.png
Instead of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
it is missing a lot of the values, what might be cause that?
But now for the real problem.
I'm wondering how can 302 digits long number fit a double (8 bytes)?
0xFFFFFFFFFFFFFFFF = 18446744073709551616 so how can the number be larger than that?
I think it has something to do with the floating point number encoding stuff.
Also what is the largest number that can possibly be stored in 8 bytes if it's not 0xFFFFFFFFFFFFFFFF?
Eight bytes contain 64 bits of information, so you can store 2^64 ~ 10^20 unique items using those bits. Those items can easily be interpreted as the integers from 0 to 2^64 - 1. So you cannot store 302 decimal digits in 8 bytes; most numbers between 0 and 10^303 - 1 cannot be so represented.
Floating point numbers can hold approximations to numbers with 302 decimal digits; this is because they store the mantissa and exponent separately. Numbers in this representation store a certain number of significant digits (15-16 for doubles, if I recall correctly) and an exponent (which can go into the hundreds, of memory serves). However, if a decimal is X bytes long, then it can only distinguish between 2^(8X) different values... unlikely enough for exactly representing integers with 302 decimal digits.
To represent such numbers, you must use many more bits: about 1000, actually, or 125 bytes.
It's called 'floating point' for a reason. The datatype contains a number in the standard sense, and an exponent which says where the decimal point belongs. That's why pow(2.0, 1000) works, and it's why you see a lot of zeroes. A floating point (or double, which is just a bigger floating point) number contains a fixed number of digits of precision. All the remaining digits end up being zero. Try pow(2.0, -1000) and you'll see the same situation in reverse.
The number of decimal digits of precision in a float (32 bits) is about 7, and for a double (64 bits) it's about 16 decimal digits.
Most systems nowadays use IEEE floating point, and I just linked to a really good description of it. Also, the article on the specific standard IEEE 754-1985 gives a detailed description of the bit layouts of various sizes of floating point number.
2.0 ^ 1000 mathematically will have a decimal (non-floating) output. IEEE floating point numbers, and in your case doubles (as the pow function takes in doubles and outputs a double) have 52 bits of the 64 bit representation allocated to the mantissa. If you do the math, 2^52 = 4,503,599,627,370,496. Because a floating point number can represent positive and negative numbers, really the integer representation will be ~ 2^51 = 2,251,799,813,685,248. Notice there are 16 digits. there are 16 quality (non-zero) digits in the output you see.
Essentially the pow function is going to perform the exponentiation, but once the exponentiation moves past ~2^51, it is going to begin losing precision. Ultimately it will hold precision for the top ~16 decimal digits, but all other digits right will be un-guaranteed.
Thus it is a floating point precision / rounding problem.
If you were strictly in unsigned integer land, the number would overflow after (2^64 - 1) = 18,446,744,073,709,551,616. What overflowing means, is that you would never actually see the number go ANY HIGHER than the one provided, infact I beleive the answer would be 0 from this operation. Once the answer goes beyond 2^64, the result register would be zero, and any multiply afterwords would be 0 * 2, which would always result in 0. I would have to try it.
The exact answer (as you show) can be obtained using a standard computer using a multi-precision libary. What these do is to emulate a larger bit computer by concatenating multiple of the smaller data types, and use algorithms to convert and print on the fly. Mathematica is one example of a math engine that implements an arbitrary precision math calculation library.
Floating point types can cover a much larger range than integer types of the same size, but with less precision.
They represent a number as:
a sign bit s to indicate positive or negative;
a mantissa m, a value between 1 and 2, giving a certain number of bits of precision;
an exponent e to indicate the scale of the number.
The value itself is calculated as m * pow(2,e), negated if the sign bit is set.
A standard double has a 53-bit mantissa, which gives about 16 decimal digits of precision.
So, if you need to represent an integer with more than (say) 64 bits of precision, then neither a 64-bit integer nor a 64-bit floating-point type will work. You will need either a large integer type, with as many bits as necessary to represent the values you're using, or (depending on the problem you're solving) some other representation such as a prime factorisation. No such type is available in standard C++, so you'll need to make your own.
If you want to calculate the range of the digits that can be hold by some bytes, it should be (2^(64bits - 1bit)) to (2^(64bits - 1bit) - 1).
Because the left most digit of the variable is for representing sign (+ and -).
So the range for negative side of the number should be : (2^(64bits - 1bit))
and the range for positive side of the number should be : (2^(64bits - 1bit) - 1)
there is -1 for the positive range because of 0(to avoid reputation of counting 0 for each side).
For example if we are calculating 64bits, the range should be ==> approximately [-9.223372e+18] to [9.223372e+18]

is it possible to categorize the different forms of the approximation of floating point numbers

I am just wondering if we can make rules for the form of the approximation of real numbers using floating point numbers.
For intance is a floating point number can be terminated by 1.xxx777777 (so terminated by infinite 7 by instance and eventually a random digit at the end ) ?
I believe that there is only this form of floating point number :
1. exact value.
2. value like 1.23900008721.... so where 1.239 is approximated with digits that appears as "noise" but with 0 between the exact value and this noise
3. value like 3.2599995, where 3.26 is approximated by adding 9999.. and a final digit (like 5), so approximated with a floating number just below the real number
4. value like 2.000001, where 2.0 is approximated with a floating number just above the real number
You are thinking in terms of decimal numbers, that is, numbers that can be represented as n*(10^e), with e either positive or negative. These numbers occur naturally in your thought processes for historical reasons having to do with having ten fingers.
Computer numbers are represented in binary, for technical reasons that have to do with an electrical signal being either present or absent.
When you are dealing with smallish integer numbers, it does not matter much that the computer representation does not match your own, because you are thinking of an accurate approximation of the mathematical number, and so is the computer, so by transitivity, you and the computer are thinking about the same thing.
With either very large or very small numbers, you will tend to think in terms of powers of ten, and the computer will definitely think in terms of powers of two. In these cases you can observe a difference between your intuition and what the computer does, and also, your classification is nonsense. Binary floating-point numbers are neither more dense or less dense near numbers that happen to have a compact representation as decimal numbers. They are simply represented in binary, n*(2^p), with p either positive or negative. Many real numbers have only an approximative representation in decimal, and many real numbers have only an approximative representation in binary. These numbers are not the same (binary numbers can be represented in decimal, but not always compactly. Some decimal numbers cannot be represented exactly in binary at all, for instance 0.1).
If you want to understand the computer's floating-point numbers, you must stop thinking in decimal. 1.23900008721.... is not special, and neither is 1.239. 3.2599995 is not special, and neither is 3.26. You think they are special because they are either exactly or close to compact decimal numbers. But that does not make any difference in binary floating-point.
Here are a few pieces of information that may amuse you, since you tagged your question C++:
If you print a double-precision number with the format %.16e, you get a decimal number that converts back to the original double. But it does not always represent the exact value of the original double. To see the exact value of the double in decimal, you must use %.53e. If you write 0.1 in a program, the compiler interprets this as meaning 1.000000000000000055511151231257827021181583404541015625e-01, which is a relatively compact number in binary. Your question speaks of 3.2599995 and 2.000001 as if these were floating-point numbers, but they aren't. If you write these numbers in a program, the compiler will interpret them as 3.25999950000000016103740563266910612583160400390625
and
2.00000100000000013977796697872690856456756591796875. So the pattern you are looking for is simple: the decimal representation of a floating-point number is always 17 significant digits followed by 53-17=36 “noise” digits as you call them. The noise digits are sometimes all zeroes, and the significant digits can end in a bunch of zeroes too.
Floating point is presented by bits. What this means is:
1 bit flipped after the decimal is 0.5 or 1/2
01 bits is 0.25 or 1/4
etc.
This means floating point is always approximately close but not exact if it's not an exact power of 2, when represented in terms of what the machine can handle.
Rational numbers can very accurately be represented by the machine (not precisely of course if not a power of two below the decimal point), but irrational numbers will always carry an error. In terms of this your question is not so much related to c++ as to computer architecture.

How to convert float to double(both stored in IEEE-754 representation) without losing precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The binary number above is stored in literal string.
The question is, how can I convert this string into IEEE-754 double precision representation(somewhat like the following one, but the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following algorithm to convert the first string back to decimal number first, but it loses precision.
num in decimal = (sign) * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize maybe I didn't get the question clearly expressed.
What I mean is that I don't know the true value 23.85, I only got the first string and I want to convert it to double precision representation without precision loss.
Well: keep the sign bit, rewrite the exponent (minus old bias, plus new bias), and pad the mantissa with zeros on the right...
(As #Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
IEEE-754 (and floating point in general) cannot represent periodic binary decimals with full precision. Not even when they, in fact, are rational numbers with relatively small integer numerator and denominator. Some languages provide a rational type that may do it (they are the languages that also support unbounded precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represent an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want to do cannot be done as the single precision number does not have the information you would need (you have NO WAY to know if the number is in fact periodic or simply looks like being because there happens to be a repetition).
First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accummulated additional inaccuracies could distort the end result even more.
It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D.Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L.Steele Jr. and Jon L.White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c. The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wrecks havoc on the value: Trying to format the new number will in many cases not yield the decimal original because the float padded with zeros is different from the closest double Bellerophon would create and, thus, Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested convert the float to a string and this string into a double. This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number but for the range of value I'm interested in and which I want to store accurately in a float, this works.
One reasonable approach to avoid this entirely issue is to use decimal floating point values as described for C++ in the Decimal TR in the first place. Unfortunately, these are not, yet, part of the standard but I have submitted a proposal to the C++ standardization committee to get this changed.