How can 8 bytes hold 302 decimal digits? (Euler challenge 16) - c++

C++ pow(2,1000) is normally too big for a double, but it's working. Why?
So I've been learning C++ for a couple of weeks, but the data types are still confusing me.
One small, minor thing first: the code that 0xbadc0de posted in the other thread is not working for me.
First of all, pow(2,1000) gives me this error: more than one instance of overloaded function "pow" matches the argument list.
I fixed it by changing pow(2,1000) -> pow(2.0,1000)
Seems fine, I run it and get this:
http://i.stack.imgur.com/bbRat.png
Instead of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
it is missing a lot of the digits. What might be causing that?
But now for the real problem.
I'm wondering how a 302-digit number can fit in a double (8 bytes)?
0xFFFFFFFFFFFFFFFF = 18446744073709551615, so how can the number be larger than that?
I think it has something to do with the floating point number encoding stuff.
Also what is the largest number that can possibly be stored in 8 bytes if it's not 0xFFFFFFFFFFFFFFFF?

Eight bytes contain 64 bits of information, so you can store 2^64 ≈ 1.8 * 10^19 unique items using those bits. Those items can easily be interpreted as the integers from 0 to 2^64 - 1. So you cannot store 302 decimal digits in 8 bytes; most numbers between 0 and 10^302 - 1 cannot be so represented.
Floating point numbers can hold approximations to numbers with 302 decimal digits; this is because they store the mantissa and exponent separately. Numbers in this representation store a certain number of significant digits (15-16 for doubles, if I recall correctly) and an exponent (which can go into the hundreds, if memory serves). However, if a type is X bytes long, it can only distinguish between 2^(8X) different values, which is nowhere near enough to exactly represent every integer with 302 decimal digits.
To represent such numbers, you must use many more bits: about 1000, actually, or 125 bytes.
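Here is a minimal sketch of that limit, assuming IEEE 754 doubles (which is what virtually every current platform uses):

#include <cstdio>
int main() {
    double exact = 9007199254740992.0;   // 2^53: the last point before gaps between representable integers appear
    double plus1 = exact + 1.0;          // 2^53 + 1 cannot be represented as a double
    printf("%.0f\n", exact);             // 9007199254740992
    printf("%.0f\n", plus1);             // 9007199254740992 again: the +1 was rounded away
}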

It's called 'floating point' for a reason. The data type contains a number in the usual sense, plus an exponent which says where the (binary) point belongs. That's why pow(2.0, 1000) works, and it's why you see a lot of zeroes. A float (or a double, which is just a bigger float) holds a fixed number of digits of precision. All the remaining digits end up being zero. Try pow(2.0, -1000) and you'll see the same situation in reverse.
The number of decimal digits of precision in a float (32 bits) is about 7, and for a double (64 bits) it's about 16 decimal digits.
Most systems nowadays use IEEE floating point, and I just linked to a really good description of it. Also, the article on the specific standard IEEE 754-1985 gives a detailed description of the bit layouts of various sizes of floating point number.
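If you don't want to rely on memory, you can ask the implementation directly; a small sketch (digits10 is the number of decimal digits guaranteed to survive a round trip through the type):

#include <iostream>
#include <limits>
int main() {
    std::cout << std::numeric_limits<float>::digits10  << '\n';  // typically 6
    std::cout << std::numeric_limits<double>::digits10 << '\n';  // typically 15
    std::cout << std::numeric_limits<double>::digits   << '\n';  // typically 53 (bits of significand)
}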

Mathematically, 2.0 ^ 1000 has an exact integer value. IEEE floating point numbers, and in your case doubles (as the pow function takes in doubles and outputs a double), have 52 bits of the 64-bit representation allocated to the mantissa, plus an implicit leading 1, giving 53 bits of significand. If you do the math, 2^53 = 9,007,199,254,740,992. Notice there are 16 digits; there are about 16 quality (reliable) digits in the output you see.
Essentially the pow function is going to perform the exponentiation, but once the result moves past ~2^53 it begins losing the ability to represent every integer exactly. Ultimately it will hold precision for the top ~16 decimal digits, but all digits to the right of those are not guaranteed.
Thus it is a floating point precision / rounding problem.
If you were strictly in unsigned integer land, the computation would overflow after (2^64 - 1) = 18,446,744,073,709,551,615. What overflowing means is that you would never actually see the number go ANY HIGHER than the one provided: the value simply wraps around modulo 2^64. In fact I believe the answer would be 0 from this operation: once the running product becomes a multiple of 2^64, the stored result is zero, and any multiplication afterwards is 0 * 2, which always results in 0. You can try it yourself.
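Here is a quick way to try it, assuming a 64-bit unsigned type; unsigned overflow is defined to wrap modulo 2^64:

#include <cstdint>
#include <cstdio>
int main() {
    uint64_t v = 1;
    for (int i = 0; i < 1000; ++i)
        v *= 2;                               // wraps modulo 2^64 on overflow
    printf("%llu\n", (unsigned long long)v);  // 0: after 64 doublings the value is a multiple of 2^64 and stays 0
}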
The exact answer (as you show) can be obtained on an ordinary computer using a multi-precision library. What these do is emulate a larger-word machine by chaining together multiple of the smaller data types, and use algorithms to convert and print on the fly. Mathematica is one example of a math engine that implements an arbitrary precision math calculation library.

Floating point types can cover a much larger range than integer types of the same size, but with less precision.
They represent a number as:
a sign bit s to indicate positive or negative;
a mantissa m, a value between 1 and 2, giving a certain number of bits of precision;
an exponent e to indicate the scale of the number.
The value itself is calculated as m * pow(2,e), negated if the sign bit is set.
A standard double has a 53-bit mantissa, which gives about 16 decimal digits of precision.
So, if you need to represent an integer with more than (say) 64 bits of precision, then neither a 64-bit integer nor a 64-bit floating-point type will work. You will need either a large integer type, with as many bits as necessary to represent the values you're using, or (depending on the problem you're solving) some other representation such as a prime factorisation. No such type is available in standard C++, so you'll need to make your own.
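As a rough sketch of the "make your own" route (not efficient, but exact): store the number as a vector of decimal digits and double it 1000 times. This is only an illustration, not a full big-integer class:

#include <cstdio>
#include <vector>
int main() {
    std::vector<int> digits{1};               // least significant decimal digit first
    for (int i = 0; i < 1000; ++i) {          // compute 2^1000 by repeated doubling
        int carry = 0;
        for (int& d : digits) {
            int v = d * 2 + carry;
            d = v % 10;
            carry = v / 10;
        }
        if (carry) digits.push_back(carry);
    }
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        printf("%d", *it);                    // prints the exact 302-digit value
    printf("\n");
}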

If you want to calculate the range of values that can be held in some number of bytes as a signed integer, it is -(2^(64bits - 1bit)) to (2^(64bits - 1bit) - 1).
That is because the leftmost bit of the variable is used for representing the sign (+ and -).
So the range for the negative side of the number is: -(2^(64bits - 1bit))
and the range for the positive side of the number is: (2^(64bits - 1bit) - 1)
There is a -1 for the positive range because of 0 (to avoid counting 0 on both sides).
For example, if we are calculating for 64 bits, the range is approximately [-9.223372e+18] to [9.223372e+18].
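For example, a minimal check of the 64-bit signed range (assuming the usual two's complement int64_t):

#include <cstdint>
#include <iostream>
#include <limits>
int main() {
    std::cout << std::numeric_limits<int64_t>::min() << '\n';  // -9223372036854775808, i.e. -(2^63)
    std::cout << std::numeric_limits<int64_t>::max() << '\n';  //  9223372036854775807, i.e. 2^63 - 1
}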

Related

What exactly happens while multiplying a double value by 10

I have recently been wondering about multiplying floating point numbers.
Let's assume I have a number, for example 3.1415 with a guaranteed 3-digit precision.
Now, I multiply this value by 10, and I get 31.415X, where X is a digit I cannot
define because of the limited precision.
Now, can I be sure that the five gets carried over to the precise digits?
If a number is proven to be precise up to 3 digits I wouldn't expect this
five to always pop up there, but after studying many cases in C++ I have noticed that it always happens.
From my point of view, however, this doesn't make any sense, because floating point numbers are stored base-two, so multiplication by ten isn't really possible; it will always be multiplication by 10.something.
I ask this question because I wanted to create a function that calculates how precise a type is. I have come up with something like this:
template <typename T>
unsigned accuracy() {
    unsigned acc = 0;
    T num = (T)1 / (T)3;                 // 0.333... in type T
    while ((unsigned)(num *= 10) == 3) { // shift one digit left; check the leading digit is still 3
        acc++;
        num -= 3;                        // drop the leading 3 and repeat
    }
    return acc;
}
Now, this works for any types I've used it with, but I'm still not sure that the first unprecise digit will always be carried over in an unchanged form.
I'll talk specifically about IEEE754 doubles since that what I think you're asking for.
Doubles are defined as a sign bit, an 11-bit exponent and a 52-bit mantissa, which are concatenated to form a 64-bit value:
sign|exponent|mantissa
Exponent bits are stored in a biased format, which means we store the actual exponent plus 1023 (for a double). The all-zeros and all-ones exponents are special, so we end up being able to represent scale factors from 2^-1022 to 2^+1023.
It's a common misconception that integer values can't be represented exactly by doubles, but we can actually store any integer in [0,2^53) exactly by setting the mantissa and exponent properly; in fact, every double in the range [2^52,2^53) is an integer. So 10 is easily stored exactly in a double.
When it comes to multiplying doubles, we effectively have two numbers of this form:
A = (-1)^sA*mA*2^(eA-1023)
B = (-1)^sB*mB*2^(eB-1023)
Where sA,mA,eA are the sign,mantissa and exponent for A (and similarly for B).
If we multiply these:
A*B = (-1)^(sA+sB)*(mA*mB)*2^((eA-1023)+(eB-1023))
We can see that we merely sum the exponents, and then multiply the mantissas. This actually isn't bad for precision! We might overflow the exponent bits (and thus get an infinity), but other than that we just have to round the intermediate mantissa result back to 52 bits, but this will at worst only change the least significant bit in the new mantissa.
Ultimately, the error you'll see will be proportional to the magnitude of the result. But doubles have an error proportional to their magnitude anyway, so this is really as safe as we can get. A good approximation of the error in a number is |magnitude|*2^-53. In your case, since 10 is exact, the only error comes from the representation of 3.1415. It will have an error of ~2^-51, and thus the result will as well.
As a rule of thumb, I consider doubles to have ~15 digits of decimal precision when thinking about precision concerns.
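A small sketch of this in practice, assuming IEEE 754 doubles (the exact digits printed may vary slightly with your libc's formatting):

#include <cmath>
#include <cstdio>
int main() {
    double x = 3.1415;                 // stored as the nearest double, not exactly 3.1415
    double y = x * 10.0;               // 10 is exact, so only one extra rounding can happen here
    printf("%.17g\n", x);              // the stored approximation, correct to ~16 significant digits
    printf("%.17g\n", y);              // something very close to 31.415, still correct to ~16 digits
    printf("%g\n", std::nextafter(y, 1000.0) - y);  // the spacing (one ulp) at this magnitude
}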
Let's assume that for single precision 3.1415 is
0x40490E56
in IEEE 754 format which is a very popular but not the only format used.
01000000010010010000111001010110
0 10000000 10010010000111001010110
so the binary portion is 1.10010010000111001010110
110010010000111001010110
1100 1001 0000 1110 0101 0110
0xC90E56 * 10 = 0x7DA8F5C
Just like in grade school with decimal, you worry about the decimal (here, binary) point later; you just do the multiply.
01111.10110101000111101011100
to get into IEEE 754 format it needs to be shifted to a 1.mantissa format
so that is a shift of 3
1.11110110101000111101011
but look at the three bits chopped off, 100, and specifically the leading 1. This means that, depending on the rounding mode, you round; in this case let's round up
1.11110110101000111101100
0111 1011 0101 0001 1110 1100
0x7B51EC
Now compare against the answer computed directly:
0x41FB51EC
0 10000011 11110110101000111101100
We moved the point by 3 and the exponent reflects that; the mantissa matches what we computed. We did lose one of the original non-zero bits off the right, but is that too much loss?
Double and extended precision work the same way, just with more exponent and mantissa bits, so more precision and range. But at the end of the day it is nothing more than what we learned in grade school as far as the math goes; the format requires 1.mantissa, so you have to use your grade-school math to adjust the exponent of the base to get it into that form.
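If you want to poke at those fields yourself, here is a minimal sketch; it assumes the usual 32-bit IEEE 754 float layout described above:

#include <cstdint>
#include <cstdio>
#include <cstring>
int main() {
    float f = 3.1415f;
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);          // reinterpret the float's bit pattern
    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF;      // biased by 127
    unsigned mantissa = bits & 0x7FFFFF;          // 23 stored bits, implicit leading 1 not included
    printf("0x%08X  sign=%u  exponent=%u (unbiased %d)  mantissa=0x%06X\n",
           (unsigned)bits, sign, exponent, (int)exponent - 127, mantissa);
    // for 3.1415f this should print 0x40490E56, matching the walkthrough above
}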
Now, can I be sure that the five gets carried over to the precise digits?
In general, no. You can only be sure about the precision of output when you know the exact representation format used by your system, and know that the correct output is exactly representable in that format.
If you want precise result for any rational input, then you cannot use finite precision.
It seems that your function attempts to calculate how accurately the floating point type can represent 1/3. This accuracy is not useful for evaluating accuracy of representing other numbers.
because floating point numbers are stored base-two
While very common, this is not universally true. Some systems use base-10 for example.

Errors multiplying large doubles

I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like
1000000000000000000000*1000000000000000000000
I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.
Floating point numbers represent values with a fixed-size representation. A double can hold about 16 decimal digits in a form from which those decimal digits can be recovered (internally it normally stores the value using base 2, which means it cannot exactly represent most fractional decimal values). If the number of digits is exceeded, the value will be rounded appropriately. The upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more than 16 decimal digits, either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers bigger than 1e16), the formatting will conjure up more digits: it will accurately print the internally held binary value, which may contain many more non-zero digits than the 16 you can trust.
If you want to compute with large values accurately, you'll need some variable sized representation. Since your values are integers a big integer representation might work. These will typically be a lot slower to compute with than double.
A double stores 53 bits of precision. This is about 15 decimal digits. Your problem is that a double cannot store the number of digits you are trying to store. Digits after the 15th decimal digit will not be accurate.
That's not an error. It's exactly because of how floating-point types are represented, as the result is precise to double precision.
Floating-point types in computers are written in the form (-1)^sign * mantissa * 2^exp, so they only have broader ranges, not infinite precision. They're only accurate to the mantissa precision, and the result of every operation is rounded accordingly. The double type is most commonly implemented as IEEE 754 64-bit double precision with 53 bits of mantissa, so it is correct to log10(2^53) ≈ 15.95 decimal digits. Doing 1e21*1e21 produces 1e42, which, when rounded to the closest value representable in double precision, gives the value that you saw. If you round that to 16 digits it's exactly the same as 1e42.
If you need more precision, use long double (where it is wider than double). If you only work with integers, then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) has much greater integer precision (64/128 bits compared to 53 bits). If you need even more precision, use an arbitrary-precision arithmetic library instead, such as GMP.
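A minimal demonstration, assuming IEEE 754 doubles (the long digit string is just the exact value of the double nearest to 1e42, the same one the question shows):

#include <cstdio>
int main() {
    double a = 1e21;         // exactly representable: 5^21 fits comfortably in 53 bits
    double r = a * a;        // mathematically 1e42, rounded to the nearest double
    printf("%.0f\n", r);     // 1000000000000000000004341624882808674582528
    printf("%.15e\n", r);    // 1.000000000000000e+42 when shown to 16 significant digits
}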

Float cast reduces value by 1

When casting (float)33554329L, the result is 33554328. If the number is then cast back to a long, the value stays at 33554328. Does anyone have an explanation for this?
Using VS2005 in C++ [non managed]
A 32-bit float has 23 bits for the mantissa, which gives 8,388,608 distinct values. This means that the accuracy is around 7 significant decimal digits. Your number has 8 significant decimal digits, so you see the loss of accuracy in the last significant digit.
Here's More information on float representation
Double precision is 64 bits and has 52 bits for the mantissa, which gives 4,503,599,627,370,496 values (a 16-digit number), and thus roughly 15-16 decimal digits of accuracy.
A decimal type stores numbers in base ten, so values like 0.1 are represented exactly up to its precision. C# has one, but unfortunately it is not a primitive type in C++. You can probably find a third-party library that implements one in C++.
Read this:
"What every computer scientist should know about floating point"
http://www.validlab.com/goldberg/paper.pdf
Floats have low precision for large numbers.
A float's precision is best near zero; the further you go from zero, the less precision you have. Above 2^24 (about 16.8 million) a float is no longer precise enough to hold every integer, so it rounds to the closest value it can represent.
Solution: use a double (or a decimal representation).
The accuracy of representation of various floating point types varies based on their size. For a 32 bit float you can expect approximately 7 digits of precision. For double's it is approximately 16 digits.
I highly recommend reading up on floating point representations and the various advantages and disadvantages. It'll save you a lot of hassle in the long run, especially when things like comparisons don't work as you expect.
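For reference, here is a minimal reproduction of the effect in the question; the printed values assume IEEE 754 single precision:

#include <cstdio>
int main() {
    long n = 33554329L;           // needs 25 significant bits, more than a float's 24-bit significand
    float f = (float)n;           // rounded to the nearest representable float
    long back = (long)f;
    printf("%ld -> %.1f -> %ld\n", n, f, back);   // 33554329 -> 33554328.0 -> 33554328
}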

floating point issue

I have a floating point value of 0.1 entering from the UI.
But while converting that string to a float I am getting 0.10...01. The problem is the appended non-zero digit. How do I tackle this problem?
Thanks,
iSight
You need to do some background reading on floating point representations: http://docs.sun.com/source/806-3568/ncg_goldberg.html.
Given that computers are essentially on-off switches, they store a rounded answer, and they work in base two, not the base ten we humans seem to like.
Your options are to:
display it back with fewer digits so you round back to base 10 (check out the Standard library's <iomanip> header and setprecision; see the sketch after this list)
store the number in some actual decimal-capable object - you'll find plenty of C++ classes to do this via google, but none are provided in the Standard, nor in boost last I looked
convert the input from a string directly to an integral number of some smaller unit (like thousandths), avoiding the rounding.
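Here is a minimal sketch of the first option, assuming IEEE 754 single precision for float:

#include <iomanip>
#include <iostream>
int main() {
    float f = 0.1f;                                   // nearest float to 0.1, not exactly 0.1
    std::cout << std::setprecision(20) << f << '\n';  // shows the stored approximation, ~0.100000001490116...
    std::cout << std::setprecision(2)  << f << '\n';  // 0.1: rounding back to a few digits hides the error
}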
0.1 (decimal) = 0.00011001100110011... (binary)
So, in general, a number you can represent with a finite number of decimal digits may not be representable with a finite number of bits. But floating point numbers only store the N most significant bits. So, conversions between a decimal string and a "binary" float usually involve rounding.
However a lossless roundtrip conversion decimal string -> double -> decimal string is possible if you restrict yourself to decimal strings with at most 15 significant digits (assuming IEEE 754 64 bit floats). This includes the last conversion. You need to produce a string from the double with at most 15 significant digits.
It is also possible to make the roundtrip double -> string -> double lossless. But here you may need decimal strings with 17 decimal digits to make it work (again assuming IEEE-754 64bit floats).
The best site I've ever seen that explains why some numbers can't be represented exactly is Harald Schmidt's IEEE754 Converter site.
It's an online tool for showing representations of IEEE754 single precision values and I liked it so much, I wrote my own Java app to do it (and double precision as well).
Bottom line, there are only about four billion different 32-bit values you can have but there are an infinite number of real values between any two different values. So you have a problem with precision. That's something you'll have to get used to.
If you want more precision and/or better type for decimal values, you can either:
switch to a higher number of bits.
use a decimal type
use a big-number library like GMP (although I refuse to use this in production code since I discovered it doesn't handle memory shortages elegantly).
Alternatively, you can use the inaccurate values (their error rates are very low, something like one part in sixteen million for floats, from memory) and just print them out with less precision. Printing out 0.10000000145 to two decimal places will get you 0.10.
You would have to do millions and millions of additions for the error to accumulate noticeably. Less of other operations of course but still a lot.
As to why you're getting that value, 0.1 is stored in IEEE754 single precision format as follows (sign, exponent and mantissa):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm 1/n
0 01111011 10011001100110011001101
||||||||||||||||||||||+- 8388608
|||||||||||||||||||||+-- 4194304
||||||||||||||||||||+--- 2097152
|||||||||||||||||||+---- 1048576
||||||||||||||||||+----- 524288
|||||||||||||||||+------ 262144
||||||||||||||||+------- 131072
|||||||||||||||+-------- 65536
||||||||||||||+--------- 32768
|||||||||||||+---------- 16384
||||||||||||+----------- 8192
|||||||||||+------------ 4096
||||||||||+------------- 2048
|||||||||+-------------- 1024
||||||||+--------------- 512
|||||||+---------------- 256
||||||+----------------- 128
|||||+------------------ 64
||||+------------------- 32
|||+-------------------- 16
||+--------------------- 8
|+---------------------- 4
+----------------------- 2
The sign is positive, that's pretty easy.
The exponent is 64+32+16+8+2+1 = 123, minus the 127 bias = -4, so the multiplier is 2^-4 or 1/16.
The mantissa is chunky. It consists of 1 (the implicit base) plus (for all those bits, with each being worth 1/(2^n) as n starts at 1 and increases to the right), {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the multiplier, you get 0.100000001490116119384765625, matching the value shown on Harald's site as far as it's printed:
0.10000000149011612 (out by 0.00000000149011612)
And when you turn off the least significant (rightmost) bit, which is the smallest downward movement you can make, you get:
0.09999999403953552 (out by 0.00000000596046448)
Putting those two together:
0.10000000149011612 (out by 0.00000000149011612)
|
0.09999999403953552 (out by 0.00000000596046448)
you can see that the first one is a closer match, by about a factor of four (14.9:59.6). So that's the closest value you can get to 0.1.
Since floats get stored in binary, the fractional portion is effectively in base two... and one-tenth is a repeating fraction in base two, just as one-ninth is in base ten.
The most common ways to deal with this are to store your values as appropriately-scaled integers, as in the C# or SQL currency types, or to round off floating-point numbers when you display them.
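A minimal sketch of the scaled-integer idea (the choice of hundredths here is just an example unit; the point is that the fraction never touches binary floating point):

#include <cstdio>
int main() {
    long long hundredths = 10;             // 0.10 stored as an exact integer count of hundredths
    long long total = hundredths * 3;      // exact integer arithmetic, no rounding at all
    printf("%lld.%02lld\n", total / 100, total % 100);   // 0.30
}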

Long Integer and Float

If a Long Integer and a float both take 4 bytes to store in memory then why are their ranges different?
Integers are stored like this:
1 bit for the sign (+/-)
31 bits for the value.
Floats are stored differently, giving greater range at the expense of accuracy:
1 bit for the sign (+/-)
N bits for the mantissa S
M bits for the exponent E
Float is represented in the exponential form: (+/-)S*(base)^E
BTW, "long" isn't always 32 bits. See this article.
Different way to encode your numbers.
A long counts up from 0 to 2^(4*8) - 1 (or, when signed, from -2^31 to 2^31 - 1).
A float uses only 23 of the 32 bits for the "counting". But it adds "range" with an exponent in the other bits. So you have bigger numbers, but they are less accurate (in the lower-order parts):
1.2424 * 10^54 (mantissa * 10^exponent) is certainly bigger than 2^32. But you can discern a long 2^31 from a long 2^31 - 1, whereas you can't discern a float 1.24 * 10^54 from a float 1.24 * 10^54 - 1: the 1 is simply lost in the float representation.
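You can see the same effect with much smaller numbers; a minimal sketch, assuming IEEE 754 single precision:

#include <cstdio>
int main() {
    long a = 2147483647L;          // 2^31 - 1: a long can tell this apart from its neighbours
    float f1 = 2147483647.0f;      // near 2^31 a float's representable values are 128 apart,
    float f2 = 2147483646.0f;      // so both of these round to 2147483648.0
    printf("%ld\n", a);
    printf("%.1f %.1f equal=%d\n", f1, f2, (int)(f1 == f2));   // equal=1
}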
They are not always the same size. But even when they are, their ranges are different because they serve different purposes. One is for integers with no decimal places, and one is for decimals.
This can be explained in terms of why a floating point representation can represent a larger range of numbers than a fixed point representation. This text from the Wikipedia entry:
The advantage of floating-point representation over fixed-point (and integer) representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 8765.43, 123.00, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
Indeed a float takes 4 bytes (32 bits), but since it's a float you have to store different things in these 32 bits:
1 bit is used for the sign (+/-)
8 bits are used for the exponent
23 bits are used for the significand (the significant digits)
You can see that the precision of a float depends directly on the number of bits allocated to the significand, while the min/max values depend on the number of bits allocated to the exponent.
With the example above:
8 bits for the exponent gives an exponent range of roughly [-126, 127]
--> so the max is about 2^127, i.e. 10^(127*log10(2)) ≈ 10^38
Regarding a long integer, you've got 1 bit used for the sign and then 31 bits to represent the integer value leading to a max of 2 147 483 647.
You can have a look at Wikipedia for more precise info:
Wikipedia - Floating point
Their ranges are different because they use different ways of representing numbers.
long (in C) is equivalent to long int. The size of this type varies between processors and compilers, but is often, as you say, 32 bits. At 32 bits, it can represent 2^32 different values. Since we often want to use negative numbers, computers normally represent integers using a format called "two's complement". This way, we can represent numbers from -(2^31) up to (2^31 - 1). Counting the number 0, this adds up to 2^32 numbers.
float (in C) is usually a single precision IEEE 754 formatted number. At 32 bits, this data type can also take 2^32 different bit patterns, but they are not used to directly represent whole numbers, like in the long. Instead, they represent a sign, and the mantissa and exponent of a normalized binary number.
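You can print both ranges directly (a minimal sketch; the long values shown assume a 32-bit long as discussed above):

#include <iostream>
#include <limits>
int main() {
    std::cout << std::numeric_limits<long>::min()  << " to "
              << std::numeric_limits<long>::max()  << '\n';  // e.g. -2147483648 to 2147483647
    std::cout << std::numeric_limits<float>::max() << '\n';  // about 3.40282e+38
}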
In general: when you have a greater range of values (a float goes up to roughly 10^38), you have less precision.
This is what happens here. If you need integers, a 32-bit long will give you more precision within its range.
At a handwavy high level, floating point sacrifices integer precision to extend its range. This is done by combining a base value with a scaling factor. For large values, a float will not be able to precisely represent all integers, but for small values it will represent them with better-than-integer resolution.
No, the sizes of primitive data types in C are implementation-defined.
This wiki entry clearly states: The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.