Float cast reduces value by 1 - c++

When casting (float)33554329L, the result is 33554328. If the number is then cast back to a long, the value stays at 33554328. Does anyone have an explanation for this?
Using VS2005 in C++ [non managed]

A 32-bit float has 23 bits for the mantissa, which gives 8,388,608 distinct values. This means that the accuracy is around 7 significant decimal digits. Your number has 8 significant decimal digits, so you see the loss of accuracy in the last significant digit.
Here's More information on float representation
Double precision values are 64 bits wide and have 52 bits for the mantissa, which gives 4,503,599,627,370,496 distinct values (a 16-digit number) and thus roughly 15-16 decimal digits of accuracy.
A decimal type stores its digits in base 10, so decimal values within its precision are held exactly rather than approximated in binary. C# has one (decimal, with 28-29 significant digits), but unfortunately there is no such primitive type in C++. You can probably find a third-party library that implements one in C++.

Read this:
"What every computer scientist should know about floating point"
http://www.validlab.com/goldberg/paper.pdf

Floats have very low precision for large numbers (anything above roughly 16 million).
A float's absolute precision is best close to zero. The further you go from zero, the larger the gap between neighbouring representable values. Above 2^24 (16,777,216) a float can no longer hold every integer, so it rounds to the closest value it can represent; 33554329 lies in the range where only even integers are available.
Solution: Use double (or a decimal representation).

The accuracy of representation of various floating point types varies based on their size. For a 32 bit float you can expect approximately 7 digits of precision. For doubles it is approximately 16 digits.
I highly recommend reading up on floating point representations and the various advantages and disadvantages. It'll save you a lot of hassle in the long run, especially when things like comparisons don't work as you expect.

Related

Errors multiplying large doubles

I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like
1000000000000000000000*1000000000000000000000
I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.
Floating point numbers represent values with a fixed-size representation. A double can represent about 16 decimal digits in a form where the decimal digits can be restored (internally, it normally stores the value using base 2, which means that it cannot accurately represent most fractional decimal values). If the number of digits is exceeded, the value will be rounded appropriately. Of course, the upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more than 16 decimal digits, either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers which are bigger than 1e16), the formatting will conjure up more digits: it will accurately represent the internally held binary value, which may produce up to, I think, 54 non-zero digits.
If you want to compute with large values accurately, you'll need some variable sized representation. Since your values are integers a big integer representation might work. These will typically be a lot slower to compute with than double.
A double stores 53 bits of precision. This is about 15 decimal digits. Your problem is that a double cannot store the number of digits you are trying to store. Digits after the 15th decimal digit will not be accurate.
That's not an error. It's exactly because of how floating-point types are represented, as the result is precise to double precision.
Floating-point types in computers are written in the form (-1)^sign * mantissa * 2^exp so they only have broader ranges, not infinite precision. They're only accurate to the mantissa precision, and the result after every operation will be rounded as such. The double type is most commonly implemented as IEEE-754 64-bit double precision with 53 bits of mantissa so it can be correct to log10(2^53) ≈ 15.955 decimal digits. Doing 1e21*1e21 produces 1e42 which when rounding to the closest value in double precision gives the value that you saw. If you round that to 16 digits it's exactly the same as 1e42.
If you need more range, use double or long double. If you only work with integers, then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) has much more precision (64/128 bits compared to 53 bits). If you need even more precision, use an arbitrary-precision arithmetic library such as GMP instead.

How can 8 bytes hold 302 decimal digits? (Euler challenge 16)

c++ pow(2,1000) is normally too big for double, but it's working. Why?
So I've been learning C++ for couple weeks but the datatypes are still confusing me.
One small minor thing first: the code that 0xbadc0de posted in the other thread is not working for me.
First of all, pow(2,1000) gives me this error: more than one instance of overloaded function "pow" matches the argument list.
I fixed it by changing pow(2,1000) -> pow(2.0,1000)
Seems fine. I run it and get this:
http://i.stack.imgur.com/bbRat.png
Instead of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
It is missing a lot of the digits. What might cause that?
But now for the real problem.
I'm wondering how a 302-digit number can fit in a double (8 bytes)?
0xFFFFFFFFFFFFFFFF = 18446744073709551615, so how can the number be larger than that?
I think it has something to do with the floating point number encoding stuff.
Also what is the largest number that can possibly be stored in 8 bytes if it's not 0xFFFFFFFFFFFFFFFF?
Eight bytes contain 64 bits of information, so you can store 2^64 ≈ 1.8 × 10^19 unique items using those bits. Those items can easily be interpreted as the integers from 0 to 2^64 - 1. So you cannot store 302 decimal digits in 8 bytes; most numbers between 0 and 10^302 - 1 cannot be so represented.
Floating point numbers can hold approximations to numbers with 302 decimal digits; this is because they store the mantissa and exponent separately. Numbers in this representation store a certain number of significant digits (15-16 for doubles, if I recall correctly) and an exponent (which can go into the hundreds, if memory serves). However, if a value is X bytes long, then it can only distinguish between 2^(8X) different values... not nearly enough for exactly representing integers with 302 decimal digits.
To represent such numbers, you must use many more bits: about 1000, actually, or 125 bytes.
It's called 'floating point' for a reason. The datatype contains a number in the standard sense, and an exponent which says where the decimal point belongs. That's why pow(2.0, 1000) works, and it's why you see a lot of zeroes. A floating point (or double, which is just a bigger floating point) number contains a fixed number of digits of precision. All the remaining digits end up being zero. Try pow(2.0, -1000) and you'll see the same situation in reverse.
The number of decimal digits of precision in a float (32 bits) is about 7, and for a double (64 bits) it's about 16 decimal digits.
Most systems nowadays use IEEE floating point, and I just linked to a really good description of it. Also, the article on the specific standard IEEE 754-1985 gives a detailed description of the bit layouts of various sizes of floating point number.
Mathematically, 2.0 ^ 1000 is an exact integer with 302 decimal digits. IEEE floating point numbers, and in your case doubles (as the pow function takes in doubles and outputs a double), have 52 bits of the 64-bit representation allocated to the mantissa. If you do the math, 2^52 = 4,503,599,627,370,496. The sign has its own separate bit and the leading 1 of the mantissa is implicit, so the effective precision is 53 bits and every integer up to 2^53 = 9,007,199,254,740,992 is exact. Notice there are 16 digits; that is why there are about 16 quality (non-zero) digits in the output you see.
Essentially the pow function is going to perform the exponentiation, but once a result moves past ~2^53 it can no longer be guaranteed exact. It will hold precision for the top ~16 decimal digits, and the digits to the right of those are not guaranteed. (An exact power of two such as 2^1000 is a special case: it is representable exactly, so if digits are missing from the printed output, the formatting routine is the likely cause.)
Thus it is a floating point precision / rounding problem.
If you were strictly in unsigned integer land, the number would overflow after (2^64 - 1) = 18,446,744,073,709,551,615. What overflowing means is that you would never actually see the number go any higher than that; in fact I believe the answer from this operation would be 0. Once the running product reaches 2^64, the result register would be zero, and any multiply afterwards would be 0 * 2, which would always result in 0. I would have to try it.
The exact answer (as you show) can be obtained on a standard computer using a multi-precision library. What these do is emulate a larger-bit computer by concatenating multiple values of the smaller data types, and use algorithms to convert and print on the fly. Mathematica is one example of a math engine that implements an arbitrary precision math calculation library.
Floating point types can cover a much larger range than integer types of the same size, but with less precision.
They represent a number as:
a sign bit s to indicate positive or negative;
a mantissa m, a value between 1 and 2, giving a certain number of bits of precision;
an exponent e to indicate the scale of the number.
The value itself is calculated as m * pow(2,e), negated if the sign bit is set.
A standard double has a 53-bit mantissa, which gives about 16 decimal digits of precision.
So, if you need to represent an integer with more than (say) 64 bits of precision, then neither a 64-bit integer nor a 64-bit floating-point type will work. You will need either a large integer type, with as many bits as necessary to represent the values you're using, or (depending on the problem you're solving) some other representation such as a prime factorisation. No such type is available in standard C++, so you'll need to make your own.
If you want to calculate the range of values that a signed integer of a given size can hold, it is -(2^(bits - 1)) to (2^(bits - 1) - 1), which for 64 bits is -(2^63) to (2^63 - 1).
This is because the leftmost bit of the variable represents the sign (+ and -).
So the limit on the negative side is: -(2^(64 - 1))
and the limit on the positive side is: (2^(64 - 1) - 1)
The -1 on the positive side is there because of 0 (to avoid counting 0 once for each side).
For example, for 64 bits the range is approximately [-9.223372e+18] to [9.223372e+18]

How to check whether a huge floating point number is an integer?

I have a very large floating point number (around 20 digits) and I want to check whether it is an integer or not. For example, if I have a number like 154.0 then it is an integer while 154.123123 is not an integer.
I need to check for very huge floating point numbers (20 digits or more), which means I can't first convert it into a long long datatype and see if both of them are the same. Please shove me in the right direction. I would appreciate answers only in C/C++. Thank you! :)
Well, what's "huge"? If the number is really huge in a sense that the number of digits is greater than the number representable by the mantissa of your floating-point number, then your floating-point number is always an integer.
For example, the IEEE 754 double-precision format has a 52-bit mantissa, which is sufficient for about 16 decimal digits. If your numbers have 20 decimal digits then any attempt to squeeze such numbers into a double will result in rounding, effectively turning your numbers into "integers".
You mention that your numbers are too large to fit into the long long datatype. If you are referring to 64-bit long long datatype, then it automatically means that your numbers are so large that they'll never have any fractional part when represented by a typical double type, i.e. they will always be "integers" if represented by double values.
P.S. Are you using some exotic floating-point type with an extra-wide mantissa?
Just test whether x == floor(x)?

floating point issue

I have a floating value as 0.1 entering from UI.
But while converting that string to float, I am getting 0.10...01. The problem is the appended non-zero digit. How do I tackle this problem?
Thanks,
iSight
You need to do some background reading on floating point representations: http://docs.sun.com/source/806-3568/ncg_goldberg.html.
Given computers are on-off switches, they're storing a rounded answer, and they work in base two not the base ten we humans seem to like.
Your options are to:
display it back with fewer digits so you round back to base 10 (check out the Standard library's <iomanip> header, and setprecision)
store the number in some actual decimal-capable object - you'll find plenty of C++ classes to do this via google, but none are provided in the Standard, nor in boost last I looked
convert the input from a string directly to an integral number of some smaller unit (like thousandths), avoiding the rounding.
0.1 (decimal) = 0.00011001100110011... (binary)
So, in general, a number you can represent with a finite number of decimal digits may not be representable with a finite number of bits. But floating point numbers only store the N most significant bits. So, conversions between a decimal string and a "binary" float usually involve rounding.
However a lossless roundtrip conversion decimal string -> double -> decimal string is possible if you restrict yourself to decimal strings with at most 15 significant digits (assuming IEEE 754 64 bit floats). This includes the last conversion. You need to produce a string from the double with at most 15 significant digits.
It is also possible to make the roundtrip double -> string -> double lossless. But here you may need decimal strings with 17 decimal digits to make it work (again assuming IEEE-754 64bit floats).
The best site I've ever seen that explains why some numbers can't be represented exactly is Harald Schmidt's IEEE754 Converter site.
It's an online tool for showing representations of IEEE754 single precision values and I liked it so much, I wrote my own Java app to do it (and double precision as well).
Bottom line, there are only about four billion different 32-bit values you can have but there are an infinite number of real values between any two different values. So you have a problem with precision. That's something you'll have to get used to.
If you want more precision and/or better type for decimal values, you can either:
switch to a higher number of bits.
use a decimal type
use a big-number library like GMP (although I refuse to use this in production code since I discovered it doesn't handle memory shortages elegantly).
Alternatively, you can use the inaccurate values (their error rates are very low, something like one part per hundred million for floats, from memory) and just print them out with less precision. Printing out 0.10000000149 to two decimal places will get you 0.10.
You would have to do millions and millions of additions for the error to accumulate noticeably. Fewer of other operations, of course, but still a lot.
As to why you're getting that value, 0.1 is stored in IEEE754 single precision format as follows (sign, exponent and mantissa):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm  1/n
0 01111011 10011001100110011001101
           ||||||||||||||||||||||+- 8388608
           |||||||||||||||||||||+-- 4194304
           ||||||||||||||||||||+--- 2097152
           |||||||||||||||||||+---- 1048576
           ||||||||||||||||||+----- 524288
           |||||||||||||||||+------ 262144
           ||||||||||||||||+------- 131072
           |||||||||||||||+-------- 65536
           ||||||||||||||+--------- 32768
           |||||||||||||+---------- 16384
           ||||||||||||+----------- 8192
           |||||||||||+------------ 4096
           ||||||||||+------------- 2048
           |||||||||+-------------- 1024
           ||||||||+--------------- 512
           |||||||+---------------- 256
           ||||||+----------------- 128
           |||||+------------------ 64
           ||||+------------------- 32
           |||+-------------------- 16
           ||+--------------------- 8
           |+---------------------- 4
           +----------------------- 2
The sign is positive, that's pretty easy.
The exponent is 64+32+16+8+2+1 = 123; subtracting the bias of 127 gives -4, so the multiplier is 2^-4 or 1/16.
The mantissa is chunky. It consists of 1 (the implicit base) plus (for all those bits with each being worth 1/(2^n) as n starts at 1 and increases to the right), {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the multiplier, you get 0.100000001490116119384765625, matching the double precision value on Harald's site as far as it's printed:
0.10000000149011612 (out by 0.00000000149011612)
And when you turn off the least significant (rightmost) bit, which is the smallest downward movement you can make, you get:
0.09999999403953552 (out by 0.00000000596046448)
Putting those two together:
0.10000000149011612 (out by 0.00000000149011612)
|
0.09999999403953552 (out by 0.00000000596046448)
you can see that the first one is a closer match, by about a factor of four (14.9:59.6). So that's the closest value you can get to 0.1.
Since floats get stored in binary, the fractional portion is effectively in base-two... and one-tenth is a repeating fraction in base two, same as one-ninth is in base ten.
The most common ways to deal with this are to store your values as appropriately-scaled integers, as in the C# or SQL currency types, or to round off floating-point numbers when you display them.

Setprecision() for a float number in C++?

In C++,
What are the random digits that are displayed after giving setprecision() for a floating point number?
Note: After setting the fixed flag.
example:
float f1 = 3.14;
cout << fixed << setprecision(10) << f1 << endl;
Why do we get random-looking numbers for the remaining 7 digits? This is not the case with double.
Two things to be aware of:
floats are stored in binary.
float has a maximum of 24 significant bits. This is equivalent to 7.22 significant digits.
So, to your computer, there's no such number as 3.14. The closest you can get using float is 3.1400001049041748046875.
double has 53 significant bits (~15.95 significant digits), so you get a more accurate approximation, 3.140000000000000124344978758017532527446746826171875. The "noise" digits don't show up with setprecision(10), but would with setprecision(17) or higher.
They're not really "random" -- they're the (best available) decimal representation of that binary fraction (will be exact only for fractions whose denominator is a power of two, e.g., 3.125 would display exactly).
Of course that changes depending on the number of bits available to represent the binary fraction that best approaches the decimal one you originally entered as a literal, i.e., single vs double precision floats.
Not really a C++ specific issue (applies to all languages using binary floats, typically to exploit the machine's underlying HW, i.e., most languages). For a very bare-bone tutorial, I recommend reading this.