Some floating point precision and numeric limits question - c++

I know that there are tons of questions like this one, but I couldn't find my answers. Please read before voting to close (:
According to PC ASM:
The numeric coprocessor has eight floating point registers.
Each register holds 80 bits of data.
Floating point numbers are always stored as 80-bit
extended precision numbers in these registers.
How is that possible when sizeof shows something different? For example, on x64 architecture, sizeof(double) is 8, which is far away from 80 bits.
Why does std::numeric_limits< long double >::max() give me 1.18973e+4932?! This is a huuuuuuuuuuge number. If this is not the way to get the max of a floating point type, then why does this compile at all, and even more, why does it return a value?
what does this mean:
Double precision magnitudes can range from approximately 10^−308 to 10^308
These are huge numbers. How can you store them in 8 bytes, or even in 16 bytes (which is extended precision and only 128 bits)?
Obviously, I'm missing something. Actually, obviously, a lot of things.

1) sizeof is the size in memory, not in a register. sizeof is in bytes, so 8 bytes = 64 bits. When doubles are calculated in the FPU registers (on this architecture), they get an extra 16 bits for more precise intermediate calculations. When the value is copied back to memory, the extra 16 bits are lost.
2) Why do you think long double doesn't go up to 1.18973e+4932?
3) Why can't you store 10^308 in 8 bytes? I only need 13 bits: 4 to store the 10, and 9 to store the 308.

A double is not an Intel coprocessor 80-bit floating point number, it is an IEEE 754 64-bit floating point number. With sizeof(double) you will get the size of the latter.
This is the correct way to get the maximum value for long double, so your question is pointless.
You are probably missing that floating point numbers are not exact numbers. 10^308 doesn't store 308 digits, only about 15 to 17 significant digits for a double (about 19 for the 80-bit format).

The size of space that the FPU uses and the amount of space used in memory to represent double are two different things. IEEE 754 (which probably most architectures use) specifies 32-bit single precision and 64-bit double precision numbers, which is why sizeof(double) gives you 8 bytes. Intel x86 does floating point math internally using 80 bits.
std::numeric_limits< long double >::max() is giving you the correct maximum value for long double, which is typically 80 bits wide on x86. If you want the maximum for a 64-bit double, you should use double as the template parameter.
For the question about ranges, why do you think you can't store them in 8 bytes? They do in fact fit, and what you're missing is that at the extremes of the range there are numbers that can't be represented (for example, with the exponent nearing 308, there are many, many integers that can't be represented at all).
See also http://floating-point-gui.de/ for info about floating point.

Floating point numbers on computers are represented according to IEEE 754-2008.
It defines several formats, amongst which
binary32 = Single precision,
binary64 = Double precision and
binary128 = Quadruple precision are the most common.
http://en.wikipedia.org/wiki/IEEE_754-2008#Basic_formats
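Double precision numbers have 52 bits for the fraction (53 significant bits counting the implicit leading 1), which gives the precision, and 11 bits for the exponent, which gives the magnitude of the number.
So doubles are 1.xxx (52 binary digits) * 2^exponent, where the 11-bit exponent field allows exponents up to 1023.
The largest double is therefore just under 2^1024 = 1.79 * 10^308,
which is why this is the largest value you can store in a double.
A quadruple precision number has 112 bits for the fraction and 15 bits for the exponent, so the largest exponent is 16383 and the largest value is just under 2^16384.
As 2^16384 gives 1.18 * 10^4932, you see that your C++ test was perfectly correct. Note that the x87 80-bit extended format also has a 15-bit exponent (with a 64-bit mantissa), so a long double with this maximum is either the 80-bit extended format or quadruple precision; on x64 it is typically the 80-bit format. Your double is still stored as a 64-bit number.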

Related

How many decimal places does the primitive float and double support? [duplicate]

I have read that double stores 15 digits and float stores 7 digits.
My question is, are these numbers the number of decimal places supported or total number of digits in a number?
If you are on an architecture using IEEE-754 floating point arithmetic (as in most architectures), then the type float corresponds to single precision, and the type double corresponds to double precision, as described in the standard.
Let's make some numbers:
Single precision:
32 bits to represent the number, of which 24 bits are mantissa (23 stored bits plus the MSB, which is the "hidden 1" and is not stored). This means that the least significant bit (LSB) has a relative value of 2^(-23) with respect to the MSB, so the relative rounding error is at most 2^(-24). Therefore, for a fixed exponent, the smallest relative error is about 10^(-7.22). What this means is that for a representation in scientific notation (3.141592653589 E 25), only about 7.22 decimal digits are significant, which in practice means about 7 significant decimal digits (the guaranteed round-trip count, FLT_DIG, is 6).
Double precision:
64 bits to represent the number, of which 53 bits are mantissa (52 stored plus the hidden bit). Following the same reasoning, expressing 2^(-53) as a power of 10 results in 10^(-15.95), which in turn means about 15 to 16 significant decimal digits (the guaranteed round-trip count, DBL_DIG, is 15).
Those are the total number of "significant figures" if you will, counting from left to right, regardless of where the decimal point is. Beyond those numbers of digits, accuracy is not preserved.
The counts you listed are for the base 10 representation.
There are macros for the number of decimal places each type supports. The gcc docs explain what they are and also what they mean:
FLT_DIG
This is the number of decimal digits of precision for the float data type. Technically, if p and b are the precision and base (respectively) for the representation, then the decimal precision q is the maximum number of decimal digits such that any floating point number with q base 10 digits can be rounded to a floating point number with p base b digits and back again, without change to the q decimal digits.
The value of this macro is supposed to be at least 6, to satisfy ISO C.
DBL_DIG
LDBL_DIG
These are similar to FLT_DIG, but for the data types double and long double, respectively. The values of these macros are supposed to be at least 10.
On both gcc 4.9.2 and clang 3.5.0, these macros yield 6 and 15, respectively.
are these numbers the number of decimal places supported or total number of digits in a number?
They are the significant digits contained in every number (although you may not need all of them, but they're still there). The mantissa of the same type always contains the same number of bits, so every number consequentially contains the same number of valid "digits" if you think in terms of decimal digits. You cannot store more digits than will fit into the mantissa.
The number of "supported" digits is, however, much larger, for example float will usually support up to 38 decimal digits and double will support up to 308 decimal digits, but most of these digits are not significant (that is, "unknown").
Although technically, this is wrong, since float and double do not have universally well-defined sizes like I presumed above (they're implementation-defined). Also, storage sizes are not necessarily the same as the sizes of intermediate results.
The C++ standard is very reluctant at precisely defining any fundamental type, leaving almost everything to the implementation. Floating point types are no exception:
3.9.1 / 8
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
Now of course all of this is not particularly helpful in practice.
In practice, floating point is (usually) IEEE 754 compliant, with float having a width of 32 bits and double having a width of 64 bits (as stored in memory, registers have higher precision on some notable mainstream architectures).
This is equivalent to 24 bits and 53 bits of mantissa, respectively, or 7 and 15 full decimals.

Errors multiplying large doubles

I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like
1000000000000000000000*1000000000000000000000
I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.
Floating point numbers represent values with a fixed size representation. A double can represent about 16 decimal digits in a form where the decimal digits can be restored (internally, it normally stores the value using base 2, which means that it cannot accurately represent most fractional decimal values). If the number of digits is exceeded, the value will be rounded appropriately. Of course, the upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more than 16 decimal digits either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers which are bigger than 1e16) the formatting will conjure up more digits: it will accurately represent the internally held binary value, which may produce up to, I think, 54 non-zero digits.
If you want to compute with large values accurately, you'll need some variable sized representation. Since your values are integers a big integer representation might work. These will typically be a lot slower to compute with than double.
A double stores 53 bits of precision. This is about 15 decimal digits. Your problem is that a double cannot store the number of digits you are trying to store. Digits after the 15th decimal digit will not be accurate.
That's not an error. It's exactly because of how floating-point types are represented, as the result is precise to double precision.
Floating-point types in computers are written in the form (-1)^sign * mantissa * 2^exp so they only have broader ranges, not infinite precision. They're only accurate to the mantissa precision, and the result after every operation will be rounded as such. The double type is most commonly implemented as IEEE-754 64-bit double precision with 53 bits of mantissa so it can be correct to log10(2^53) ≈ 15.955 decimal digits. Doing 1e21*1e21 produces 1e42 which when rounding to the closest value in double precision gives the value that you saw. If you round that to 16 digits it's exactly the same as 1e42.
If you need more range, use double or long double. If you only work with integers then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) has a much larger precision (64/128 bits compared to 53 bits). If you need even more precision, use an arbitrary-precision arithmetic library such as GMP instead.

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, addition on a matrix of double elements.
The problem is that when the size of matrix increases the accuracy of my output is drastically getting affected.
Currently I am using double for each element which I believe uses 8 bytes of memory & has accuracy of 16 digits irrespective of decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books & I could find long double.
But I dont know what is its precision.
And what if I want more precision than that?
According to Wikipedia, 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has 64 bits mantissa, with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This will increase overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number will make them disappear. So the latter approach reduces the numerical error
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures long double is typically the 80-bit extended format (64 bits of mantissa), depending on the compiler.
What kind of values do you want to represent? Maybe you are better off using fixed precision.

How can 8 bytes hold 302 decimal digits? (Euler challenge 16)

c++ pow(2,1000) is normally too big for double, but it's working. why?
So I've been learning C++ for couple weeks but the datatypes are still confusing me.
One small minor thing first: the code that 0xbadc0de posted in the other thread is not working for me.
First of all, pow(2,1000) gives me this: more than one instance of overloaded function "pow" matches the argument list.
I fixed it by changing pow(2,1000) -> pow(2.0,1000)
Seems fine, I run it and get this:
http://i.stack.imgur.com/bbRat.png
Instead of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
It is missing a lot of the digits. What might be causing that?
But now for the real problem.
I'm wondering how a 302-digit number can fit in a double (8 bytes)?
0xFFFFFFFFFFFFFFFF = 18446744073709551615, so how can the number be larger than that?
I think it has something to do with the floating point number encoding stuff.
Also what is the largest number that can possibly be stored in 8 bytes if it's not 0xFFFFFFFFFFFFFFFF?
Eight bytes contain 64 bits of information, so you can store 2^64 (about 1.8 * 10^19) unique items using those bits. Those items can easily be interpreted as the integers from 0 to 2^64 - 1. So you cannot store 302 decimal digits in 8 bytes; most numbers between 0 and 10^302 - 1 cannot be so represented.
Floating point numbers can hold approximations to numbers with 302 decimal digits; this is because they store the mantissa and exponent separately. Numbers in this representation store a certain number of significant digits (15-16 for doubles, if I recall correctly) and an exponent (which can go into the hundreds, if memory serves). However, if a floating point value is X bytes long, then it can only distinguish between 2^(8X) different values... not nearly enough for exactly representing all integers with 302 decimal digits.
To represent such numbers, you must use many more bits: about 1000, actually, or 125 bytes.
It's called 'floating point' for a reason. The datatype contains a number in the standard sense, and an exponent which says where the decimal point belongs. That's why pow(2.0, 1000) works, and it's why you see a lot of zeroes. A floating point (or double, which is just a bigger floating point) number contains a fixed number of digits of precision. All the remaining digits end up being zero. Try pow(2.0, -1000) and you'll see the same situation in reverse.
The number of decimal digits of precision in a float (32 bits) is about 7, and for a double (64 bits) it's about 16 decimal digits.
Most systems nowadays use IEEE floating point, and I just linked to a really good description of it. Also, the article on the specific standard IEEE 754-1985 gives a detailed description of the bit layouts of various sizes of floating point number.
2.0 ^ 1000 mathematically will have a decimal (non-floating) output. IEEE floating point numbers, and in your case doubles (as the pow function takes in doubles and outputs a double), have 52 bits of the 64-bit representation allocated to the mantissa, plus an implicit leading 1, for 53 significant bits. The sign has its own separate bit, so it does not reduce this. If you do the math, 2^53 = 9,007,199,254,740,992. Notice there are 16 digits; there are about 16 quality digits in the output you see.
Essentially the pow function is going to perform the exponentiation, but once a value moves past ~2^53, not every integer can be represented any more, so results in general lose precision. Only the top ~16 decimal digits are guaranteed; the digits to the right of them are not. (A power of two such as 2^1000 happens to be exactly representable, so how many of its 302 digits you see depends on how the output library prints it.)
Thus it is a floating point precision / rounding problem.
If you were strictly in unsigned integer land, the number would overflow after (2^64 - 1) = 18,446,744,073,709,551,615. What overflowing means is that you would never actually see the number go ANY HIGHER than that; in fact, I believe the answer would be 0 from this operation. Once the answer reaches 2^64, the result register would wrap to zero, and any multiply afterwards would be 0 * 2, which would always result in 0. I would have to try it.
The exact answer (as you show) can be obtained on a standard computer using a multi-precision library. What these do is emulate a computer with more bits by concatenating several of the smaller data types, and use algorithms to convert and print on the fly. Mathematica is one example of a math engine that implements an arbitrary precision math calculation library.
Floating point types can cover a much larger range than integer types of the same size, but with less precision.
They represent a number as:
a sign bit s to indicate positive or negative;
a mantissa m, a value between 1 and 2, giving a certain number of bits of precision;
an exponent e to indicate the scale of the number.
The value itself is calculated as m * pow(2,e), negated if the sign bit is set.
A standard double has a 53-bit mantissa, which gives about 16 decimal digits of precision.
So, if you need to represent an integer with more than (say) 64 bits of precision, then neither a 64-bit integer nor a 64-bit floating-point type will work. You will need either a large integer type, with as many bits as necessary to represent the values you're using, or (depending on the problem you're solving) some other representation such as a prime factorisation. No such type is available in standard C++, so you'll need to make your own.
If you want to calculate the range of values that a signed integer of some size can hold, it is -(2^(64bits - 1bit)) to (2^(64bits - 1bit) - 1).
This is because the leftmost bit of the variable represents the sign (+ and -).
So the limit on the negative side is: -(2^(64bits - 1bit))
and the limit on the positive side is: (2^(64bits - 1bit) - 1)
There is a -1 on the positive side because of 0 (to avoid counting 0 twice, once for each side).
For example, for 64 bits the range is approximately [-9.223372e+18] to [9.223372e+18]

Long Integer and Float

If a Long Integer and a float both take 4 bytes to store in memory then why are their ranges different?
Integers are stored like this:
1 bit for the sign (+/-)
31 bits for the value.
Floats are stored differently, giving greater range at the expense of accuracy:
1 bit for the sign (+/-)
N bits for the mantissa S
M bits for the exponent E
Float is represented in the exponential form: (+/-)S*(base)^E
BTW, "long" isn't always 32 bits. See this article.
Different way to encode your numbers.
Long counts through 2^(4*8) consecutive integers (from -2^31 to 2^31 - 1 when signed).
Float uses only 23 of the 32 bits for the "counting". But it adds "range" with an exponent in the other bits. So you have bigger numbers, but they are less accurate (in the low-order digits):
1.2424 * 10^30 (mantissa * exponent) is certainly bigger than 2^32. But you can discern a long 2^31 - 1 from a long 2^31 - 2, whereas you can't discern a float 1.24 * 10^30 from a float 1.24 * 10^30 - 1: the 1 is simply lost in the float representation.
They are not always the same size. But even when they are, their ranges are different because they serve different purposes. One is for integers with no decimal places, and one is for decimals.
This can be explained in terms of why a floating point representation can represent a larger range of numbers than a fixed point representation. This text from the Wikipedia entry:
The advantage of floating-point representation over fixed-point (and integer) representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 8765.43, 123.00, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
Indeed a float takes 4 bytes (32bits), but since it's a float you have to store different things in these 32 bits:
1 bit is used for the sign (+/-)
8 bits are used for the exponent
23 bits are used for the significand (the significant digits)
You can see that the precision of a float directly depends on the number of bits allocated to the significand, and the min/max values depend on the number of bits allocated to the exponent.
With the example above:
8 bits for the exponent (= size of a char) gives an exponent range of roughly [-128,127] (IEEE 754 reserves the two extreme codes, so normal numbers use [-126,127])
--> max is about 2^127, and 127*log10(2) is about 38, so the max is about 10^38
Regarding a long integer, you've got 1 bit used for the sign and then 31 bits to represent the integer value leading to a max of 2 147 483 647.
You can have a look at Wikipedia for more precise info:
Wikipedia - Floating point
Their ranges are different because they use different ways of representing numbers.
long (in c) is equivalent to long int. The size of this type varies between processors and compilers, but is often, as you say, 32 bits. At 32 bits, it can represent 2^32 different values. Since we often want to use negative numbers, computers normally represent integers using a format called "two's complement". This way, we can represent numbers from (-2^31) up to (2^31 - 1). Counting the number 0, this adds up to 2^32 numbers.
float (in c) is usually a single precision IEEE 754 formatted number. At 32 bits, this data type can also take 2^32 different bit patterns, but they are not used to directly represent whole numbers, like in the long. Instead, they represent a sign, and the mantissa and exponent of a normalized binary number.
In general: when you have more range of values (float has up to 10^many), you have less precision.
This is what happens here. If you need integers, 32-bit long will give you more.
At a handwavy high level, floating point sacrifices integer precision to extend its range. This is done by combining a base value with a scaling factor. For large values, a float will not be able to precisely represent all integers, but for small values it will represent better than integer precision.
No, the size of primitive data types in C is Implementation Defined.
This wiki entry clearly states: The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.