Reciprocal representation of integers in floating point numbers - ieee-754

I need to store a lot of different values as doubles between 0 and 1, to have a uniform representation. For example, an ARGB value, which is a 32-bit integer. Can doubles uniquely represent every integer value if I store it as a reciprocal? I know there are enough bits to do it, but I'm not sure whether the exponential spacing will prevent this.

The standard double has a 52-bit stored mantissa (53 bits of precision counting the implicit leading bit), so yes, it is capable of holding and exactly reproducing a 32-bit integer.
The other problem is the requirement that the values have to be between 0 and 1.
The reciprocal is not the way to do that! Counterexample: 1/3 is not exactly representable by a double.
You will have to divide the values to ensure the range. You may only divide or multiply by powers of two to preserve exact accuracy. So, given unsigned 32-bit values, convert them to double and then divide by 2^32. If you reverse that on reading (multiply by 2^32 and convert back to an integer), the values should be reproduced exactly. In C and C++ there are also standard library functions, frexp and ldexp, that manipulate the exponent and mantissa of a float or double directly; these may be more efficient and less error-prone.
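A minimal sketch of the power-of-two scaling described above, assuming an IEEE-754 double. The function names to_unit_interval and from_unit_interval are made up for illustration; std::ldexp is the standard library call that multiplies by a power of two.

```cpp
#include <cmath>
#include <cstdint>

// Map an unsigned 32-bit value into [0, 1) by dividing by 2^32.
// std::ldexp(x, -32) multiplies by 2^-32, which only changes the exponent,
// so no mantissa bits are lost.
double to_unit_interval(std::uint32_t v) {
    return std::ldexp(static_cast<double>(v), -32);
}

// Undo the scaling: multiply by 2^32 and convert back to an integer.
// Only meant for values produced by to_unit_interval.
std::uint32_t from_unit_interval(double d) {
    return static_cast<std::uint32_t>(std::ldexp(d, 32));
}
```

For example, from_unit_interval(to_unit_interval(0xFFFFFFFFu)) gives back 0xFFFFFFFFu, and to_unit_interval(0x80000000u) is exactly 0.5.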

Related

Casting Double to Long is giving wrong value [duplicate]

I'm currently learning inter-type data conversion in C++. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 2^53 in magnitude. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the x87 floating point unit of the x86 architecture uses an 80-bit extended floating point format with a 64-bit significand, which is designed to be able to represent all 64-bit integers precisely. With some compilers (GCC, for example) it can be accessed using the long double type; others, such as MSVC, treat long double as a 64-bit double.
This may happen if int is 64 bits and double is 64 bits as well. Floating point numbers are composed of a mantissa (which represents the digits) and an exponent. As the mantissa of the double in such a case has fewer bits than the int, the double is able to represent fewer digits and a loss of precision happens.
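To make the 64-bit case concrete, here is a small sketch, assuming a 64-bit IEEE-754 double. The helper name survives_double_roundtrip is invented for this example.

```cpp
#include <cstdint>

// Returns true if converting v to double and back reproduces v exactly.
bool survives_double_roundtrip(std::int64_t v) {
    const double d = static_cast<double>(v);
    // Values near INT64_MAX round up to 2^63, which does not fit in int64_t;
    // converting that back would be undefined, so treat it as a failure.
    if (d >= 9223372036854775808.0) {
        return false;
    }
    return static_cast<std::int64_t>(d) == v;
}
```

With this, survives_double_roundtrip(1LL << 53) is true, but survives_double_roundtrip((1LL << 53) + 1) is false, because 2^53 + 1 needs 54 bits of precision and gets rounded to 2^53.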

How to correctly normalize a floating point value in C++?

Maybe I don't understand the IEEE754 standard that much, but given a set of floating point values that are float or double, for example :
56.543f 3238.124124f 121.3f ...
you are able to convert them in values ranging from 0 to 1, so you normalize them, by taking an appropriate common factor while considering what is the maximum value and the minimum value in the set.
Now my point is that in this transformation I need a much higher precision for the set of destination that ranges from 0 to 1 if compared to the level of precision that I need in the first one, especially if the values in the first set are covering a wide range of numerical values ( really big and really small values ).
How can the float or double type (or the IEEE 754 standard, if you want) handle this situation and provide more precision for the second set of values, given that I will basically not need an integer part?
Or does it not handle this at all, so that I need fixed point math with a totally different type?
Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation to the top of the significand. Each value is carried with the same number of binary digits of precision relative to its own magnitude.
When you compress your set of floating point values to the range 0..1, the only precision loss you will get will be due to the rounding that occurs in the various steps of the process.
If you're merely compressing by scaling, you will lose only a small amount of precision near the LSBs of the mantissa (around 1 or 2 ulp, where ulp means "unit in the last place").
If you also need to shift your data, then things get trickier. If your data is all positive, then subtracting off the smallest number will not damage anything. But, if your data is a mixture of positive and negative data, then some of your values near zero may suffer a loss in precision.
If you do all the arithmetic at double precision, you'll carry 53 bits of precision through the calculation. If your precision needs fit within that (which likely they do), then you'll be fine. Otherwise, the exact numerical performance will depend on the distribution of your data.
Single and double IEEE floats have a format where the exponent and fraction parts have fixed bit-width. So this is not possible (i.e. you will always have unused bits if you only store values between 0 and 1). (See: http://en.wikipedia.org/wiki/Single-precision_floating-point_format)
Are you sure the 52-bit wide fraction part of a double is not precise enough?
Edit: If you use the whole range of the floating format, you will lose precision when normalizing the values. The roundings can be off and enough small values will become 0. Unless you know that this is a problem, don't worry. Otherwise you have to look up some other solution as mentioned in other answers.
Having binary floating point values (with an implicit leading one) expressed as
(1+fraction) * 2^exponent where fraction < 1
A division a/b is:
a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b))
Hence division/multiplication has essentially no loss of precision.
A subtraction a-b is:
a-b = (1+fraction(a)) * 2^exponent(a) - (1+fraction(b)) * 2^exponent(b)
Hence a subtraction/addition might have a loss of precision (big - tiny == big) !
Mapping a value x from the range [min, max] to [0, 1]
(x - min) / (max - min)
will have precision issues if any subtraction has a loss of precision.
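A rough sketch of that min/max mapping in double precision, not a definitive implementation. The function name normalize and the choice to map a constant input to 0.0 are my own assumptions.

```cpp
#include <algorithm>
#include <vector>

// Map every value to (x - min) / (max - min), computed in double precision.
std::vector<double> normalize(const std::vector<double>& values) {
    std::vector<double> out;
    if (values.empty()) {
        return out;
    }
    const auto mm = std::minmax_element(values.begin(), values.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    const double range = hi - lo;  // a subtraction: this is where precision can be lost
    out.reserve(values.size());
    for (double x : values) {
        // If all inputs are equal the range is zero; map them to 0.0 (an arbitrary choice).
        out.push_back(range == 0.0 ? 0.0 : (x - lo) / range);
    }
    return out;
}
```

Calling normalize({56.543, 3238.124124, 121.3}) gives exactly 0.0 for the smallest value, exactly 1.0 for the largest, and something in between for 121.3.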
Answering your question:
There is no single correct way; choose a suitable representation (floating point, fraction, multi precision ...) for your algorithms and expected data.
If you have a selection of doubles and you normalize them to between 0.0 and 1.0, there are a number of sources of precision loss. They are all, however, much smaller than you suspect.
First, you will lose some precision in the arithmetic operations required to normalize them as rounding occurs. This is relatively small -- a bit or so per operation -- and usually relatively random.
Second, the exponent component will no longer be using the positive exponent possibility.
Third, as all the values are positive, the sign bit will also be wasted.
Fourth, if the input space does not include +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.
But, for the most part, you'll waste about 3 bits of information in a 64 bit double in your normalization, one of which being the kind of thing that is nearly unavoidable when you deal with finite-bit-width values.
Any 64 bit fixed point representation of the values from 0 to 1 will have far less "range" than doubles. A double can represent something on the order of 10^-300, while a 64 bit fixed point representation that includes 1.0 can only go as low as 10^-19 or so. (The 64 bit fixed point representation can represent 1 - 10^-19 as being distinct from 1, while the double cannot, but the 64 bit fixed point value can not represent anything smaller than 2^-64, while doubles can).
Some of the numbers above are approximate, and may depend on rounding/exact format.
For higher precision you can try http://www.boost.org/doc/libs/1_55_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html.
Note also, that for the numerical critical operations +,- there are special algorithms that minimize the numerical error introduced by the algorithm:
http://en.wikipedia.org/wiki/Kahan_summation_algorithm

C++ I've just read that floats are inexact and do not store exact integer values. What does this mean?

I am thinking of this at a binary level.
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here)
0001
If they do both compile down to this then where does this inexactness come in.
Resource I'm using is http://www.cprogramming.com/tutorial/lesson1.html
Thanks.
It's possible. Floating point numbers are represented in an exponential notation (a*2^n), where some bits represent a (the significand), and some bits represent n (the exponent).
You can't uniquely represent all the integers in the range of a floating point value, due to the so-called pigeonhole principle. For example, 32-bit floats go up to over 10^38, but on 32 bits you can only represent 2^32 values - that means some integers will have the same representation.
Now, what happens when you try to, for example, do the following:
x = 10^38 - (10^38 - 1)
You should get 1, but you probably won't, because 10^38 and 10^38-1 are so close to each other that the computer has to represent them the same way. So, your 1.0f will usually be 1, but if this 1 is a result of calculation, it might not be.
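The subtraction above can be tried directly. This is just a sketch assuming a 32-bit IEEE-754 float; big_minus_neighbor is a name made up for the example.

```cpp
// Computes big - (big - 1) in single precision.
float big_minus_neighbor(float big) {
    // For a huge value, big - 1 rounds back to big itself,
    // because the gap between neighboring floats is far larger than 1.
    const float neighbor = big - 1.0f;
    return big - neighbor;
}
```

big_minus_neighbor(1e38f) comes out as 0 rather than 1, while big_minus_neighbor(1000.0f) is 1 as expected.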
Here are some examples.
To be precise: Integers can be exactly represented as floats if their binary representation does not use more bits than the float format supplies for the mantissa plus an implicit one bit.
IEEE floats have a mantissa of 23 bits, add one implicit bit, and you can store any integer representable with 24 bits in a float (that's integers up to 16777216). Likewise, a double has 52 mantissa bits, so it can store integers up to 9007199254740992.
Beyond that point, the IEEE format omits first the odd numbers, then all numbers not divisible by 4, and so on. So, even 0xffffff00ul is exactly representable as a float, but 0xffffff01ul is not.
So, yes, you can represent integers as floats, and as long as they don't become larger than the 16e6 or 9e15 limits, you can even expect additions between integers in float format to be exact.
A float will store an int exactly if the int is less than a certain number, but if you have a large enough int, there won't be enough bits in the mantissa to store all the bits of the integer. The missing bits are then assumed to be zero. If the missing bits aren't zero, then your int won't be equal to your float.
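A quick way to check whether a particular integer survives the trip into a float, assuming a 32-bit IEEE-754 float and a 64-bit double. The function name exact_in_float is hypothetical.

```cpp
#include <cstdint>

// Returns true if v can be stored in a float without rounding.
bool exact_in_float(std::uint32_t v) {
    const float f = static_cast<float>(v);  // may round to a neighboring float
    // Compare in double, which holds every 32-bit integer and every float exactly.
    return static_cast<double>(f) == static_cast<double>(v);
}
```

This returns true for 16777216 and 0xffffff00ul, and false for 16777217 and 0xffffff01ul, in line with the limits given above.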
Short answer: no, the floating point representation of integers is not that simple.
The representation used for the float type by practically every current C implementation (the C standard itself does not strictly require it) is called IEEE 754 single-precision and is probably more complicated than most people would like to delve into, but the link describes it thoroughly in case you're interested.
As for the representation of the integer 1: we can see how it's encoded in the 32-bit base-2 single-precision format defined by IEEE 754 here - 3f80 0000.
Suppose letters stand for a bit, 0/1. Then a floating point number looks (schematically) like:
smmmmee
where s is the sign +/- and the number is .mmmm x 10 ^ ee
Now if you have two immediately following numbers:
.mmm0 x 10 ^ ee
.mmm1 x 10 ^ ee
Then for a large exponent ee the difference might be more than 1.
And of course in base 2 a number like 1/5, 0.2, cannot be represented exactly. Summing fractions will increase the error.
(Note this is not the exact representation.)
would a float of value 1 and an integer of value 1 not compile down to (omitting lots of zeros here) 0001
No, a single-precision float 1.0 will be stored as the bit pattern 0x3f800000 (the bytes 00 00 80 3f in memory on a little-endian machine); the exact pattern depends on the precision.
What does this mean?
Some numbers cannot be precisely represented in binary form. 0.2 in binary form will look like 0.00110011001100110011... which will keep going on (and repeating) forever. No matter how many bits you use to store it, it will never be enough. That's because 0.2 is 1/5, and the denominator 5 is not a power of 2. The only way to represent it precisely is to store it as a ratio.
Floating point numbers have limited precision. Roughly speaking, they only store a certain number of digits after the first significant non-zero digit, and the rest will be lost. That'll result in errors: for example, with single precision floats 100000000000000001 and 100000000000000002 are rounded off to the same number.
You might also want to read something like this.
Conclusion:
If you're writing financial software, do not use floats. Use Bignums, using libraries like gmp
Unlike some modern dynamically typed programming languages such as JavaScript, which for a long time had a single basic numeric type, the C programming language has many. That is because C reflects the different ways of representing different kinds of numbers within a processor register.
To investigate different representations you can use the union construct where the same data can be viewed as different types.
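Define
union {
float x;        /* the value viewed as a float */
unsigned int v; /* the same 32 bits viewed as an unsigned integer */
} u;
Assign u.x = 1.0f and printf("0x%08x\n", u.v) to get the 32-bit representation of 1.0f as a floating point number. It should print 0x3f800000 and not 0x00000001 as one might expect. (Reading a union member other than the one last written is fine in C, but not in C++, where memcpy is the safe way to do this.)
As mentioned in earlier answers, this reflects the representation of a floating point number as a 32-bit value: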
1.0f = 0x3F800000 = 0011.1111.1000.0000.0000.0000.0000.0000 =
0 0111.1111 000.0000.0000.0000.0000.0000 = 0 0x7F 0
Here the three parts are sign s=0, exponent e=127, and mantissa m=0 and the floating point value is computed as
value = (-1)^s * (1 + m * 2^-23) * 2^(e-127)
With this representation every integer from -16,777,216 to 16,777,216 (that is, up to 2^24 in magnitude) can be represented exactly, because the 23 stored mantissa bits plus the implicit leading 1 give 24 bits of precision. This range is not sufficient for many applications, therefore the float type cannot replace the int type.
The range of exactly representable integers is wider for the double type, since the value occupies 64 bits and 52 of them store the mantissa (53 bits of precision with the implicit leading 1). Every integer from -9,007,199,254,740,992 to 9,007,199,254,740,992 (magnitude up to 2^53) is exact. Yet double requires twice as much memory.
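The sign / exponent / mantissa split can also be checked with a few lines of C++. This is only a sketch, assuming a 32-bit IEEE-754 float. FloatParts, float_bits, split and rebuild are names I made up, and rebuild only handles normal numbers (not zero, denormals, infinities or NaN).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

struct FloatParts {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 stored bits (the leading 1 is implicit)
};

// Copy the raw bits of a float into an integer (memcpy avoids aliasing problems).
std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Cut the 32 bits into the three fields.
FloatParts split(float x) {
    const std::uint32_t b = float_bits(x);
    return {b >> 31, (b >> 23) & 0xFFu, b & 0x7FFFFFu};
}

// Recompute the value of a normal number from its fields:
// (+/-) (1 + mantissa * 2^-23) * 2^(exponent - 127)
double rebuild(const FloatParts& p) {
    const double significand = 1.0 + std::ldexp(static_cast<double>(p.mantissa), -23);
    const double magnitude = std::ldexp(significand, static_cast<int>(p.exponent) - 127);
    return p.sign ? -magnitude : magnitude;
}
```

For 1.0f this gives sign 0, exponent 127 and mantissa 0, and rebuild(split(-6.5f)) returns -6.5.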
Another source of difficulty is the way fractional numbers are represented. Since most decimal fractions cannot be represented exactly (0.1f = 0x3dcccccd = 0.10000000149...), the use of floating point numbers breaks common algebraic identities. For example, in double precision:
0.1 + 0.2 != 0.3
This can be confusing and lead to errors that are hard to detect. In general strict equality should not be used with floating point numbers.
Another example of floating point arithmetic departing from algebraic correctness:
float x = 16777217.0f;
float y = 16777215.0f;
x -= 1.0f;
y += 1.0f;
if (y > x) {printf("16777215.0 + 1.0 > 16777217.0 - 1.0\n");}
Yet another issue is the behaviour of the system when the limits of exact representation are broken. When in integer arithmetic the result of an arithmetic operation is greater than the range of the type, this can be detected in many ways: a special OVERFLOW bit in the processor flags register is flipped, and the result is significantly different from the expected.
In floating point arithmetic as the example above shows, the loss of precision occurs silently.
Hope this helps to understand why one needs many basic numeric types in C.

Long Integer and Float

If a Long Integer and a float both take 4 bytes to store in memory then why are their ranges different?
Integers are stored like this:
1 bit for the sign (+/-)
31 bits for the value.
Floats are stored differently, giving greater range at the expense of accuracy:
1 bit for the sign (+/-)
N bits for the mantissa S
M bits for the exponent E
Float is represented in the exponential form: (+/-)S*(base)^E
BTW, "long" isn't always 32 bits. See this article.
Different way to encode your numbers.
A 4-byte long has 2^(4*8) = 2^32 distinct values; unsigned, it counts from 0 up to 2^32 - 1.
Float uses only 23 of the 32 bits for the "counting". But it adds "range" with an exponent in the other bits. So you have bigger numbers, but they are less accurate (in the low-order digits):
1.2424 * 10^30 (mantissa * 10^exponent) is certainly bigger than 2^32, and still within the range of a float, which tops out near 3.4 * 10^38. But you can discern a long 2^31 from a long 2^31-1 whereas you can't discern a float 1.24 * 10^30 and a float 1.24 * 10^30 - 1: the 1 just is lost in this representation as float.
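To see how coarse a float gets at large magnitudes, one can measure the distance to the next representable value. A small sketch, assuming a 32-bit IEEE-754 float; gap_above is a made-up helper built on the standard std::nextafter.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Near 1.0f the gap is 2^-23, at 16777216.0f (2^24) it is already 2, and around 1.24e30f it is 2^76, roughly 7.6 * 10^22, so adding or subtracting 1 there changes nothing.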
They are not always the same size. But even when they are, their ranges are different because they serve different purposes. One is for integers with no decimal places, and one is for decimals.
This can be explained in terms of why a floating point representation can represent a larger range of numbers than a fixed point representation. This text from the Wikipedia entry:
The advantage of floating-point representation over fixed-point (and integer) representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 8765.43, 123.00, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
Indeed a float takes 4 bytes (32bits), but since it's a float you have to store different things in these 32 bits:
1 bit is used for the sign (+/-)
8 bits are used for the exponent
23 bits are used for the significand (the significant digits)
You can see that the precision of a float depends directly on the number of bits allocated to the significand, and the min/max values depend on the number of bits allocated to the exponent.
With the upper example:
8 bits for the exponent (= size of a char) give, after removing the bias of 127 and the two reserved bit patterns, an exponent range of [-126,127]
--> max is about 2^128, i.e. 10^(128*log10(2)) = about 3.4 * 10^38
Regarding a long integer, you've got 1 bit used for the sign and then 31 bits to represent the integer value leading to a max of 2 147 483 647.
You can have a look at Wikipedia for more precise info:
Wikipedia - Floating point
Their ranges are different because they use different ways of representing numbers.
long (in C) is equivalent to long int. The size of this type varies between processors and compilers, but is often, as you say, 32 bits. At 32 bits, it can represent 2^32 different values. Since we often want to use negative numbers, computers normally represent integers using a format called "two's complement". This way, we can represent numbers from (-2^31) up to (2^31-1). Counting the number 0, this adds up to 2^32 numbers.
float (in C) is usually a single precision IEEE 754 formatted number. At 32 bits, this data type can also take 2^32 different bit patterns, but they are not used to directly represent whole numbers, like in the long. Instead, they represent a sign, and the mantissa and exponent of a normalized binary number.
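The two ranges can be read straight from std::numeric_limits. A sketch only: it uses std::int32_t to pin the integer at 32 bits (since long may be wider), and the constant names are my own.

```cpp
#include <cstdint>
#include <limits>

// 32-bit two's complement integer: -2^31 .. 2^31 - 1
constexpr std::int32_t kInt32Min = std::numeric_limits<std::int32_t>::min();
constexpr std::int32_t kInt32Max = std::numeric_limits<std::int32_t>::max();

// 32-bit float: far larger maximum, but only 24 bits of precision
constexpr float kFloatMax = std::numeric_limits<float>::max();
constexpr int kFloatPrecisionBits = std::numeric_limits<float>::digits;
```

On an IEEE 754 system kInt32Max is 2147483647, kFloatMax is about 3.4 * 10^38 and kFloatPrecisionBits is 24.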
In general: when you have more range of values (float has up to 10^many), you have less precision.
This is what happens here. If you need integers, 32-bit long will give you more.
At a handwavy high level, floating point sacrifices integer precision to extend its range. This is done by combining a base value with a scaling factor. For large values, a float will not be able to precisely represent all integers, but for small values it will represent better than integer precision.
No, the size of primitive data types in C is Implementation Defined.
This wiki entry clearly states: The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.