There is an example at http://www.gotw.ca/gotw/067.htm:
int main()
{
    double x = 1e8;
    //float x = 1e8;
    while( x > 0 )
    {
        --x;
    }
}
When you change the double to float, it becomes an infinite loop in VS2008.
According to the GotW explanation:
What if float can't exactly represent all integer values from 0 to
1e8? Then the modified program will start counting down, but will
eventually reach a value N which can't be represented and for which
N-1 == N (due to insufficient floating-point precision)... and then
the loop will stay stuck on that value until the machine on which the
program is running runs out of power.
From what I understand, an IEEE 754 float is single precision (32 bits), its range is about +/- 3.4e+/-38, and it has about 7 significant decimal digits.
But I still don't understand how exactly this happens: "eventually reach a value N which can't be represented and for which N-1 == N (due to insufficient floating-point precision)." Can someone try to explain this bit?
A bit of extra info: with double x = 1e8 the program finishes in about 1 second. With
float x = 1e8 it runs much longer (still running after 5 minutes). With float x = 1e7; it also finishes in about 1 second.
My testing environment is VS2008.
BTW I'm NOT asking the basic IEEE 754 format explanation as I already understand that.
Thanks
Well, for the sake of argument, let's assume we have a processor which represents a floating point number with a significand of 7 decimal digits and an exponent of, say, 2 decimal digits. So now the number 1e8 would be stored as
1.000 000 e 08
(where the "." and "e" need not be actually stored.)
So now you want to compute "1e8 - 1". 1 is represented as
1.000 000 e 00
Now, in order to do the subtraction we first do a subtraction with infinite precision, then normalize so that the first digit before the "." is between 1 and 9, and finally round to the nearest representable value (with ties going to even, say). The infinite precision result of "1e8 - 1" is
0.99 999 999 e 08
or normalized
9.9 999 999 e 07
As can be seen, the infinite precision result needs one more digit in the significand than what our architecture actually provides; hence we need to round (and re-normalize) the infinitely precise result to 7 significant digits, resulting in
1.000 000 e 08
Hence you end up with "1e8 - 1 == 1e8" and your loop never terminates.
Now, in reality you're using IEEE 754 binary floats, which are a bit different, but the principle is roughly the same.
The operation --x is (in this case) equivalent to x = x - 1. That means the original value of x is taken, 1 is subtracted (as if with infinite precision, as mandated by IEEE 754-1985), and then the result is rounded to the nearest value in the float value space.
The rounded result for the numbers 1.0e8f + i is given for i in [-10;10] below:
-10: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-9: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-8: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-7: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-6: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-5: 9.9999992E7 (binary +|10011001|01111101011110000011111)
-4: 1.0E8 (binary +|10011001|01111101011110000100000)
-3: 1.0E8 (binary +|10011001|01111101011110000100000)
-2: 1.0E8 (binary +|10011001|01111101011110000100000)
-1: 1.0E8 (binary +|10011001|01111101011110000100000)
0: 1.0E8 (binary +|10011001|01111101011110000100000)
1: 1.0E8 (binary +|10011001|01111101011110000100000)
2: 1.0E8 (binary +|10011001|01111101011110000100000)
3: 1.0E8 (binary +|10011001|01111101011110000100000)
4: 1.0E8 (binary +|10011001|01111101011110000100000)
5: 1.00000008E8 (binary +|10011001|01111101011110000100001)
6: 1.00000008E8 (binary +|10011001|01111101011110000100001)
7: 1.00000008E8 (binary +|10011001|01111101011110000100001)
8: 1.00000008E8 (binary +|10011001|01111101011110000100001)
9: 1.00000008E8 (binary +|10011001|01111101011110000100001)
10: 1.00000008E8 (binary +|10011001|01111101011110000100001)
So you can see that 1.0e8f and 1.0e8f + 4 and some other numbers have the same representation. Since you already know the details of the IEEE 754-1985 floating point formats, you also know that the remaining digits must have been rounded away.
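If you want to reproduce a row of that table yourself, here is a minimal sketch. It assumes float is a 32-bit IEEE 754 single and that the default round-to-nearest mode is active; the helper names minus_one and float_bits are made up for this illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Subtract 1 and force the result to be stored as a float, so that no
// extra intermediate precision can hide the rounding.
float minus_one(float x)
{
    volatile float r = x - 1.0f;
    return r;
}

// Format a float as "sign|exponent|fraction", the layout used in the
// table above. Assumes a 32-bit IEEE 754 single-precision float.
std::string float_bits(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    std::string out((bits >> 31) ? "-|" : "+|");
    for (int i = 30; i >= 0; --i)
    {
        out += ((bits >> i) & 1u) ? '1' : '0';
        if (i == 23)
            out += '|';   // separator between exponent and fraction
    }
    return out;
}
```

Under those assumptions minus_one(1.0e8f) compares equal to 1.0e8f, which is exactly the N-1 == N situation, while minus_one(1.0e7f) still gives 9999999.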
What is the result of n - 1 if n - 1 and n have both identical representation due to the approximate nature of floating point numbers?
Regarding "reach" a value that can't be represented, I think Herb was including the possibility of quite esoteric floating point representations.
With any ordinary floating point representations, you will either start with such value (i.e. stuck on first value), or you will be somewhere in the contiguous range of integers centered around zero that can be represented exactly, so that the countdown succeeds.
For IEEE 754, the 32-bit representation (typically float in C++) stores 23 mantissa bits, while the 64-bit representation (typically double in C++) stores 52 mantissa bits. Counting the implicit leading 1 bit, that gives 24 and 53 significant bits, so float can represent exactly every integer in the range -2^24 ... 2^24 and double every integer in the range -2^53 ... 2^53. :-)
Cheers & hth.,
Is -28.91 = 00100.0111?
28 -> 11100 then flip and add 1
-28 -> 00100
.91 -> 0111 with an accuracy of 4 binary places
I have tried a lot of places to check whether my conversion is correct, but without success. So I would like to ask people here if I am correct.
For addition / subtraction and other operations to work normally (by using binary addition on the whole bit-pattern), the whole thing (integer and fractional parts combined) as an integer has to be x * 2^4.
i.e. the actual value represented by 0b00100.0111 is 0b001000111 / 16.
That means you have to do 2's complement negation (binary subtraction from 0, or use the invert and add 1 identity) for the whole and fractional bits together.
Also, your value for 28 has its MSB set, so it's already negative, i.e. you've overflowed 5-bit signed 2's complement. Presumably you actually have a wider integer part.
For 16-bit 12.4 fixed-point, 28.91:
28.91 * 16 = 462.56, which rounds up to 463.
+463 = 0b0000000111001111
-463 = 0b1111111000110001
As 12.4 fixed-point, this 0b111111100011.0001 bit-pattern represents -463/16 = -28.9375, the nearest representable value to -28.91
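The same arithmetic can be sketched in C++. This is only an illustration: the helper names to_fixed_12_4 and from_fixed_12_4 are invented here, and the code assumes the input fits the 12.4 range (roughly -2048 to +2047.9375) and that int16_t uses two's complement.

```cpp
#include <cmath>
#include <cstdint>

// Encode a value as signed 12.4 fixed point: scale by 2^4 = 16 and round
// to the nearest integer. The sign is handled by the integer as a whole,
// not by negating the integer and fractional parts separately.
std::int16_t to_fixed_12_4(double value)
{
    return static_cast<std::int16_t>(std::lround(value * 16.0));
}

// Decode: the represented value is the raw integer divided by 16.
double from_fixed_12_4(std::int16_t raw)
{
    return raw / 16.0;
}
```

For -28.91 this yields the raw integer -463, whose 16-bit pattern is 0xFE31 (0b1111111000110001), and decoding it gives -28.9375.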
As we can see, int takes 4 bytes in memory, that is 32 bits; after applying the range formula, the range of int is -2147483648 to 2147483647. I have calculated the ranges of all data types except float, double, and long double.
I don't know how the range of float mentioned below was calculated.
Floating point numbers are stored as an exponent and a fraction within the space available.
For systems where float is implemented as an IEEE 754 single-precision value, the layout looks as below.
sign : 1 bit
exponent : 8 bits
fraction : 23 bits
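For normalized values the exponent allows powers from 2 ^ (-126) (2 to the power -126) to 2 ^ 127 (2 to the power 127); the all-zeros and all-ones exponent fields are reserved for zero/denormals and infinity/NaN.
Allowing a range of normalized magnitudes from
1.17549E-38 (2 ^ (-126))
3.40282E+38 ((2 - 2 ^ (-23)) * 2 ^ 127)
The fraction field holds the digits after the point of the significand, i.e. a fraction such as .12313
Thus with 23 fraction bits, the relative accuracy of a number is about 7 decimal digits, or 1.19 E-7 (2 ^ (-23)).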
For more details see wikipedia : IEEE 754-1985
On a given system, the <cfloat> / <float.h> will give the limits. For non IEEE 754 based representations, you would have to understand how the numbers are stored to calculate the limits.
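To see where the limits come from, here is a rough sketch that derives them from the field widths and compares them with what <limits> reports. It assumes an IEEE 754 style binary format no wider than double; the function names are invented for this example.

```cpp
#include <cmath>
#include <limits>

// Smallest positive normalized value for a binary format with the given
// number of exponent bits: 2^(1 - bias), where bias = 2^(bits - 1) - 1.
double min_normal(int exponent_bits)
{
    const int bias = (1 << (exponent_bits - 1)) - 1;
    return std::ldexp(1.0, 1 - bias);
}

// Largest finite value: the significand 1.111...1 (all fraction bits set)
// times 2^bias, i.e. (2 - 2^-fraction_bits) * 2^bias.
double max_finite(int exponent_bits, int fraction_bits)
{
    const int bias = (1 << (exponent_bits - 1)) - 1;
    return std::ldexp(2.0 - std::ldexp(1.0, -fraction_bits), bias);
}

// True when the formulas reproduce the limits reported for float
// (8 exponent bits, 23 fraction bits).
bool matches_float_limits()
{
    return min_normal(8) == std::numeric_limits<float>::min()
        && max_finite(8, 23) == std::numeric_limits<float>::max();
}
```

With 11 exponent bits and 52 fraction bits the same two formulas give the limits of an IEEE 754 double.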
-2^(n-1) to (2^(n-1)-1) is the formula to calculate the range of data types.
Where n = no.of.bits of the primitive data type.
For example: for the byte data type, n = 8 bits
-2^(8-1) to (2^(8-1)-1)
The above calculation will give you -128 to 127. Now, coming to the question of why the maximum is not 255: byte, short, int, and long are signed data types, meaning half of the range is below 0 (negative) and half is at or above 0 (non-negative). One bit acts as the sign, which leaves 7 bits, so there are 2^(8-1) = 128 values on each side. Zero counts as one of the non-negative values, so the largest positive number is 2^(8-1) - 1 = 127.
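The formula is easy to check in code. A minimal sketch, assuming two's complement and a bit count between 1 and 63 (so the shift does not overflow long long); Range and signed_range are names made up for this example.

```cpp
// Range of an n-bit signed (two's complement) integer type:
// -2^(n-1) to 2^(n-1) - 1. Valid here for 1 <= bits <= 63.
struct Range
{
    long long min;
    long long max;
};

Range signed_range(int bits)
{
    const long long half = 1LL << (bits - 1);   // 2^(n-1)
    return Range{ -half, half - 1 };
}
```

For example, signed_range(8) gives -128 to 127 and signed_range(32) gives -2147483648 to 2147483647.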
Q1: Will dividing an integer by its divisor lose precision?
int a=M*N,b=N;//M and N are random non-zero integers.
float c=float(a)/b;
if (c==M)
    cout<<"accurate"<<endl;
Q2: Will passing a float value lose precision?
float a=K;//K is a random float;
if (a==K)
    cout<<"accurate"<<endl;
Q1: Will dividing an integer by its divisor lose precision?
Yes. I used the following program to come up with some numbers:
#include <iostream>
#include <climits>
int main()
{
    int M = 10;
    int N = 7;
    int inaccurateCount = 0;
    for (; M < INT_MAX && inaccurateCount < 10; ++M )
    {
        int a = M*N;
        float c = float(a)/N;
        if ( c != M )
        {
            std::cout << "Not accurate for M: " << M << " and N: " << N << std::endl;
            inaccurateCount++;
        }
    }
    return 0;
}
and here's the output:
Not accurate for M: 2396747 and N: 7
Not accurate for M: 2396749 and N: 7
Not accurate for M: 2396751 and N: 7
Not accurate for M: 2396753 and N: 7
Not accurate for M: 2396755 and N: 7
Not accurate for M: 2396757 and N: 7
Not accurate for M: 2396759 and N: 7
Not accurate for M: 2396761 and N: 7
Not accurate for M: 2396763 and N: 7
Not accurate for M: 2396765 and N: 7
Q2:Will passing a float value lose precision ?
No, it shouldn't.
Q1: Will dividing an integer by its divisor lose precision?
You actually asked whether converting an int to a float will lose precision.
Yes, it will typically do that. On today's 32-bit (or wider) computer architectures an int stores 32 bits of data: 1 sign bit plus 31 value bits. A float also stores 32 bits of data, but these are: 1 sign bit, 8 exponent bits, and 23 fraction bits, cf. the IEEE 754 single-precision floating point format. (It might not lose precision on a 16-bit architecture, but I can't check that.)
Depending on the floating point number, it is stored in different representations. One is the normalized form, where the fractional part is preceded by a hidden one, so that we get a 24-bit significand. This is fewer bits than an int stores.
For example, the integer 01010101 01010101 01010101 01010101 (binary, spaces only for readability) cannot be expressed as a float without losing precision. In normalized form this would be 1.010101 01010101 01010101 01010101 * 2^30. So we have 30 significant binary digits after the point, which cannot be stored in 23 bits (the fractional part) without losing precision. The active rounding mode defines how the value is shortened.
Note that it does not depend on whether the value is actually "high". The integer 01000000 00000000 00000000 00000000 is 1.000000 00000000 00000000 00000000 * 2^30 in normalized form. This number has no significant bits after the point and can be stored without losing precision.
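A quick way to test a particular value is to convert it to float and back. This is a sketch that assumes a 32-bit int and an IEEE 754 single-precision float; survives_float is a hypothetical helper name.

```cpp
// Returns true when v can be converted to float and back without change.
// The comparison is done in long long because the rounded float may be
// 2^31, which does not fit in a 32-bit int.
bool survives_float(int v)
{
    volatile float f = static_cast<float>(v);   // force rounding to float
    return static_cast<long long>(f) == v;
}
```

With the two bit patterns above, survives_float(0x55555555) is false and survives_float(0x40000000) is true.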
Q2: Will passing a float value lose precision ?
No.
Q1: Will dividing an integer by its divisor lose precision?
If a is too large it might lose precision; otherwise (if a is small enough to be exactly represented as a float) it will not. The loss of precision may already happen when you convert a. The division can also lose precision, but sometimes these two losses cancel each other.
For example, take N = 8388609 and M = 5. You have the (binary) mantissa 100...001, multiply it by 101, and end up with 101000...0000101, but then the last two bits are rounded to zero and you get an error in (float)(N*M). When you then divide by five, you get 1000...00 and a remainder of 100, which means that the quotient should round up one step, and you get back the original number.
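That example can be checked step by step with a small sketch. It assumes IEEE 754 single precision with round-to-nearest; the helper names are invented, and volatile is only there to force each intermediate result to be rounded to float.

```cpp
// Convert an int to float, forcing the rounding to happen.
float int_to_float(int v)
{
    volatile float f = static_cast<float>(v);
    return f;
}

// Divide two floats, forcing the quotient to be rounded to float.
float divide(float a, float b)
{
    volatile float q = a / b;
    return q;
}
```

Here int_to_float(8388609 * 5) is 41943044 rather than 41943045 (the first rounding error), yet divide(41943044.0f, 5.0f) rounds 8388608.8 up to 8388609, so the comparison against N still succeeds.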
Q2:Will passing a float value lose precision ?
No, it will not lose precision. However, your code could still fail to identify it as accurate.
This can happen if K is a NaN (for example 0.0/0.0): then a will also be a NaN, and a NaN never compares equal to anything, not even to itself. In this case one could argue that you lost precision, and I agree, but it is not the assignment a=K that loses precision; you already lost it when producing K.
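A tiny sketch of the NaN case, assuming the compiler follows IEEE 754 comparison semantics (i.e. no fast-math style options); copy_compares_equal is a name made up here.

```cpp
// Copies k and reports whether the copy compares equal to the original.
// For every ordinary value this is true; for NaN it is false, even
// though the copy itself is bit-for-bit faithful.
bool copy_compares_equal(float k)
{
    float a = k;
    return a == k;
}
```

Calling it with std::numeric_limits<float>::quiet_NaN() (from <limits>) returns false, while any finite value returns true.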
It will not be exact, but to get more accurate answers you can use the types double and long double.
Case 1: Yes it loses precision in some cases. For small values of M it will be accurate.
Case 2: No it doesn't lose its precision.
I understand that floating point numbers can often include rounding errors.
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
Is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999?
The floor() function returns a floating point value that is an exact integer. So the premise of your question is wrong to begin with.
Now, floor(x) returns the nearest integral value that is not greater than x. It is always true that
floor(x) <= x
and that there exists no integer i, greater than floor(x), such that i <= x.
Looking at floor(3.14159265), this returns 3.0. There's no debate about that. Nothing more to say.
Where it gets interesting is if you write floor(x) where x is the result of an arithmetic expression. Floating point precision and rounding can mean that x falls on the wrong side of an integer. In other words, the true value of the expression that yields x is greater than some integer, i, but that x when evaluated using floating point arithmetic is less than i.
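A well-known instance of this, sketched under the assumption of IEEE 754 doubles with round-to-nearest: mathematically (0.1 + 0.7) * 10 is 8, but the computed value lands just below 8. The function name is hypothetical, and volatile is used only to force each intermediate to be rounded to double.

```cpp
#include <cmath>

// Computes floor((a + b) * scale) with every intermediate rounded to double.
double floor_of_scaled_sum(double a, double b, double scale)
{
    volatile double sum = a + b;
    volatile double product = sum * scale;
    return std::floor(product);
}
```

floor_of_scaled_sum(0.1, 0.7, 10.0) returns 7, not 8. floor() itself is exact here; the error was already present in the value handed to it.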
Small integers are representable exactly as floats, but big integers are not.
But, as others pointed out, a float large enough that it cannot hold every integer has no fractional bits left at all, so every such value is itself an integer and floor() will never return a non-integer value. Thus, the cast to (int), as long as it does not overflow, will be correct.
But how small is small? Copying shamelessly from this answer, the first integer that cannot be represented exactly is:
For float, 16,777,217 (2^24 + 1).
For double, 9,007,199,254,740,993 (2^53 + 1).
Note that the usual range of int (32 bits) goes up to 2^31 - 1, so float is unable to represent all of them exactly. Use double if you need that.
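Those two thresholds follow directly from the number of significand bits, which <limits> exposes as digits. A minimal sketch, assuming a binary floating point type with fewer than 62 significand bits (so not an 80-bit long double); first_inexact_integer is a name invented for this example.

```cpp
#include <limits>

// First positive integer that type T cannot represent exactly:
// 2^digits + 1, where digits counts the significand bits including
// the implicit leading 1 (24 for float, 53 for double).
template <typename T>
long long first_inexact_integer()
{
    return (1LL << std::numeric_limits<T>::digits) + 1;
}
```

first_inexact_integer<float>() is 16777217; converting that value to float gives 16777216 instead.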
Interestingly, floats can store a certain range of integers exactly, for example:
1 is stored as mantissa 1 (binary 1) * exponent 2^0
2 is stored as mantissa 1 (binary 1) * exponent 2^1
3 is stored as mantissa 1.5 (binary 1.1) * exponent 2^1
4 is stored as mantissa 1 * exponent 2^2
5 is stored as mantissa 1.25 (binary 1.01) * exponent 2^2
6 is stored as mantissa 1.5 (binary 1.1) * exponent 2^2
7 is stored as mantissa 1.75 (binary 1.11) * exponent 2^2
8 is stored as mantissa 1 (binary 1) * exponent 2^3
9 is stored as mantissa 1.125 (binary 1.001) * exponent 2^3
10 is stored as mantissa 1.25 (binary 1.01) * exponent 2^3
...
As you can see, the way exponents increase works in with the perfectly-stored fractional values the mantissa can represent.
You can get a good sense for this by putting numbers into this great online conversion site.
Once you cross a certain threshold, there are not enough digits in the mantissa to divide the span of the increased exponents without skipping first every odd integer value, then three out of every four, then 7 out of 8, etc. For numbers over this threshold, the issue is not that they might differ from integer values by some tiny fractional amount; it's that all the representable values are integers, and not only can no fractional part be represented any more, but as above some of the integers can't be either.
You can observe this in the calculator by considering:
Sign  Exponent  Mantissa                 Decimal
0     10010110  11111111111111111111111  16777215
0     10010111  00000000000000000000000  16777216
0     10010111  00000000000000000000001  16777218
See how at this stage, the smallest possible increment of the mantissa is actually "worth 2" in terms of the decimal value represented?
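You can measure that increment directly with std::nextafter. A short sketch, assuming IEEE 754 single precision; gap_above is a made-up helper name.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float gap_above(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

The gap is 1 at 16777215, 2 at 16777216, 4 at 33554432, and already 8 at 1e8.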
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
It's always exact. What floor is doing is effectively wiping out any '1's in the mantissa whose significance (their contribution to value) is fractional anyway.
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
No.
I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 111000000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0.
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using std::memcpy, copies the resulting bit pattern into a float. Reinterpreting through a type-punned pointer (reinterpret_cast) or abusing a union is a common hack, but both break the C++ aliasing rules. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
#include <cstring>

// Assumes a 32-bit unsigned int and a 32-bit IEEE-754 float.
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= (1u << 24))
        return -1.0f; // or abort(); or whatever you'd like here.
    int shifts = 0;
    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1u << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }
    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;
    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);
    // Copy the bit pattern into a float and return it.
    float result;
    std::memcpy(&result, &merged, sizeof result);
    return result;
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove more than 24 bits into 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
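To make the right-shift case concrete, here is one possible extension of the idea, not the only way to do it. It is a sketch that assumes a 32-bit IEEE-754 float and implements only round-to-nearest-even; the name uint_to_float_rounded is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Converts any 32-bit unsigned integer to float using bit operations,
// rounding to nearest with ties going to the even significand.
float uint_to_float_rounded(std::uint32_t value)
{
    if (value == 0)
        return 0.0f;
    // Find the position of the leading 1 (0 .. 31).
    int msb = 31;
    while ((value & (1u << msb)) == 0)
        --msb;
    std::uint32_t exponent = 127u + static_cast<std::uint32_t>(msb);
    std::uint32_t significand;
    if (msb <= 23)
    {
        // Fits: shift left so the leading 1 sits at bit 23. Nothing is lost.
        significand = value << (23 - msb);
    }
    else
    {
        // Too wide: shift right and round using the bits shifted away.
        const int drop = msb - 23;                        // 1 .. 8 bits
        const std::uint32_t dropped = value & ((1u << drop) - 1u);
        const std::uint32_t half = 1u << (drop - 1);
        significand = value >> drop;
        if (dropped > half || (dropped == half && (significand & 1u)))
            ++significand;
        // Rounding up can carry out of the 24-bit significand.
        if (significand == (1u << 24))
        {
            significand >>= 1;
            ++exponent;
        }
    }
    // Strip the hidden 1 and merge with the exponent field.
    const std::uint32_t bits = (exponent << 23) | (significand & 0x7FFFFFu);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For instance, 16777217 is a tie and rounds down to 16777216 (even significand), while 16777219 is also a tie and rounds up to 16777220.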