Convert Between Floating Point Standards - c++

I am trying to convert an IEEE based floating point number to a MIL-STD 1750A floating point number.
I have attached the specification for both:
I understand how to decompose the floating point 12.375 in IEEE format as per the example on wikipedia.
However, I'm not sure if my interpretation of the MIL-STD is correct.
12.375 = (12)b10 + (0.375)b10 = (1100)b2 + (0.011)b2 = (1100.011)b2
(1100.011)b2 = 0.1100011 x 2^4 => Exponent, E = 4.
4 in normalised 2's complement is (100)b2 = Exponent
Therefore a MIL-STD 1750A 32 bit floating point number is:
S=0, F=11000110000000000000000, E=00000100
Is my above interpretation correct?
For -12.375, is it just the sign bit that swaps? i.e.:
S=1, F=11000110000000000000000, E=00000100
Or does something funky happen with the fraction part?

The diagram above is a bit misleading, I think. In IEEE format, to switch from positive to negative, you simply flip the first bit. The remaining three bits can be treated as an unsigned number. In the MIL-STD format, the mantissa is a two's complement number, so while the first bit does indicate the sign, the remaining 23 bits do not remain the same.
What I get is
S=1, F=00111010000000000000000, E=00000100
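To make the two's-complement point concrete, here is a small sketch (not production code) that assembles both words and negates the mantissa field. The field layout and the expected hex values are my own working from the figures above, so treat them as illustrative:

#include <cstdint>
#include <cstdio>

// Build a MIL-STD 1750A 32-bit word from a 24-bit two's-complement mantissa
// (sign + 23 fraction bits, assumed to occupy bits 31..8) and an 8-bit
// two's-complement exponent (assumed to occupy bits 7..0).
std::uint32_t make1750a(std::int32_t mantissa24, std::int8_t exponent)
{
    return (static_cast<std::uint32_t>(mantissa24) << 8) |
           static_cast<std::uint8_t>(exponent);
}

int main()
{
    // +12.375 = 0.1100011 x 2^4: S=0, F=1100011 followed by zeros, E=4.
    std::int32_t m = 0x630000;             // 0110 0011 0000 0000 0000 0000
    std::uint32_t pos = make1750a(m, 4);

    // Negation is the two's complement of the entire 24-bit mantissa field;
    // the exponent stays the same.
    std::int32_t neg_m = (-m) & 0x00FFFFFF;   // 1001 1101 0000 ... as described above
    std::uint32_t neg = make1750a(neg_m, 4);

    std::printf("+12.375 -> 0x%08X\n", static_cast<unsigned>(pos));   // expected: 0x63000004
    std::printf("-12.375 -> 0x%08X\n", static_cast<unsigned>(neg));   // expected: 0x9D000004
    return 0;
}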

Related

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating point to an integer simply strips the fractional part of the number. For example, static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know: over what range can I trust static_cast<long long>(ceil(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•b^e for a non-negative e is an integer). If so, then M•b^0 is an integer larger than M•b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b^0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
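A quick sanity check of that argument (just a sketch; any IEEE-754 system should behave this way):

#include <iostream>

int main()
{
    // 13,000,000 is below 2^24, so the nearest float is 13,000,000 exactly.
    float f = 13000000.0f;
    std::cout << (f == 13000000.0) << '\n';                          // 1
    std::cout << (static_cast<long long>(f) == 13000000LL) << '\n';  // 1
}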
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
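You can see this rounding directly (a small sketch, assuming float is IEEE-754 binary32):

#include <cmath>
#include <iostream>

int main()
{
    // 8,388,608 is 2^23. Adding 0.5f gives 8,388,608.5, which is not
    // representable; round-to-nearest-ties-to-even sends it back to
    // 8,388,608, so ceil sees an exact integer.
    float x = 8388608.0f + 0.5f;
    std::cout << std::fixed;
    std::cout << x << '\n';             // 8388608.000000
    std::cout << std::ceil(x) << '\n';  // 8388608.000000
}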
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifyingly small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
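If you want to convince yourself of that range, a brute-force check is cheap (a sketch; it just round-trips every integer through float):

#include <iostream>

int main()
{
    // Verify the claim: every integer with magnitude up to 2^24 = 16,777,216
    // survives a round trip through float unchanged.
    bool all_exact = true;
    for (long long i = 0; i <= (1LL << 24); ++i)
        if (static_cast<long long>(static_cast<float>(i)) != i)
            all_exact = false;
    std::cout << (all_exact ? "all exact up to 2^24" : "found a failure") << '\n';
}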
Floating point numbers are represented by 3 integers, in the form c · b^q, where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about range. Obviously a 32-bit floating point cannot represent all the integers a 32-bit integer can, as the floating point must also represent many much larger and much smaller numbers. Since the exponent simply shifts the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
[source]
C++ provides numeric_limits<T>::digits as a way of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example, a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
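A self-contained version of that check might look like this (a sketch; the expected values assume the usual IEEE-754 float and double):

#include <iostream>
#include <limits>

int main()
{
    using std::numeric_limits;
    // Largest "all ones" mantissa value for each type, per the expression above.
    std::cout << ((1LL << numeric_limits<float>::digits) - 1LL) << '\n';   // 16777215
    std::cout << ((1LL << numeric_limits<double>::digits) - 1LL) << '\n';  // 9007199254740991
}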
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing; it's because of the bias introduced here. Logically q = -6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 would represent 1.3
b = 10 Again, the above rules are really only required for base-2, but I've shown them as they would apply to base-10 for the purpose of explanation
Translated back to base-2, this means that a number with a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits, continuing to increase it only introduces trailing zeros to the resulting number; thus calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL has no effect. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL. Thus for ceil to have an effect on the traditional binary IEEE-754 floating point (the sketch after this list computes these limits):
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
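Here is a small sketch of that limit, using a hypothetical helper name (ceil_matters_below is my own name, not a standard function); long double is omitted because its 113-bit mantissa does not fit in a long long:

#include <iostream>
#include <limits>

// Largest magnitude below which a value of type T can still carry a
// fractional part, i.e. the range in which ceil can change the value,
// following the expression given above.
template <typename T>
long long ceil_matters_below()
{
    return (1LL << (std::numeric_limits<T>::digits - 1)) - 1LL;
}

int main()
{
    std::cout << ceil_matters_below<float>() << '\n';   // 8388607
    std::cout << ceil_matters_below<double>() << '\n';  // 4503599627370495
}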

Why does numeric_limits<float>::min() not actually give the smallest possible float?

It seems that we can trivially derive floats that are smaller than numeric_limits<float>::min(). Why? If numeric_limits<float>::min() isn't supposed to be the smallest positive float, what is it supposed to be?
#include <iostream>
#include <limits>
using namespace std;
int main(){
    float mind = numeric_limits<float>::min();
    float smaller_than_mind = numeric_limits<float>::min() / 2;
    cout << ( mind > smaller_than_mind && smaller_than_mind > 0 ) << endl;
}
Run it here: https://onlinegdb.com/ry3AcxjXz
min() of a floating-point type returns the minimum positive value that has the full expressive power of the format—all bits of its significand are available for use.
Smaller positive values are called subnormal. Although they are representable, high bits of the significand are necessarily zero.
The IEEE-754 64-bit binary floating-point format represents a number with a sign (+ or -, encoded as 0 or 1), an exponent (-1022 to +1023, encoded as 1 to 2046, plus 0 and 2047 as special cases), and a 53-bit significand (encoded with 52 bits plus a clue from the exponent field).
For normal values, the exponent field is 1 to 2046 (representing exponents of -1022 to +1023) and the significand (in binary) is 1.xxx…xxx, where xxx…xxx represents 52 more bits. In all of these values, the value of the lowest bit of the significand is 2^-52 times the value of the highest significant bit (the first 1 in it).
For subnormal values, the exponent field is 0. This still represents an exponent of -1022, but it means the high bit of the significand is 0. The significand is now 0.xxx…xxx. As lower and lower values are used in this range, more leading bits of the significand become zero. Now, the value of the lowest bit of the significand is greater than 2^-52 times the value of the highest significant bit. You cannot adjust numbers as finely in this interval as in the normal interval because not all the bits of the significand are available for arbitrary values; some leading bits are fixed at 0 to set the scale.
Because of this, the relative errors that occur when working with numbers in this range tend to be greater than the relative errors in the normal range. The floating-point format has this subnormal range because, if it did not, the numbers would just cut off at the smallest normal value, and the gap between that normal value and zero would be a huge relative jump—100% of the value in a single step. By including subnormal numbers, the relative errors increase more gradually, and the absolute errors stay constant from this point until zero is reached.
It is important to know where the bottom of the normal range is. min() tells you this. denorm_min() tells you the ultimate minimum positive value.
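Both limits are easy to print (a quick sketch; the values shown are the usual IEEE-754 binary32 ones):

#include <iostream>
#include <limits>

int main()
{
    std::cout << std::numeric_limits<float>::min() << '\n';        // 1.17549e-38, smallest normal
    std::cout << std::numeric_limits<float>::denorm_min() << '\n'; // 1.4013e-45, smallest subnormal
}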
According to en.cppreference.com:
For floating-point types with denormalization, min returns the minimum
positive normalized value. Note that this behavior may be unexpected,
especially when compared to the behavior of min for integral types.
float is a type with denormalization, i.e. it supports subnormal values in addition to normalized floating point numbers.
Because numeric_limits<float>::min returns "For floating types with subnormal numbers, returns the minimum positive normalized value." You can divide that by 2 and get a subnormal (aka denormal on some platforms) number on some systems. These numbers don't store the full precision of the float type, but allow storing values that would otherwise become 0.0.
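You can confirm that min()/2 really is subnormal with std::fpclassify (a sketch; on a platform that flushes subnormals to zero the first test would fail):

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    float half_min = std::numeric_limits<float>::min() / 2.0f;
    std::cout << (std::fpclassify(half_min) == FP_SUBNORMAL) << '\n';  // 1
    std::cout << (half_min > 0.0f) << '\n';                            // 1
}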

Floating point resolution seems more limited than it ought to be

I'm seeing some error when simply assigning a floating point value which contains only 4 significant figures. I wrote a short program to debug and I don't understand what the problem is. After verifying the limits of a float on my platform, it seems like there shouldn't be any error. What's causing this?
#include <stdlib.h>
#include <stdio.h>
#include <limits>
#include <iostream>
int main(){
    printf("float size: %zu\n", sizeof(float));
    printf("float max: %e\n", std::numeric_limits<float>::max());
    printf("float significant figures: %i\n", std::numeric_limits<float>::digits10);

    float a = 760.5e6;
    printf("%.9f\n", a);
    std::cout.precision(9);
    std::cout << a << std::endl;

    double b = 760.5e6;
    printf("%.9f\n", b);
    std::cout << b << std::endl;
    return 0;
}
The output:
float size: 4
float max: 3.402823e+38
float significant figures: 6
760499968.000000000
760499968
760500000.000000000
760500000
A float has 24 bits of precision, which is roughly equivalent to 7 decimal digits. A double has 53 bits of precision, which is roughly equivalent to 16 decimal digits.
As mentioned in the comments, 760.5e6 is not exactly representable by float; however, it is exactly representable by double. This is why the printed results for double are exact, and those from float are not.
It is legal to request printing of more decimal digits than are representable by your floating point number, as you did. The results you report are not an error -- they are simply the result of the decimal printing algorithm doing the best it can.
The stored number in your float is 760,499,968. This is expected behavior for an IEEE 754 binary32 floating point number, which is what floats usually are.
IEEE 754 floating point numbers are stored in three parts: a sign bit, an exponent, and a mantissa. Since all these values are stored as bits the resulting number is sort of the binary equivalent of scientific notation. The mantissa bits are one less than the number of binary digits allowed as significant figures in the binary scientific notation.
Just like with decimal scientific numbers, if the exponent exceeds the significant figures, you're going to lose integer precision.
The analogy only extends so far: the mantissa is a modification of the coefficient found in the decimal scientific notation you might be familiar with, and there are certain bit patterns that have special meaning in the standard.
The ultimate result of this storage mechanism is that the integer 760,500,000 cannot be exactly represented by IEEE 754 binary32 with its 23-bit mantissa: integer-level precision is lost above 2^(mantissa_bits + 1) = 16,777,216, so the first unrepresentable integer for a 23-bit-mantissa float is 16,777,217. The closest integers to 760,500,000 that can be represented by a float are 760,499,968 and 760,500,032, the former of which is chosen due to the round-ties-to-even rule, and printing the value at greater precision than the floating point number can hold will naturally result in apparent inaccuracies.
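You can see the 64-wide spacing around this value with std::nextafter (a sketch; the exact neighbours assume IEEE-754 binary32):

#include <cmath>
#include <cstdio>

int main()
{
    float a = 760.5e6f;  // stored as the nearest float, 760,499,968
    std::printf("%.1f\n", a);                         // 760499968.0
    std::printf("%.1f\n", std::nextafter(a, 1e12f));  // 760500032.0, next float up
    std::printf("%.1f\n", std::nextafter(a, 0.0f));   // 760499904.0, next float down
}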
A double, which is 64 bits in your case, naturally has more precision than a float, which is 32 bits in your case. Therefore, this is an expected result.
Specifications do not enforce that any type should correctly represent all numbers less than std::numeric_limits<T>::max() with all their precision.
The number you display is off only in the 8th digit and after. That is well within the 6 digits of accuracy you are guaranteed for a float. If you only printed 6 digits, the output would get rounded and you'd see the value you expect.
printf("%0.6g\n", a);
See http://ideone.com/ZiHYuT

Is it possible for "floor" to return an inaccurate result due to floating point rounding error?

I understand that floating point numbers can often include rounding errors.
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
Is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999?
The floor() function returns a floating point value that is an exact integer. So the premise of your question is wrong to begin with.
Now, floor(x) returns the nearest integral value that is not greater than x. It is always true that
floor(x) <= x
and that there exists no integer i, greater than floor(x), such that i <= x.
Looking at floor(3.14159265), this returns 3.0. There's no debate about that. Nothing more to say.
Where it gets interesting is if you write floor(x) where x is the result of an arithmetic expression. Floating point precision and rounding can mean that x falls on the wrong side of an integer. In other words, the true value of the expression that yields x is greater than some integer, i, but that x when evaluated using floating point arithmetic is less than i.
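A classic illustration of that (a sketch, not from the original question; the exact outcome depends on your platform's doubles, but this is the typical IEEE-754 result):

#include <cmath>
#include <iostream>

int main()
{
    // 0.29 is not exactly representable, so 0.29 * 100 evaluates to roughly
    // 28.999999999999996, and floor of that is 28, not 29.
    std::cout << std::floor(0.29 * 100) << '\n';  // 28
}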
Small integers are representable exactly as floats, but big integers are not.
But, as others pointed out, a big integer that float cannot represent exactly is still stored as some nearby integer, never as a non-integer, so floor() will never return a non-integer value. Thus, the cast to (int), as long as it does not overflow, will be correct.
But how small is small? Copying shamelessly from this answer:
For float, it is 16,777,217 (2^24 + 1).
For double, it is 9,007,199,254,740,993 (2^53 + 1).
Note that the usual range of int (32 bits) is ±2^31, so float is unable to represent all of them exactly. Use double if you need that.
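A quick way to see those thresholds in action (a sketch assuming IEEE-754 types):

#include <iostream>

int main()
{
    // 2^24 + 1 cannot be held by a float; the literal rounds to 16,777,216.
    std::cout << (16777217.0f == 16777216.0f) << '\n';  // 1
    // A double still distinguishes them, since 2^24 + 1 is well below 2^53 + 1.
    std::cout << (16777217.0 == 16777216.0) << '\n';    // 0
}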
Interestingly, floats can store a certain range of integers exactly, for example:
1 is stored as mantissa 1 (binary 1) * exponent 2^0
2 is stored as mantissa 1 (binary 1) * exponent 2^1
3 is stored as mantissa 1.5 (binary 1.1) * exponent 2^1
4 is stored as mantissa 1 * exponent 2^2
5 is stored as mantissa 1.25 (binary 1.01) * exponent 2^2
6 is stored as mantissa 1.5 (binary 1.1) * exponent 2^2
7 is stored as mantissa 1.75 (binary 1.11) * exponent 2^2
8 is stored as mantissa 1 (binary 1) * exponent 2^3
9 is stored as mantissa 1.125 (binary 1.001) * exponent 2^3
10 is stored as mantissa 1.25 (binary 1.01) * exponent 2^3
...
As you can see, the way the exponents increase meshes with the perfectly-stored fractional values the mantissa can represent.
You can get a good sense for this by putting number into this great online conversion site.
Once you cross a certain threshold, there aren't enough digits in the mantissa to divide the span of the increased exponents without skipping first every odd integer value, then three out of every four, then 7 out of 8, etc. For numbers over this threshold, the issue is not that they might be different from integer values by some tiny fractional amount; it's that all the representable values are integers, so not only can no fractional part be represented any more, but, as above, some of the integers can't be either.
You can observe this in the calculator by considering:
Binary                                    Decimal
+- Exponent  Mantissa
0  10010110  11111111111111111111111     16777215
0  10010111  00000000000000000000000     16777216
0  10010111  00000000000000000000001     16777218
See how at this stage, the smallest possible increment of the mantissa is actually "worth 2" in terms of the decimal value represented?
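If you prefer code to the online calculator, a small sketch can dump the same fields (memcpy is used to avoid type punning; the bit patterns assume IEEE-754 binary32):

#include <bitset>
#include <cstdint>
#include <cstring>
#include <iostream>

// Print sign, exponent, and mantissa fields of a float in the same layout
// as the table above.
void dump(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::cout << ((bits >> 31) & 1) << ' '
              << std::bitset<8>(bits >> 23) << ' '
              << std::bitset<23>(bits) << "  "
              << std::fixed << f << '\n';
}

int main()
{
    dump(16777215.0f);
    dump(16777216.0f);
    dump(16777218.0f);
}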
When you take the floor or ceiling of a float (or double) in order to convert it to an integer, will the resultant value be exact or can the "floored" value still be an approximation?
It's always exact. What floor is doing is effectively wiping out any '1's in the mantissa whose significance (their contribution to value) is fractional anyway.
Basically, is it possible for something like floor(3.14159265) to return a value which is essentially 2.999999, which would convert to 2 when you try to cast that to an int?
No.

Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 11100000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e., 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
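You can reproduce that reading by reinterpreting the bit pattern directly (a sketch; memcpy sidesteps the type-punning issues discussed later in this answer):

#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    std::uint32_t bits = 0x40000000u;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    std::cout << f << '\n';  // 2 on an IEEE-754 platform
}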
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
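A quick usage check (assuming the uint_to_float function above is in the same translation unit; the results should match the compiler's own conversion for in-range values):

#include <initializer_list>
#include <iostream>

int main()
{
    for (unsigned int v : {1u, 15u, 16777215u})
        std::cout << v << " -> " << uint_to_float(v)
                  << " (built-in conversion gives " << static_cast<float>(v) << ")\n";
}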
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest with ties to even). But the fact of the matter is you can't shove more than 24 bits of value into a 24-bit field without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.