How to add and subtract 16 bit floating point half precision numbers?

How to add and subtract 16 bit floating point half precision numbers? - bit-manipulation

How do I add and subtract 16 bit floating point half precision numbers?
Say I need to add or subtract:
1 10000 0000000000
1 01111 1111100000
2’s complement form.

The OpenEXR library defines a half-precision floating point class. It's C++, but the code for casting between native IEEE754 float and half should be easy to adapt. see: Half/half.h as a start.

Assuming you are using a denormalized representation similar to that of IEEE single/double precision, just compute the sign = (-1)^S, the mantissa as 1.M if E != 0 and 0.M if E == 0, and the exponent = E - 2^(n-1), operate on these natural representations, and convert back to the 16-bit format.
sign1 = -1
mantissa1 = 1.0
exponent1 = 1
sign2 = -1
mantissa2 = 1.11111
exponent2 = 0
sum:
sign = -1
mantissa = 1.111111
exponent = 1
Representation: 1 10000 1111110000
Naturally, this assumes excess encoding of the exponent.

Related

How to calculate the range of data type float in c++?

As we can see int has 4 byte in memory, that are 32bits, after applying range formula , we can see range of int -2147483648 to 2147483647. I have calculated the ranges of all datatypes besides float and double and long double.
I dont know how they calculated the range of float mentioned below.

Floating point numbers are stored as an exponent and a fraction within the space available.
For some systems where float is implemented as an IEEE 754 value, the results would looks as below.
sign : 1 bit
exponent : 8 bits
fraction : 23 bits
The exponent allows numbers from 2 ^ (-127) (2 to the power -127) to 2 ^ 128 ( 2 to the power 128).
Allowing a range of numbers from
5.87747E-39
3.40282E+38
the fraction point gives a fraction such as .12313
Thus with 23 bits of values, the accuracy of a number is about 7 decimal digits or 1.19 E-7
For more details see wikipedia : IEEE 754-1985
On a given system, the <cfloat> / <float.h> will give the limits. For non IEEE 754 based representations, you would have to understand how the numbers are stored to calculate the limits.

-2^(n-1) to (2^(n-1)-1) is the formula to calculate the range of data types.
Where n = no.of.bits of the primitive data type.
For example: for the byte data type, n = 8 bits
-2^(8-1) to (2^(8-1)-1)
The above calculation will give you -128 to 127. Now, coming to the question of why it’s not 255. The reason is that byte, int, short, and double are signed data types meaning it has half the range below 0 (negative) and half the range above 0 (positive). The first bit represents a sign (+ or -). The remaining bits are 7. That’s why 2^(8-1) = 128. We take 0 as a positive sign, so the range is 2^(8-1) - 1 for positive numbers.

How to check IEEE754 overflow range

consider multipling 2 floating point numbers a and b
Let say expA=127, expB = 10
multplying significands give 1.101 x 2^137
so 137 >= 127 >= -126 so there is overflow
if -126 >= resultExp so there is underflow
here I am not sure how to get 127 and -126
127 (because 01111111?)
-126 (because 10000010?) why not 10000000
what if exp is 5 bit? 01111 >= x >= 10010 ?

The exponent section in floating point, at least in IEEE 754, is not represented in 2's complement as you mentioned, it is represented as follows:
Let us assume that the exponent section is of width n:
All zeroes, 000...000, i.e. Zero, represent denormals as well as +ve or -ve Zeroes in floating point.
The next representation, 000...001, i.e. One represents the least possible exponent, i.e. -126 in single-precision, that means that 0000...010 represents -125, 000...011 represents -124 and son on.
111...110 represents the maximum exponent, i.e. 127 in single-precision.
111...111 represents overflow and NaN representations.

see this: IEEE-754 specs
so you know what bits to extract
this C++ code should work
int e;
float f=1.5e+34;
unsigned int u;
u=((unsigned int*)((void*)(&f)))[0];
e=(u>>23)&&255;
if (e==255); // +/-Inf,NaN...
else if (e==0) e-=128; // denormalized
else e-=127; // normalized
in e is exponent of float f
also do not forget that the mantisa's multiplication can lower the result's exponent by 1
and for denormalized numbers even more
So the solution for a*b=c is:
set a0 as a but set its exponent to 0
set b0 as b but set its exponent to 0
compute c0=a0*b0
compute the result exponent
add the c0's exponent change to it
now you can test for over/underflow safely

Decimal to IEEE Single Precision Floating Point

I'm interested in learning how to convert an integer value into IEEE single precision floating point format using bitwise operators only. However, I'm confused as to what can be done to know how many logical shifts left are needed when calculating for the exponent.
Given an int, say 15, we have:
Binary: 1111
-> 1.111 x 2^3 => After placing a decimal point after the first bit, we find that the 'e' value will be three.
E = Exp - Bias
Therefore, Exp = 130 = 10000010
And the significand will be: 111000000000000000000000
However, I knew that the 'e' value would be three because I was able to see that there are three bits after placing the decimal after the first bit. Is there a more generic way to code for this as a general case?
Again, this is for an int to float conversion, assuming that the integer is non-negative, non-zero, and is not larger than the max space allowed for the mantissa.
Also, could someone explain why rounding is needed for values greater than 23 bits?
Thanks in advance!

First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 224. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 21)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 21 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
// Only support 0 < significand < 1 << 24.
if (significand == 0 || significand >= 1 << 24)
return -1.0; // or abort(); or whatever you'd like here.
int shifts = 0;
// Align the leading 1 of the significand to the hidden-1
// position. Count the number of shifts required.
while ((significand & (1 << 23)) == 0)
{
significand <<= 1;
shifts++;
}
// The number 1.0 has an exponent of 0, and would need to be
// shifted left 23 times. The number 2.0, however, has an
// exponent of 1 and needs to be shifted left only 22 times.
// Therefore, the exponent should be (23 - shifts). IEEE-754
// format requires a bias of 127, though, so the exponent field
// is given by the following expression:
unsigned int exponent = 127 + 23 - shifts;
// Now merge significand and exponent. Be sure to strip away
// the hidden 1 in the significand.
unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);
// Reinterpret as a float and return. This is an evil hack.
return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 224, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 224, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.

What is a subnormal floating point number?

The isnormal() reference page says:
Determines if the given floating point number arg is normal, i.e. is
neither zero, subnormal, infinite, nor NaN.
It's clear what a number being zero, infinite or NaN means. But it also says subnormal. When is a number subnormal?

IEEE 754 basics
First let's review the basics of IEEE 754 numbers are organized.
We'll focus on single precision (32-bit), but everything can be immediately generalized to other precisions.
The format is:
1 bit: sign
8 bits: exponent
23 bits: fraction
Or if you like pictures:
Source.
The sign is simple: 0 is positive, and 1 is negative, end of story.
The exponent is 8 bits long, and so it ranges from 0 to 255.
The exponent is called biased because it has an offset of -127, e.g.:
0 == special case: zero or subnormal, explained below
1 == 2 ^ -126
...
125 == 2 ^ -2
126 == 2 ^ -1
127 == 2 ^ 0
128 == 2 ^ 1
129 == 2 ^ 2
...
254 == 2 ^ 127
255 == special case: infinity and NaN
The leading bit convention
(What follows is a fictitious hypothetical narrative, not based on any actual historical research.)
While designing IEEE 754, engineers noticed that all numbers, except 0.0, have a one 1 in binary as the first digit. E.g.:
25.0 == (binary) 11001 == 1.1001 * 2^4
0.625 == (binary) 0.101 == 1.01 * 2^-1
both start with that annoying 1. part.
Therefore, it would be wasteful to let that digit take up one precision bit almost every single number.
For this reason, they created the "leading bit convention":
always assume that the number starts with one
But then how to deal with 0.0? Well, they decided to create an exception:
if the exponent is 0
and the fraction is 0
then the number represents plus or minus 0.0
so that the bytes 00 00 00 00 also represent 0.0, which looks good.
If we only considered these rules, then the smallest non-zero number that can be represented would be:
exponent: 0
fraction: 1
which looks something like this in a hex fraction due to the leading bit convention:
1.000002 * 2 ^ (-127)
where .000002 is 22 zeroes with a 1 at the end.
We cannot take fraction = 0, otherwise that number would be 0.0.
But then the engineers, who also had a keen aesthetic sense, thought: isn't that ugly? That we jump from straight 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow? (OK, it was a bit more concerning than "ugly": it was actually people getting bad results for their computations, see "How subnormals improve computations" below).
Subnormal numbers
The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:
If the exponent is 0, then:
the leading bit becomes 0
the exponent is fixed to -126 (not -127 as if we didn't have this exception)
Such numbers are called subnormal numbers (or denormal numbers which is synonym).
This rule immediately implies that the number such that:
exponent: 0
fraction: 0
is still 0.0, which is kind of elegant as it means one less rule to keep track of.
So 0.0 is actually a subnormal number according to our definition!
With this new rule then, the smallest non-subnormal number is:
exponent: 1 (0 would be subnormal)
fraction: 0
which represents:
1.0 * 2 ^ (-126)
Then, the largest subnormal number is:
exponent: 0
fraction: 0x7FFFFF (23 bits 1)
which equals:
0.FFFFFE * 2 ^ (-126)
where .FFFFFE is once again 23 bits one to the right of the dot.
This is pretty close to the smallest non-subnormal number, which sounds sane.
And the smallest non-zero subnormal number is:
exponent: 0
fraction: 1
which equals:
0.000002 * 2 ^ (-126)
which also looks pretty close to 0.0!
Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever it is that they did in the 70s instead.
As you can see, subnormal numbers do a trade-off between precision and representation length.
As the most extreme example, the smallest non-zero subnormal:
0.000002 * 2 ^ (-126)
has essentially a precision of a single bit instead of 32-bits. For example, if we divide it by two:
0.000002 * 2 ^ (-126) / 2
we actually reach 0.0 exactly!
Visualization
It is always a good idea to have a geometric intuition about what we learn, so here goes.
If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:
+---+-------+---------------+-------------------------------+
exponent |126| 127 | 128 | 129 |
+---+-------+---------------+-------------------------------+
| | | | |
v v v v v
-------------------------------------------------------------
floats ***** * * * * * * * * * * * *
-------------------------------------------------------------
^ ^ ^ ^ ^
| | | | |
0.5 1.0 2.0 4.0 8.0
From that we can see that:
for each exponent, there is no overlap between the represented numbers
for each exponent, we have the same number 2^23 of floating point numbers (here represented by 4 *)
within each exponent, points are equally spaced
larger exponents cover larger ranges, but with points more spread out
Now, let's bring that down all the way to exponent 0.
Without subnormals, it would hypothetically look like this:
+---+---+-------+---------------+-------------------------------+
exponent | ? | 0 | 1 | 2 | 3 |
+---+---+-------+---------------+-------------------------------+
| | | | | |
v v v v v v
-----------------------------------------------------------------
floats * **** * * * * * * * * * * * *
-----------------------------------------------------------------
^ ^ ^ ^ ^ ^
| | | | | |
0 | 2^-126 2^-125 2^-124 2^-123
|
2^-127
With subnormals, it looks like this:
+-------+-------+---------------+-------------------------------+
exponent | 0 | 1 | 2 | 3 |
+-------+-------+---------------+-------------------------------+
| | | | |
v v v v v
-----------------------------------------------------------------
floats * * * * * * * * * * * * * * * * *
-----------------------------------------------------------------
^ ^ ^ ^ ^ ^
| | | | | |
0 | 2^-126 2^-125 2^-124 2^-123
|
2^-127
By comparing the two graphs, we see that:
subnormals double the length of range of exponent 0, from [2^-127, 2^-126) to [0, 2^-126)
The space between floats in subnormal range is the same as for [0, 2^-126).
the range [2^-127, 2^-126) has half the number of points that it would have without subnormals.
Half of those points go to fill the other half of the range.
the range [0, 2^-127) has some points with subnormals, but none without.
This lack of points in [0, 2^-127) is not very elegant, and is the main reason for subnormals to exist!
since the points are equally spaced:
the range [2^-128, 2^-127) has half the points than [2^-127, 2^-126)
-[2^-129, 2^-128) has half the points than [2^-128, 2^-127)
and so on
This is what we mean when saying that subnormals are a tradeoff between size and precision.
Runnable C example
Now let's play with some actual code to verify our theory.
In almost all current and desktop machines, C float represents single precision IEEE 754 floating point numbers.
This is in particular the case for my Ubuntu 18.04 amd64 Lenovo P51 laptop.
With that assumption, all assertions pass on the following program:
subnormal.c
#if __STDC_VERSION__ < 201112L
#error C11 required
#endif
#ifndef __STDC_IEC_559__
#error IEEE 754 not implemented
#endif
#include <assert.h>
#include <float.h> /* FLT_HAS_SUBNORM */
#include <inttypes.h>
#include <math.h> /* isnormal */
#include <stdlib.h>
#include <stdio.h>
#if FLT_HAS_SUBNORM != 1
#error float does not have subnormal numbers
#endif
typedef struct {
uint32_t sign, exponent, fraction;
} Float32;
Float32 float32_from_float(float f) {
uint32_t bytes;
Float32 float32;
bytes = *(uint32_t*)&f;
float32.fraction = bytes & 0x007FFFFF;
bytes >>= 23;
float32.exponent = bytes & 0x000000FF;
bytes >>= 8;
float32.sign = bytes & 0x000000001;
bytes >>= 1;
return float32;
}
float float_from_bytes(
uint32_t sign,
uint32_t exponent,
uint32_t fraction
) {
uint32_t bytes;
bytes = 0;
bytes |= sign;
bytes <<= 8;
bytes |= exponent;
bytes <<= 23;
bytes |= fraction;
return *(float*)&bytes;
}
int float32_equal(
float f,
uint32_t sign,
uint32_t exponent,
uint32_t fraction
) {
Float32 float32;
float32 = float32_from_float(f);
return
(float32.sign == sign) &&
(float32.exponent == exponent) &&
(float32.fraction == fraction)
;
}
void float32_print(float f) {
Float32 float32 = float32_from_float(f);
printf(
"%" PRIu32 " %" PRIu32 " %" PRIu32 "\n",
float32.sign, float32.exponent, float32.fraction
);
}
int main(void) {
/* Basic examples. */
assert(float32_equal(0.5f, 0, 126, 0));
assert(float32_equal(1.0f, 0, 127, 0));
assert(float32_equal(2.0f, 0, 128, 0));
assert(isnormal(0.5f));
assert(isnormal(1.0f));
assert(isnormal(2.0f));
/* Quick review of C hex floating point literals. */
assert(0.5f == 0x1.0p-1f);
assert(1.0f == 0x1.0p0f);
assert(2.0f == 0x1.0p1f);
/* Sign bit. */
assert(float32_equal(-0.5f, 1, 126, 0));
assert(float32_equal(-1.0f, 1, 127, 0));
assert(float32_equal(-2.0f, 1, 128, 0));
assert(isnormal(-0.5f));
assert(isnormal(-1.0f));
assert(isnormal(-2.0f));
/* The special case of 0.0 and -0.0. */
assert(float32_equal( 0.0f, 0, 0, 0));
assert(float32_equal(-0.0f, 1, 0, 0));
assert(!isnormal( 0.0f));
assert(!isnormal(-0.0f));
assert(0.0f == -0.0f);
/* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
assert(FLT_MIN == 0x1.0p-126f);
assert(float32_equal(FLT_MIN, 0, 1, 0));
assert(isnormal(FLT_MIN));
/* The largest subnormal number. */
float largest_subnormal = float_from_bytes(0, 0, 0x7FFFFF);
assert(largest_subnormal == 0x0.FFFFFEp-126f);
assert(largest_subnormal < FLT_MIN);
assert(!isnormal(largest_subnormal));
/* The smallest non-zero subnormal number. */
float smallest_subnormal = float_from_bytes(0, 0, 1);
assert(smallest_subnormal == 0x0.000002p-126f);
assert(0.0f < smallest_subnormal);
assert(!isnormal(smallest_subnormal));
return EXIT_SUCCESS;
}
GitHub upstream.
Compile and run with:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out
C++
In addition to exposing all of C's APIs, C++ also exposes some extra subnormal related functionality that is not as readily available in C in <limits>, e.g.:
denorm_min: Returns the minimum positive subnormal value of the type T
In C++ the whole API is templated for each floating point type, and is much nicer.
Implementations
x86_64 and ARMv8 implemens IEEE 754 directly on hardware, which the C code translates to.
Subnormals seem to be less fast than normals in certain implementations: Why does changing 0.1f to 0 slow down performance by 10x? This is mentioned in the ARM manual, see the "ARMv8 details" section of this answer.
ARMv8 details
ARM Architecture Reference Manual ARMv8 DDI 0487C.a manual A1.5.4 "Flush-to-zero" describes a configurable mode where subnormals are rounded to zero to improve performance:
The performance of floating-point processing can be reduced when doing calculations involving denormalized numbers and Underflow exceptions. In many algorithms, this performance can be recovered, without significantly affecting the accuracy of the final result, by replacing the denormalized operands and intermediate results with zeros. To permit this optimization, ARM floating-point implementations allow a Flush-to-zero mode to be used for different floating-point formats as follows:
For AArch64:
If FPCR.FZ==1, then Flush-to-Zero mode is used for all Single-Precision and Double-Precision inputs and outputs of all instructions.
If FPCR.FZ16==1, then Flush-to-Zero mode is used for all Half-Precision inputs and outputs of floating-point instructions, other than:—Conversions between Half-Precision and Single-Precision numbers.—Conversions between Half-Precision and Double-Precision numbers.
A1.5.2 "Floating-point standards, and terminology" Table A1-3 "Floating-point terminology" confirms that subnormals and denormals are synonyms:
This manual IEEE 754-2008
------------------------- -------------
[...]
Denormal, or denormalized Subnormal
C5.2.7 "FPCR, Floating-point Control Register" describes how ARMv8 can optionally raise exceptions or set a flag bits whenever the input of a floating point operation is subnormal:
FPCR.IDE, bit [15] Input Denormal floating-point exception trap enable. Possible values are:
0b0 Untrapped exception handling selected. If the floating-point exception occurs then the FPSR.IDC bit is set to 1.
0b1 Trapped exception handling selected. If the floating-point exception occurs, the PE does not update the FPSR.IDC bit. The trap handling software can decide whether to set the FPSR.IDC bit to 1.
D12.2.88 "MVFR1_EL1, AArch32 Media and VFP Feature Register 1" shows that denormal support is completely optional in fact, and offers a bit to detect if there is support:
FPFtZ, bits [3:0]
Flush to Zero mode. Indicates whether the floating-point implementation provides support only for the Flush-to-Zero mode of operation. Defined values are:
0b0000 Not implemented, or hardware supports only the Flush-to-Zero mode of operation.
0b0001 Hardware supports full denormalized number arithmetic.
All other values are reserved.
In ARMv8-A, the permitted values are 0b0000 and 0b0001.
This suggests that when subnormals are not implemented, implementations just revert to flush-to-zero.
Infinity and NaN
Curious? I've written some things at:
infinity: Ranges of floating point datatype in C?
NaN: What is the difference between quiet NaN and signaling NaN?
How subnormals improve computations
According to the Oracle (formerly Sun) Numerical Computation Guide
[S]ubnormal numbers eliminate underflow as a cause for concern for a variety of computations (typically, multiply followed by add). ... The class of problems that succeed in the presence of gradual underflow, but fail with Store 0, is larger than the fans of Store 0 may realize. ... In the absence of gradual underflow, user programs need to be sensitive to the implicit inaccuracy threshold. For example, in single precision, if underflow occurs in some parts of a calculation, and Store 0 is used to replace underflowed results with 0, then accuracy can be guaranteed only to around 10-31, not 10-38, the usual lower range for single-precision exponents.
The Numerical Computation Guide refers the reader to two other papers:
Underflow and the Reliability of Numerical Software by James Demmel
Combatting the Effects of Underflow and Overflow in Determining Real Roots of Polynomials by S. Linnainmaa
Thanks to Willis Blackburn for contributing to this section of the answer.
Actual history
An Interview with the Old Man of Floating-Point by Charles Severance (1998) is a short real world historical overview in the form of an interview with William Kahan and was suggested by John Coleman in the comments.

In the IEEE754 standard, floating point numbers are represented as binary scientific notation, x = M × 2e. Here M is the mantissa and e is the exponent. Mathematically, you can always choose the exponent so that 1 ≤ M < 2.* However, since in the computer representation the exponent can only have a finite range, there are some numbers which are bigger than zero, but smaller than 1.0 × 2emin. Those numbers are the subnormals or denormals.
Practically, the mantissa is stored without the leading 1, since there is always a leading 1, except for subnormal numbers (and zero). Thus the interpretation is that if the exponent is non-minimal, there is an implicit leading 1, and if the exponent is minimal, there isn't, and the number is subnormal.
*) More generally, 1 ≤ M < B for any base-B scientific notation.

From http://blogs.oracle.com/d/entry/subnormal_numbers:
There are potentially multiple ways of representing the same number,
using decimal as an example, the number 0.1 could be represented as
1*10-1 or 0.1*100 or even 0.01 * 10. The standard dictates that the
numbers are always stored with the first bit as a one. In decimal that
corresponds to the 1*10-1 example.
Now suppose that the lowest exponent that can be represented is -100.
So the smallest number that can be represented in normal form is
1*10-100. However, if we relax the constraint that the leading bit be
a one, then we can actually represent smaller numbers in the same
space. Taking a decimal example we could represent 0.1*10-100. This
is called a subnormal number. The purpose of having subnormal numbers
is to smooth the gap between the smallest normal number and zero.
It is very important to realise that subnormal numbers are represented
with less precision than normal numbers. In fact, they are trading
reduced precision for their smaller size. Hence calculations that use
subnormal numbers are not going to have the same precision as
calculations on normal numbers. So an application which does
significant computation on subnormal numbers is probably worth
investigating to see if rescaling (i.e. multiplying the numbers by
some scaling factor) would yield fewer subnormals, and more accurate
results.

Convert Between Floating Point Standards

I am trying to convert an IEEE based floating point number to a MIL-STD 1750A floating point number.
I have attached the specification for both:
I understand how to decompose the floating point 12.375 in IEEE format as per the example on wikipedia.
However, I'm not sure if my interpretation of the MIL-STD is correct.
12.375 = (12)b10 + (0.375)b10 = (1100)b2 + (0.011)b2 = (1100.011)b2
(1100.011)b2 = 0.1100011 x 2^4 => Exponent, E = 4.
4 in normalised 2's complement is = (100)b2 = Exponent
Therefore a MIL-STD 1750A 32 bit floating point number is:
S=0, F=11000110000000000000000, E=00000100
Is my above interpretation correct?
For -12.375, is it just the sign bit that swaps? i.e.:
S=1, F=11000110000000000000000, E=00000100
Or does something funky happen with the fraction part?

The diagram above is a bit misleading, I think. In IEEE format, to switch from positive to negative, you simply flip the first bit. The remaining three bits can be treated as an unsigned number. In the MIL-STD format, the mantissa is a two's complement number, so while the first bit does indicate the sign, the remaining 23 bits do not remain the same.
What I get is
S=1, F=00111010000000000000000, E=00000100

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to add and subtract 16 bit floating point half precision numbers? - bit-manipulation

How do I add and subtract 16 bit floating point half precision numbers? Say I need to add or subtract: 1 10000 0000000000 1 01111 1111100000 2’s complement form.

The OpenEXR library defines a half-precision floating point class. It's C++, but the code for casting between native IEEE754 float and half should be easy to adapt. see: Half/half.h as a start.

Related

How to calculate the range of data type float in c++?

How to check IEEE754 overflow range

Decimal to IEEE Single Precision Floating Point

What is a subnormal floating point number?

Convert Between Floating Point Standards

Categories

Resources