The idea of a fixed-point number is that we divide a certain number of bits between the integer part and the fractional part. This split is fixed.
For example, 26.5 (binary 11010.1) is stored with the integer bits on the left and the fractional bits on the right.
To convert from floating-point to fixed-point, we follow this algorithm:
Calculate x = floating_input * 2^(fractional_bits)
27.3 * 2^10 = 27955.2
Round x to the nearest whole number (e.g. round(x))
27955
Store the rounded x in an integer container
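For concreteness, here is a minimal sketch of those three steps, assuming the result fits in a 32-bit integer (the function name is illustrative):

#include <cmath>
#include <cstdint>
#include <iostream>

int32_t to_fixed(double input, int fractional_bits) {
    double x = input * std::pow(2.0, fractional_bits); // step 1: scale
    return static_cast<int32_t>(std::lround(x));       // steps 2-3: round and store
}

int main() {
    std::cout << to_fixed(27.3, 10) << std::endl; // 27955
}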
Now, if we look at the bit representation of our numbers and at what multiplying by 2^(fractional_bits) does, we see:
27 is 11011
27*2^10 is 110 1100 0000 0000, which is 11011 shifted 10 bits to the left.
So we can say that multiplying by 2^10 gives us "space" in the low-order bits to hold the fractional part of the number. Two numbers converted this way can be operated on together and eventually converted back to the familiar form with a point by the opposite operation, dividing by 2^10.
If we recall that the bits are stored in an integer variable, which in turn has its own fixed width, it becomes clear that the more bits of that variable we devote to the fractional part, the fewer bits remain for the integer part.
27.3 * 2^10 = 27955.2, which is rounded for storage in an integer type to
27955, which is 110 1101 0011 0011
After that, the number can be altered somehow (the exact operations are not important now); let's say we want to retrieve back the human-readable value:
27955/2^10 = 27.2998046875
What about the number of bits after the point?
Let's say we have two numbers that we intend to multiply, and we chose 10 bits after the point:
27 * 3.3 = 89.1 expected
27*2^10 = 27 648 is 110 1100 0000 0000
3.3*2^10 = 3 379 (after rounding), which is 1101 0011 0011
27 648 * 3 379 = 93 422 592
consequently
27*3.3 = 93 422 592/(2^10*2^10) ≈ 89.09, which is pretty accurate
Now let's take 1 bit after the point, with the same 27 and 3.3:
27*2^1 = 54, which is 110110
3.3*2^1 = 6.6, which truncates to 6, i.e. 110
54 * 6 = 324
consequently
27*3.3 = 324/(2^1*2^1) = 81, which is unsatisfying
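To double-check the two worked examples, a small program along these lines (the variable names are illustrative) computes the same product with 10 fraction bits and with only 1 fraction bit:

#include <iostream>

int main() {
    long a10 = 27L << 10;                                     // 27 648
    long b10 = static_cast<long>(3.3 * (1 << 10));            // 3 379
    std::cout << double(a10 * b10) / (1L << 20) << std::endl; // ~89.09

    long a1 = 27L << 1;                                       // 54
    long b1 = static_cast<long>(3.3 * (1 << 1));              // 6 (6.6 truncated)
    std::cout << double(a1 * b1) / (1L << 2) << std::endl;    // 81
}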
In practice we can use the following code to create and operate on fixed-point numbers:
#include <iostream>
using namespace std;
const int scale = 10;
#define DoubleToFixed(x) ((x) * (double)(1 << scale))
#define FixedToDouble(x) ((double)(x) / (double)(1 << scale))
#define IntToFixed(x) ((x) << scale)
#define FixedToInt(x) ((x) >> scale)
#define MUL(x,y) (((x) * (y)) >> scale)
#define DIV(x,y) (((x) << scale) / (y))
int main()
{
    double a = 7.27;
    double b = 3.0;

    int f = DoubleToFixed(a);
    cout << f << endl;                // 7444
    cout << FixedToDouble(f) << endl; // 7.26953125

    int g = DoubleToFixed(b);
    cout << g << endl;                // 3072

    int c = MUL(f, g);
    cout << FixedToDouble(c) << endl; // 21.80859375
}
So where is the connection between the theory of a fixed placement of the point between bits (powers of 2) and the practical implementation? If we store a fixed-point number in an int, it is obvious that there is no place to store the point in it.
It seems that fixed-point numbers are just a conversion to increase performance, and to retrieve a human-readable number after the calculations, the opposite conversion must be present.
I hope I understand the algorithm. But is the idea of placing the point between digits just an abstract idea?
Fixed-point formats are used as a way to represent fractional numbers. Quite commonly, processors perform fixed-point or integer arithmetic faster or more efficiently than floating-point arithmetic. Whether fixed-point arithmetic is suitable for an application depends on what numbers the application needs to work with.
Using fixed-point formats does require converting input to the fixed-point format and converting numbers in the fixed-point format to output. But this is also true of integers and floating-point: All input must be converted to whatever internal format is used to represent it, and all output must be produced by converting from internal formats.
And how does multiplying by 2^(fractional_bits) affect the quantity of digits after the point?
Suppose we have some number x that is represented as an integer X = x·2^f, where f is the number of fraction bits. Conceptually X is in a fixed-point format. Similarly, we have y represented as Y = y·2^f.
If we execute an integer multiplication instruction to produce the result Z = X·Y, then Z = X·Y = (x·2^f)·(y·2^f) = xy·2^(2f). Then, if we divide Z by 2^f (or, nearly equivalently, shift it right by f bits), we have xy·2^f except for any rounding errors that may have occurred in the division. And xy·2^f is the fixed-point representation of the product of x and y.
Thus, we can effect a fixed-point multiplication by performing an integer multiplication followed by a shift.
Often, to get rounding instead of truncation, a value of half of 2^f is added before the shift, so we compute floor((X·Y + 2^(f−1)) / 2^f):
Multiply X by Y.
Add 2^(f−1).
Shift right f bits.
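As a concrete sketch of that multiply-add-shift sequence, assuming f = 10 fraction bits and a 64-bit intermediate to avoid overflow (the function name is illustrative):

#include <cstdint>
#include <iostream>

constexpr int f = 10; // number of fraction bits

int32_t fixed_mul(int32_t X, int32_t Y) {
    int64_t Z = static_cast<int64_t>(X) * Y; // Z = xy * 2^(2f)
    Z += int64_t{1} << (f - 1);              // add 2^(f-1) so the shift rounds
    return static_cast<int32_t>(Z >> f);     // back to xy * 2^f
}

int main() {
    int32_t X = static_cast<int32_t>(27.0 * (1 << f)); // 27 in fixed point
    int32_t Y = static_cast<int32_t>(3.3 * (1 << f));  // ~3.3 in fixed point
    std::cout << fixed_mul(X, Y) / double(1 << f) << std::endl; // ~89.1
}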
It seems that fixed-point numbers are just a conversion to increase performance.
You might as well say that floating-point numbers are a conversion to increase the representable range.
Whatever format your numbers are originally coming in as (strings, voltage levels, integers, etc.), you often convert them to floating point numbers in order to store or operate on them, but neither floating point nor fixed point is a human-readable representation.
Floating point numbers have lower precision and a wider magnitude range; fixed point numbers have higher precision and a narrower magnitude range. (Performance differences depend on the architecture and the important operations.) You shouldn't think of the fixed-point representation as a conversion from floating point, but as an alternative to floating point.
I think you want a class that wraps an int along with the fixed radix point information. Indeed, the use is implicit, but you then define your own multiplication (for example) that works on the fixed point meaning as a whole rather than just multiplying the underlying ints.
You don't want to leave the meaning implicit ... make it known to the compiler in a strong way. You should not have to explicitly call your handling functions; make them part of the class semantics.
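A minimal sketch of such a wrapper, assuming 10 fraction bits (the class name and members are illustrative, not a standard API):

#include <cstdint>
#include <iostream>

class Fixed {
    static constexpr int scale = 10;            // fraction bits
    int32_t raw;                                // the wrapped int
    explicit Fixed(int32_t r, bool) : raw(r) {} // construct from raw bits
public:
    Fixed(double d) : raw(static_cast<int32_t>(d * (1 << scale))) {}
    double toDouble() const { return raw / double(1 << scale); }
    // Multiplication defined on the fixed-point meaning as a whole.
    Fixed operator*(Fixed o) const {
        return Fixed(static_cast<int32_t>(
            (static_cast<int64_t>(raw) * o.raw) >> scale), true);
    }
};

int main() {
    Fixed a(7.27), b(3.0);
    std::cout << (a * b).toDouble() << std::endl; // ~21.81
}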
static_casting from a floating point type to an integer simply strips the fractional part of the number. For example, static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M·b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000−x to be represented, where x is some positive value less than 1, e must be negative (because M·b^e for a non-negative e is an integer). If so, then M·b^0 is an integer larger than M·b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'·b^0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609 and 8,388,610. Since they are equally distant, the tie went to the even value, and the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609, which is a horrifyingly small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
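A quick way to see that boundary, assuming an IEEE-754 float, is to round-trip the integers around 2^24:

#include <iostream>

int main() {
    float a = 16777215.0f; // 2^24 - 1: representable
    float b = 16777216.0f; // 2^24:     representable
    float c = 16777217.0f; // 2^24 + 1: rounds to 2^24
    std::cout << (long long)a << ' ' << (long long)b << ' '
              << (long long)c << std::endl; // 16777215 16777216 16777216
}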
Floating point numbers are represented by 3 integers, c·b^q, where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about range. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
C++ provides digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example, a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0, which could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6. This is a little confusing; it's because of the bias introduced here. Logically q = −6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 would represent 1.3
b = 10. Again, the above rules are really only required for base 2, but I've shown them as they would apply to base 10 for the purpose of explanation
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits, continuing to increase it only introduces trailing zeros to the resulting number, so calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL has no effect. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << (numeric_limits<T>::digits - 1LL)) - 1LL. Thus, for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
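One can probe this boundary directly, assuming an IEEE-754 float: below 2^23 consecutive floats are still 0.5 apart, so fractional parts exist and ceil can act; from 2^23 upward every float is already an integer:

#include <cmath>
#include <iostream>

int main() {
    std::cout << (long)std::ceil(8388607.5f) << std::endl; // 8388608
    std::cout << (long)std::ceil(8388608.5f) << std::endl; // 8388608: the literal
                                                           // itself rounds to 8388608
}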
According to The C++ Programming Language - 4th, section 6.2.5:
There are three floating-point types: float (single-precision), double (double-precision), and long double (extended-precision)
Refer to: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to log10(224) ≈ 7.225 decimal digits).
→ So the maximum number of decimal digits for a floating point number in the binary32 interchange format (a computer number format that occupies 4 bytes (32 bits) in computer memory) is 7.
When I test on different compilers (like GCC, VC compiler)
→ It always outputs 6 as the value.
Take a look into float.h of each compiler
→ I found that 6 is fixed.
Question:
Do you know why there is a difference here (between the theoretical value, 7, and the actual value, 6)?
It sounds like 7 is more reasonable, because when I test using the code below, the value is still valid, while 8 is invalid.
Why don't the compilers check the interchange format when deciding the number of digits represented in floating-point, instead of using a fixed value?
Code:
#include <iostream>
#include <limits>
using namespace std;
int main()
{
    cout << numeric_limits<float>::digits10 << endl;
    float f = -9999999;
    cout.precision(10);
    cout << f << endl;
}
You're not reading the documentation.
std::numeric_limits<float>::digits10 is 6:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
std::numeric_limits<float>::max_digits10 is 9:
The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T, such as necessary for serialization/deserialization to text. This constant is meaningful for all floating-point types.
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
std::numeric_limits<float>::digits10 equates to FLT_DIG, which is defined by the C standard:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
p × log10(b) if b is a power of 10
⌊(p − 1) × log10(b)⌋ otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
The reason for the value 6 (and not 7), is due to rounding errors - not all floating point values with 7 decimal digits can be losslessly represented by a 32-bit float. Rounding errors are limited to 1 bit though, so the FLT_DIG value was calculated based on 23 bits (instead of the full 24) :
23 * log10(2) = 6.92
which is rounded down to 6.
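These values can be checked directly; the outputs shown assume IEEE-754 binary32 and binary64:

#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<float>::digits10 << std::endl;      // 6
    std::cout << std::numeric_limits<float>::max_digits10 << std::endl;  // 9
    std::cout << std::numeric_limits<double>::digits10 << std::endl;     // 15
    std::cout << std::numeric_limits<double>::max_digits10 << std::endl; // 17
}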
With the code below, I get the result "4.31 43099".
double f = atof("4.31");
long ff = f * 10000L;
std::cout << f << ' ' << ff << '\n';
If I change "double f" to "float f". I get expected result "4.31 43100". I am not sure if changing "double" to "float" is a good solution. Is there any good solution to assure I get "43100"?
You're not going to be able to eliminate the errors in floating point arithmetic (though with proper analysis you can calculate the error). For casual usage, one thing you can do to get more intuitive results is to replace the built-in float-to-integral conversion (which truncates) with normal rounding:
double f = atof("4.31");
long ff = std::round(f * 10000L);
std::cout << f << ' ' << ff << '\n';
This should output what you expect: 4.31 43100
Also, there's no point in using 10000L: no matter what kind of integral type you use, it still gets converted to f's floating point type for the multiplication. Just use std::round(f * 10000.0);
The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.
In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:
long ff = std::nextafter(f, 1.1*f) * 10000L;
If you don't have nextafter you can approximate it with numeric_limits.
long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;
I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:
long ff = (f * 1.0000001) * 10000L;
With standard C types, I doubt it.
There are many values that cannot be represented in those bits; they actually demand more space to be stored, so the floating-point processor just uses the closest possible value.
Floating point numbers cannot store all the values you think they could; there is only a limited number of bits, and you can't put more than 4 billion different values into 32 bits. That's just the first restriction.
Floating point values (in C) are represented as: sign, one sign bit; power, bits which define the power of two for the number; significand, the bits that actually make up the number.
Your actual number is sign × significand × 2^(power − bias).
A double is 1 bit of sign, 11 bits of power (biased so it is stored as a positive value, but that is not the point) and 52 explicitly stored bits for the significand.
That is a lot, but not enough to represent all values, especially those that cannot be expressed as a finite sum of powers of two, like binary 1010.101101(101). For example, it cannot precisely represent values like 1/3 = 0.333333(3). That's the second restriction.
Try reading the following; a decent understanding of the advantages and disadvantages of floating point arithmetic can be very handy:
http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf
There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. It turns out that the nearest representable single-precision number is a little more than 4.31, while the nearest representable double-precision number is a little less than 4.31. When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).
So if f is single-precision, f * 10000L is greater than 43100, so it is rounded down to 43100. And if f is double-precision, f * 10000L is less than 43100, so it is rounded down to 43099.
The comment by n.m. suggests f * 10000L + 0.5, which I think is the best solution.
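A small demonstration, assuming IEEE-754 types, of why the two types land on opposite sides of 4.31 and hence truncate differently:

#include <cstdio>

int main() {
    float  fs = 4.31f; // nearest float is slightly above 4.31
    double fd = 4.31;  // nearest double is slightly below 4.31
    std::printf("%.10f %ld\n", fs, (long)(fs * 10000L)); // 43100
    std::printf("%.10f %ld\n", fd, (long)(fd * 10000L)); // 43099
}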
The isnormal() reference page says:
Determines if the given floating point number arg is normal, i.e. is neither zero, subnormal, infinite, nor NaN.
It's clear what a number being zero, infinite or NaN means. But it also says subnormal. When is a number subnormal?
IEEE 754 basics
First let's review how IEEE 754 numbers are organized.
We'll focus on single precision (32-bit), but everything can be immediately generalized to other precisions.
The format is:
1 bit: sign
8 bits: exponent
23 bits: fraction
The sign is simple: 0 is positive, and 1 is negative, end of story.
The exponent is 8 bits long, and so it ranges from 0 to 255.
The exponent is called biased because it has an offset of -127, e.g.:
0 == special case: zero or subnormal, explained below
1 == 2 ^ -126
...
125 == 2 ^ -2
126 == 2 ^ -1
127 == 2 ^ 0
128 == 2 ^ 1
129 == 2 ^ 2
...
254 == 2 ^ 127
255 == special case: infinity and NaN
The leading bit convention
(What follows is a fictitious hypothetical narrative, not based on any actual historical research.)
While designing IEEE 754, engineers noticed that all numbers, except 0.0, have a 1 in binary as the first digit. E.g.:
25.0 == (binary) 11001 == 1.1001 * 2^4
0.625 == (binary) 0.101 == 1.01 * 2^-1
both start with that annoying 1. part.
Therefore, it would be wasteful to let that digit take up one precision bit in almost every single number.
For this reason, they created the "leading bit convention":
always assume that the number starts with one
But then how to deal with 0.0? Well, they decided to create an exception:
if the exponent is 0
and the fraction is 0
then the number represents plus or minus 0.0
so that the bytes 00 00 00 00 also represent 0.0, which looks good.
If we only considered these rules, then the smallest non-zero number that can be represented would be:
exponent: 0
fraction: 1
which, due to the leading bit convention, looks something like this as a hex fraction:
1.000002 * 2 ^ (-127)
where the hex fraction .000002 is, in binary, 22 zeroes with a 1 at the end.
We cannot take fraction = 0, otherwise that number would be 0.0.
But then the engineers, who also had a keen aesthetic sense, thought: isn't that ugly? That we jump from straight 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow? (OK, it was a bit more concerning than "ugly": it was actually people getting bad results for their computations, see "How subnormals improve computations" below).
Subnormal numbers
The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:
If the exponent is 0, then:
the leading bit becomes 0
the exponent is fixed to -126 (not -127, as it would be if we didn't have this exception)
Such numbers are called subnormal numbers (or denormal numbers, which is a synonym).
This rule immediately implies that the number such that:
exponent: 0
fraction: 0
is still 0.0, which is kind of elegant as it means one less rule to keep track of.
So 0.0 is actually a subnormal number according to our definition!
With this new rule then, the smallest non-subnormal number is:
exponent: 1 (0 would be subnormal)
fraction: 0
which represents:
1.0 * 2 ^ (-126)
Then, the largest subnormal number is:
exponent: 0
fraction: 0x7FFFFF (23 bits 1)
which equals:
0.FFFFFE * 2 ^ (-126)
where .FFFFFE is once again 23 one-bits to the right of the dot.
This is pretty close to the smallest non-subnormal number, which sounds sane.
And the smallest non-zero subnormal number is:
exponent: 0
fraction: 1
which equals:
0.000002 * 2 ^ (-126)
which also looks pretty close to 0.0!
Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever it is that they did in the 70s instead.
As you can see, subnormal numbers do a trade-off between precision and representation length.
As the most extreme example, the smallest non-zero subnormal:
0.000002 * 2 ^ (-126)
has essentially the precision of a single bit instead of the 24 bits a normal float has. For example, if we divide it by two:
0.000002 * 2 ^ (-126) / 2
we actually reach 0.0 exactly!
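This underflow is easy to check, assuming an IEEE-754 float with the default round-to-nearest mode:

#include <iostream>
#include <limits>

int main() {
    float tiny = std::numeric_limits<float>::denorm_min(); // 2^-149
    std::cout << (tiny / 2.0f == 0.0f) << std::endl;       // 1: halved, it is exactly 0.0
}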
Visualization
It is always a good idea to have a geometric intuition about what we learn, so here goes.
If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:
+---+-------+---------------+-------------------------------+
exponent |126| 127 | 128 | 129 |
+---+-------+---------------+-------------------------------+
| | | | |
v v v v v
-------------------------------------------------------------
floats ***** * * * * * * * * * * * *
-------------------------------------------------------------
^ ^ ^ ^ ^
| | | | |
0.5 1.0 2.0 4.0 8.0
From that we can see that:
for each exponent, there is no overlap between the represented numbers
for each exponent, we have the same number 2^23 of floating point numbers (here represented by 4 *)
within each exponent, points are equally spaced
larger exponents cover larger ranges, but with points more spread out
Now, let's bring that down all the way to exponent 0.
Without subnormals, it would hypothetically look like this:
+---+---+-------+---------------+-------------------------------+
exponent | ? | 0 | 1 | 2 | 3 |
+---+---+-------+---------------+-------------------------------+
| | | | | |
v v v v v v
-----------------------------------------------------------------
floats * **** * * * * * * * * * * * *
-----------------------------------------------------------------
^ ^ ^ ^ ^ ^
| | | | | |
0 | 2^-126 2^-125 2^-124 2^-123
|
2^-127
With subnormals, it looks like this:
+-------+-------+---------------+-------------------------------+
exponent | 0 | 1 | 2 | 3 |
+-------+-------+---------------+-------------------------------+
| | | | |
v v v v v
-----------------------------------------------------------------
floats * * * * * * * * * * * * * * * * *
-----------------------------------------------------------------
^ ^ ^ ^ ^ ^
| | | | | |
0 | 2^-126 2^-125 2^-124 2^-123
|
2^-127
By comparing the two graphs, we see that:
subnormals double the length of range of exponent 0, from [2^-127, 2^-126) to [0, 2^-126)
The space between floats in the subnormal range is the same as in the neighboring normal range [2^-126, 2^-125).
the range [2^-127, 2^-126) has half the number of points that it would have without subnormals.
Half of those points go to fill the other half of the range.
the range [0, 2^-127) has some points with subnormals, but none without.
This lack of points in [0, 2^-127) is not very elegant, and is the main reason for subnormals to exist!
since the points are equally spaced:
the range [2^-128, 2^-127) has half as many points as [2^-127, 2^-126)
the range [2^-129, 2^-128) has half as many points as [2^-128, 2^-127)
and so on
This is what we mean when saying that subnormals are a tradeoff between size and precision.
Runnable C example
Now let's play with some actual code to verify our theory.
On almost all current desktop machines, C float represents single-precision IEEE 754 floating point numbers.
This is in particular the case for my Ubuntu 18.04 amd64 Lenovo P51 laptop.
With that assumption, all assertions pass on the following program:
subnormal.c
#if __STDC_VERSION__ < 201112L
#error C11 required
#endif
#ifndef __STDC_IEC_559__
#error IEEE 754 not implemented
#endif
#include <assert.h>
#include <float.h> /* FLT_HAS_SUBNORM */
#include <inttypes.h>
#include <math.h> /* isnormal */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#if FLT_HAS_SUBNORM != 1
#error float does not have subnormal numbers
#endif
typedef struct {
    uint32_t sign, exponent, fraction;
} Float32;

Float32 float32_from_float(float f) {
    uint32_t bytes;
    Float32 float32;
    memcpy(&bytes, &f, sizeof bytes); /* type-pun without strict-aliasing UB */
    float32.fraction = bytes & 0x007FFFFF;
    bytes >>= 23;
    float32.exponent = bytes & 0x000000FF;
    bytes >>= 8;
    float32.sign = bytes & 0x00000001;
    return float32;
}

float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float f;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    memcpy(&f, &bytes, sizeof f); /* type-pun without strict-aliasing UB */
    return f;
}

int float32_equal(
    float f,
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    Float32 float32;
    float32 = float32_from_float(f);
    return
        (float32.sign == sign) &&
        (float32.exponent == exponent) &&
        (float32.fraction == fraction)
    ;
}

void float32_print(float f) {
    Float32 float32 = float32_from_float(f);
    printf(
        "%" PRIu32 " %" PRIu32 " %" PRIu32 "\n",
        float32.sign, float32.exponent, float32.fraction
    );
}

int main(void) {
    /* Basic examples. */
    assert(float32_equal(0.5f, 0, 126, 0));
    assert(float32_equal(1.0f, 0, 127, 0));
    assert(float32_equal(2.0f, 0, 128, 0));
    assert(isnormal(0.5f));
    assert(isnormal(1.0f));
    assert(isnormal(2.0f));

    /* Quick review of C hex floating point literals. */
    assert(0.5f == 0x1.0p-1f);
    assert(1.0f == 0x1.0p0f);
    assert(2.0f == 0x1.0p1f);

    /* Sign bit. */
    assert(float32_equal(-0.5f, 1, 126, 0));
    assert(float32_equal(-1.0f, 1, 127, 0));
    assert(float32_equal(-2.0f, 1, 128, 0));
    assert(isnormal(-0.5f));
    assert(isnormal(-1.0f));
    assert(isnormal(-2.0f));

    /* The special case of 0.0 and -0.0. */
    assert(float32_equal( 0.0f, 0, 0, 0));
    assert(float32_equal(-0.0f, 1, 0, 0));
    assert(!isnormal( 0.0f));
    assert(!isnormal(-0.0f));
    assert(0.0f == -0.0f);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(float32_equal(FLT_MIN, 0, 1, 0));
    assert(isnormal(FLT_MIN));

    /* The largest subnormal number. */
    float largest_subnormal = float_from_bytes(0, 0, 0x7FFFFF);
    assert(largest_subnormal == 0x0.FFFFFEp-126f);
    assert(largest_subnormal < FLT_MIN);
    assert(!isnormal(largest_subnormal));

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));

    return EXIT_SUCCESS;
}
Compile and run with:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out
C++
In addition to exposing all of C's APIs, C++ also exposes some extra subnormal-related functionality in <limits> that is not as readily available in C, e.g.:
denorm_min: returns the minimum positive subnormal value of the type T
In C++ the whole API is templated for each floating point type, and is much nicer.
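For instance, a minimal C++ check of denorm_min against the smallest normal value (outputs assume IEEE-754 binary32):

#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<float>::denorm_min() << std::endl; // 1.4013e-45 (2^-149)
    std::cout << std::numeric_limits<float>::min() << std::endl;        // 1.17549e-38 (2^-126)
}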
Implementations
x86_64 and ARMv8 implement IEEE 754 directly in hardware, which the C code translates to.
Subnormals seem to be slower than normals in certain implementations: Why does changing 0.1f to 0 slow down performance by 10x? This is mentioned in the ARM manual; see the "ARMv8 details" section of this answer.
ARMv8 details
ARM Architecture Reference Manual ARMv8 DDI 0487C.a manual A1.5.4 "Flush-to-zero" describes a configurable mode where subnormals are rounded to zero to improve performance:
The performance of floating-point processing can be reduced when doing calculations involving denormalized numbers and Underflow exceptions. In many algorithms, this performance can be recovered, without significantly affecting the accuracy of the final result, by replacing the denormalized operands and intermediate results with zeros. To permit this optimization, ARM floating-point implementations allow a Flush-to-zero mode to be used for different floating-point formats as follows:
For AArch64:
If FPCR.FZ==1, then Flush-to-Zero mode is used for all Single-Precision and Double-Precision inputs and outputs of all instructions.
If FPCR.FZ16==1, then Flush-to-Zero mode is used for all Half-Precision inputs and outputs of floating-point instructions, other than: conversions between Half-Precision and Single-Precision numbers, and conversions between Half-Precision and Double-Precision numbers.
A1.5.2 "Floating-point standards, and terminology" Table A1-3 "Floating-point terminology" confirms that subnormals and denormals are synonyms:
This manual IEEE 754-2008
------------------------- -------------
[...]
Denormal, or denormalized Subnormal
C5.2.7 "FPCR, Floating-point Control Register" describes how ARMv8 can optionally raise exceptions or set a flag bits whenever the input of a floating point operation is subnormal:
FPCR.IDE, bit [15] Input Denormal floating-point exception trap enable. Possible values are:
0b0 Untrapped exception handling selected. If the floating-point exception occurs then the FPSR.IDC bit is set to 1.
0b1 Trapped exception handling selected. If the floating-point exception occurs, the PE does not update the FPSR.IDC bit. The trap handling software can decide whether to set the FPSR.IDC bit to 1.
D12.2.88 "MVFR1_EL1, AArch32 Media and VFP Feature Register 1" shows that denormal support is completely optional in fact, and offers a bit to detect if there is support:
FPFtZ, bits [3:0]
Flush to Zero mode. Indicates whether the floating-point implementation provides support only for the Flush-to-Zero mode of operation. Defined values are:
0b0000 Not implemented, or hardware supports only the Flush-to-Zero mode of operation.
0b0001 Hardware supports full denormalized number arithmetic.
All other values are reserved.
In ARMv8-A, the permitted values are 0b0000 and 0b0001.
This suggests that when subnormals are not implemented, implementations just revert to flush-to-zero.
Infinity and NaN
Curious? I've written some things at:
infinity: Ranges of floating point datatype in C?
NaN: What is the difference between quiet NaN and signaling NaN?
How subnormals improve computations
According to the Oracle (formerly Sun) Numerical Computation Guide
[S]ubnormal numbers eliminate underflow as a cause for concern for a variety of computations (typically, multiply followed by add). ... The class of problems that succeed in the presence of gradual underflow, but fail with Store 0, is larger than the fans of Store 0 may realize. ... In the absence of gradual underflow, user programs need to be sensitive to the implicit inaccuracy threshold. For example, in single precision, if underflow occurs in some parts of a calculation, and Store 0 is used to replace underflowed results with 0, then accuracy can be guaranteed only to around 10^-31, not 10^-38, the usual lower range for single-precision exponents.
The Numerical Computation Guide refers the reader to two other papers:
Underflow and the Reliability of Numerical Software by James Demmel
Combatting the Effects of Underflow and Overflow in Determining Real Roots of Polynomials by S. Linnainmaa
Thanks to Willis Blackburn for contributing to this section of the answer.
Actual history
An Interview with the Old Man of Floating-Point by Charles Severance (1998) is a short real world historical overview in the form of an interview with William Kahan and was suggested by John Coleman in the comments.
In the IEEE 754 standard, floating point numbers are represented as binary scientific notation, x = M × 2^e. Here M is the mantissa and e is the exponent. Mathematically, you can always choose the exponent so that 1 ≤ M < 2.* However, since in the computer representation the exponent can only have a finite range, there are some numbers which are bigger than zero, but smaller than 1.0 × 2^emin. Those numbers are the subnormals or denormals.
Practically, the mantissa is stored without the leading 1, since there is always a leading 1, except for subnormal numbers (and zero). Thus the interpretation is that if the exponent is non-minimal, there is an implicit leading 1, and if the exponent is minimal, there isn't, and the number is subnormal.
*) More generally, 1 ≤ M < B for any base-B scientific notation.
From http://blogs.oracle.com/d/entry/subnormal_numbers:
There are potentially multiple ways of representing the same number. Using decimal as an example, the number 0.1 could be represented as 1×10^-1 or 0.1×10^0 or even 0.01×10^1. The standard dictates that the numbers are always stored with the first bit as a one. In decimal that corresponds to the 1×10^-1 example.
Now suppose that the lowest exponent that can be represented is -100. So the smallest number that can be represented in normal form is 1×10^-100. However, if we relax the constraint that the leading bit be a one, then we can actually represent smaller numbers in the same space. Taking a decimal example, we could represent 0.1×10^-100. This is called a subnormal number. The purpose of having subnormal numbers is to smooth the gap between the smallest normal number and zero.
It is very important to realise that subnormal numbers are represented
with less precision than normal numbers. In fact, they are trading
reduced precision for their smaller size. Hence calculations that use
subnormal numbers are not going to have the same precision as
calculations on normal numbers. So an application which does
significant computation on subnormal numbers is probably worth
investigating to see if rescaling (i.e. multiplying the numbers by
some scaling factor) would yield fewer subnormals, and more accurate
results.