Assuming I am really pressed for memory and want a smaller range (similar to short vs int): shader languages already support half, a floating-point type with half the precision (not just converting back and forth so that the value is between -1 and 1, i.e., returning a float like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?
I am also interested to know of any (historical?) reasons why there is no 2-byte float.
TL;DR: 16-bit floats do exist, and there are various software as well as hardware implementations.
There are currently two common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they are standardized, anyone who knows the spec can obviously write an implementation. Some examples:
https://github.com/ramenhut/half
https://github.com/minhhn2910/cuda-half2
https://github.com/tianshilei1992/half_precision
https://github.com/acgessler/half_float
Or, if you don't want to use them, you can also design a different 16-bit float format and implement it yourself.
2-byte floats are generally not used for arithmetic, because even float's precision is not enough for normal operations, and double should be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when written without a suffix in C and C-like languages. See
Why are double preferred over float?
Should I use double or float?
When do you use float and when do you use double
However, less-than-32-bit floats do exist. They're mainly used for storage, as in graphics, where 96 bits per pixel (32 bits per channel × 3 channels) is far too wasteful, and they are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0, as well as some GPUs like the Radeon R300 and R420, has a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful, but due to their simplicity they're often taught in computer science curricula. Besides, a small float is also used in ARM's instruction encoding for small floating-point immediates.
The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a 5-bit exponent and a 10-bit mantissa (11 bits of significand precision, counting the implicit leading bit).
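To make that layout concrete, here is a minimal decoder sketch (a hypothetical helper, not a standard API) that expands a binary16 bit pattern into a float, covering subnormals, infinities, and NaN:

```cpp
#include <cstdint>
#include <cmath>

// binary16 layout: 1 sign bit | 5 exponent bits (bias 15) | 10 mantissa bits.
// Decode side only; real libraries also provide encoding with proper rounding.
float half_to_float(uint16_t h) {
    int sign = (h >> 15) & 0x1;
    int exp  = (h >> 10) & 0x1F;
    int man  = h & 0x3FF;
    float value;
    if (exp == 0)                       // zero or subnormal: no implicit bit
        value = std::ldexp((float)man, -24);                 // man * 2^-24
    else if (exp == 31)                 // all-ones exponent: Inf or NaN
        value = man ? NAN : INFINITY;
    else                                // normal: implicit leading 1
        value = std::ldexp((float)(man | 0x400), exp - 25);  // (1024+man) * 2^(exp-15-10)
    return sign ? -value : value;
}
```

For example, the bit pattern 0x3C00 decodes to 1.0f and 0xC000 to -2.0f.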
Some compilers had support for IEEE-754 binary16, but mainly for conversion or vectorized operations, not for computation (because it's not precise enough). For example, ARM's toolchain has __fp16, which can be one of two variants, IEEE and alternative, depending on whether you want more range or NaN/Inf representations. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64
More recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which is a simple truncation of the top 16 bits of IEEE-754 binary32, has become common.
The motivation behind the reduced mantissa comes from Google's experiments, which showed that it is fine to shrink the mantissa as long as it's still possible to represent tiny values close to zero as part of the summation of small differences during training. A smaller mantissa also brings a number of other advantages, such as reduced multiplier power and physical silicon area.
Multiplier area scales roughly with the square of the significand width:
float32: 24² = 576 (100%)
float16: 11² = 121 (21%)
bfloat16: 8² = 64 (11%)
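Since bfloat16 is defined as the top 16 bits of a binary32, conversion can be sketched in a couple of lines (hypothetical helper names; this simple version truncates the low mantissa bits rather than rounding them):

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 keeps binary32's sign bit and full 8-bit exponent but only the
// top 7 mantissa bits, so conversion is just a 16-bit shift.
uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // type-pun safely via memcpy
    return (uint16_t)(bits >> 16);         // truncate the low mantissa bits
}

float bfloat16_to_float(uint16_t b) {
    uint32_t bits = (uint32_t)b << 16;     // low mantissa bits become zero
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

A round trip keeps float's full exponent range but only about 2-3 decimal digits: 3.14159f comes back as 3.140625f.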
Many compilers, like GCC and ICC, have now also gained the ability to support bfloat16.
More information about bfloat16:
bfloat16 - Hardware Numerics Definition
Using bfloat16 with TensorFlow models
What is tf.bfloat16 "truncated 16-bit floating point"?
In cases where bfloat16 is not enough, there's also the rise of a new 19-bit type called TensorFloat.
Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c
Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).
If you're low on memory, did you consider dropping the float concept? Floats use up a lot of bits just to record where the decimal point is. You can work around this if you know where you need the decimal point. Say you want to store a dollar value; you could just store it in cents:
#include <cstdint>
#include <iostream>

uint16_t cash = 50000;  // $500.00, stored as cents
std::cout << "Cash: $" << (cash / 100) << "."
          << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;
That is, of course, only an option if you can predetermine the position of the decimal point. But if you can, prefer it, because this also speeds up all calculations!
There is an IEEE 754 standard for 16-bit floats.
It's a relatively new format, having been standardized in 2008 based on a format used in a GPU released in 2002.
To go a bit further than Kiralein on switching to integers: we could define a range and let the integer values of a short represent equal divisions over the range, with some symmetry if straddling zero:

short mappedval = (short)(val / range * 32767);  // scale onto the short's range; a bare (short)(val / range) would truncate everything to -1, 0, or 1
Differences between these integer versions and using half precision floats:
Integers are equally spaced over the range, whereas floats are more densely packed near zero
Using integers keeps the math in the CPU's integer units rather than floating point, which is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions, etc., to retrieve the value at the end.
The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int. This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing where 32 bits are acted on in parallel, which is used in crypto.
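As a rough sketch of that pairing idea (a SWAR-style helper invented here for illustration, not from the answer above): two 16-bit lanes packed in a 32-bit word can be added lane-wise as long as the carry is masked off at the lane boundary:

```cpp
#include <cstdint>

// Add two pairs of 16-bit lanes packed into 32-bit words, lane-wise.
// Each lane is masked and summed separately so a carry out of the low
// lane cannot spill into the high lane.
uint32_t packed_add16(uint32_t a, uint32_t b) {
    uint32_t low  = (a & 0x0000FFFFu) + (b & 0x0000FFFFu);
    uint32_t high = (a & 0xFFFF0000u) + (b & 0xFFFF0000u);
    return (high & 0xFFFF0000u) | (low & 0x0000FFFFu);
}
```

Pure bitwise operations (AND, OR, XOR) need no masking at all, which is why they benefit most from this trick.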
If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:
// needs to be compiled with -mf16c enabled
#include <immintrin.h>
#include <cstdint>
#include <istream>

struct float16
{
private:
    uint16_t _value;
public:
    inline float16() : _value(0) {}
    inline float16(const float16&) = default;
    inline float16(float16&&) = default;
    inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

    inline float16& operator = (const float16&) = default;
    inline float16& operator = (float16&&) = default;
    inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

    inline operator float () const { return _cvtsh_ss(_value); }

    inline friend std::istream& operator >> (std::istream& input, float16& h)
    {
        float f = 0;
        input >> f;
        h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
        return input;
    }
};
Maths is still performed using 32-bit floats (the F16C extension only provides conversions between 16- and 32-bit floats; no instructions exist to compute arithmetic with 16-bit floats).
There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only true right now, though it probably won't get smaller. Terms like half and long mostly become meaningless with time; with 128- or 256-bit computers they could come to mean anything.
I'm working with images (1+1+1 byte/pixel) and I want to express each pixel's value relative to the average. So floating point or carefully fixed point, but not 4 times as big as the raw data please. A 16-bit float sounds about right.
This GCC 7.3 doesn't know "half"; maybe it exists in a C++ context.
A 2-byte float is available in the Clang C compiler; the data type is represented as __fp16.
Various compilers now support three different half precision formats:
__fp16 is mostly used as a storage format. It is promoted to float as soon as you do calculations on it. Calculations on __fp16 give a float result. __fp16 has a 5-bit exponent and a 10-bit mantissa.
_Float16 is the same format as __fp16, but used as an interchange and arithmetic format. Calculations on _Float16 give a _Float16 result.
__bf16 is a storage format with less precision. It has an 8-bit exponent and a 7-bit mantissa.
All three types are supported by compilers for the ARM architecture and now also by compilers for x86 processors. The AVX512_FP16 instruction set extension will be supported by Intel's forthcoming Golden Cove processors, and it is supported by the latest Clang, GNU, and Intel compilers. Vectors of _Float16 are defined as __m128h, __m256h, and __m512h on compilers that support AVX512_FP16.
References:
https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-data-types
https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
In my project I have to compute division, multiplication, subtraction, and addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output is drastically affected.
Currently I am using double for each element, which I believe uses 8 bytes of memory and has an accuracy of about 16 significant digits, irrespective of the decimal position.
Even for a large matrix, the memory occupied by all the elements is in the range of a few kilobytes, so I can afford to use data types that require more memory.
So I wanted to know which data types are more precise than double.
I tried searching in some books and found long double,
but I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
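The digit counts quoted above are just mantissa bits times log10(2); here is a small sketch that computes them (the long double figure depends on your platform: 64 bits on x87, 113 with IEEE quad, 53 where long double is plain double):

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

// Decimal digits of precision ~= significand bits * log10(2) ~= bits * 0.30103
void print_precision() {
    std::printf("double:      %d bits ~ %.2f digits\n",
                DBL_MANT_DIG,  DBL_MANT_DIG  * std::log10(2.0));
    std::printf("long double: %d bits ~ %.2f digits\n",
                LDBL_MANT_DIG, LDBL_MANT_DIG * std::log10(2.0));
}
```

With a 64-bit mantissa this prints about 19.27 digits, matching the figure above.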
You might want to consider the sequence of operations, i.e., do the additions in an ordered sequence, starting with the smallest values first. This will increase the overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number makes them disappear, so the latter approach reduces the numerical error.
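The same effect is easy to reproduce in single precision, where it already shows up after about 10^7 additions; a small sketch (hypothetical function name):

```cpp
#include <utility>

// Add 1e7 copies of 1e-8f to 1.0f in two orders. Added one at a time to
// 1.0f, each addend is below half an ulp of 1.0f and vanishes entirely;
// accumulated first among themselves, their ~0.1 total survives.
std::pair<float, float> sum_both_orders() {
    const int n = 10000000;
    float big_first = 1.0f;
    for (int i = 0; i < n; ++i) big_first += 1e-8f;  // stays exactly 1.0f
    float small = 0.0f;
    for (int i = 0; i < n; ++i) small += 1e-8f;      // roughly 0.1f
    return { big_first, small + 1.0f };              // (1.0f, roughly 1.1f)
}
```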
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures, the precision of long double is 80 bits.
What kind of values do you want to represent? Maybe you are better off using fixed precision.
A floating-point type represents a number by storing its significant digits and its exponent in separate bit fields of a binary word, so it fits in 16, 32, 64, or 128 bits.
A fixed-point type stores numbers with two fields: one representing the integer part and another representing the part past the radix point, in negative powers of two: 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in an exponent sense, but not if one wants to store numbers with more precision over a certain range, for example only using integers from -16 to 16, thus using more bits to hold digits past the radix point.
In terms of performance, which one performs best, or are there cases where one is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is negligible, or do they make their own fixed-point type?
Why isn't there any fixed-point type in C/C++?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a-priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64-bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64-bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32-bit mantissa, scaled by 2^4 (now there is no fractional part)
32-bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically a 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient, so these APIs allow specifying coordinates (in texture space, color space, etc.) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed point.
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than the FPU. When vectorization is used, the efficiency of vectorization matters far more, such that the performance difference between fixed point and floating point is moot.
Some architectures provide hardware implementations of certain math functions, such as sin, cos, atan, and sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions using only integer or fixed-point arithmetic. Often, such libraries provide multiple levels of precision, for example, answers which are only accurate up to N bits of precision, less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
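As a sketch of the kind of routine such a library might contain, here is the standard bit-by-bit integer square root, which uses only shifts, adds, and compares (no floating point anywhere):

```cpp
#include <cstdint>

// Integer square root via the classic bit-by-bit method:
// finds the largest r with r*r <= n using only integer ops.
uint32_t isqrt(uint32_t n) {
    uint32_t result = 0;
    uint32_t bit = 1u << 30;  // highest power of 4 that fits in 32 bits
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}
```

On a core with no FPU, a routine like this is typically far cheaper than pulling in a software floating-point sqrt.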
Fixed point is widely used in DSP and embedded systems, where often the target processor has no FPU and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously, if there is no FPU, then fixed point will be considerably faster. When you have an FPU, it will depend on the application too. For example, performing functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed-point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages, because the need for fixed point is somewhat domain-specific, and also perhaps because on a general-purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However, there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
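As an illustration of the idea (this is not the article's code; the names and the Q16.16 layout below are invented for the sketch), such a class might start like this. A real library would add division, comparisons, conversions, rounding policies, and overflow handling:

```cpp
#include <cstdint>

// Bare-bones Q16.16 fixed-point type: 16 integer bits, 16 fractional bits.
class Fixed {
    int32_t raw_;  // stored value is (real value) * 2^16
    Fixed(int32_t raw, int) : raw_(raw) {}  // tagged raw constructor
public:
    static const int kFracBits = 16;
    Fixed(int v) : raw_(v << kFracBits) {}
    static Fixed fromRaw(int32_t raw) { return Fixed(raw, 0); }

    Fixed operator+(Fixed o) const { return fromRaw(raw_ + o.raw_); }
    Fixed operator-(Fixed o) const { return fromRaw(raw_ - o.raw_); }
    Fixed operator*(Fixed o) const {
        // Widen to 64 bits, multiply, then shift the radix point back.
        return fromRaw(int32_t((int64_t(raw_) * o.raw_) >> kFracBits));
    }
    double toDouble() const { return raw_ / 65536.0; }
};
```

Usage is then ordinary-looking arithmetic, e.g. `Fixed(3) * Fixed(2)` yields 6.0, with every operation compiled down to integer instructions.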
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating-point and integer math depends on the CPU you have in mind. On Intel chips the difference in clock ticks is not big. Integer math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address-calculation instructions to optimize an add plus a multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed-point math. You just define a struct with one int and override the appropriate operators, making them do what they normally do plus a shift to put the radix point back in the right position.
You don't use float in games because it is faster or slower; you use it because it is easier to implement algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc., makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400: project the ray from the eye to the point on the object in the world, compute where it pierces the screen using 0-to-1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy size: whatever is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating-point math at its core uses fixed-point logic, with extra logic and extra steps required. By definition, comparing apples to apples, fixed-point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate (normalize) the data on the way out. When you add in IEEE 754 and all of its baggage, that adds even more logic and more clock cycles (properly signed infinities, quiet and signaling NaNs, different results for the same operation if an exception handler is enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processor's units and recover some clocks that way. With both float and fixed, the clock rate can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks, as well as parallel operation.
One issue not covered in the answers is power consumption. Though it highly depends on the specific hardware architecture, usually an FPU consumes much more energy than the integer ALU in a CPU, so if you target mobile applications where power consumption is important, it's worth considering a fixed-point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point, then you lose precision: you have to select the number of places after the radix point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task at hand, since it uses a standard-form (scientific-notation) representation of the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired and optimised for floating-point operations. Furthermore, most modern CPUs offer SIMD instruction sets which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8-bit and 16-bit) you may have neither an FPU nor extended instruction sets, in which case you may be forced to use fixed-point methods or the limited floating-point routines that are not very fast. So in these circumstances fixed point will be the better - or even your only - choice.