half-precision floating point numbers [duplicate] - c++

Assume I am really pressed for memory and want a smaller range (similar to short vs. int). Shader languages already support half, a floating-point type with half the precision (not just converting back and forth so that the value is between -1 and 1, i.e. returning a float like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?
I am also interested to know of any (historical?) reasons why there is no 2-byte float.

TL;DR: 16-bit floats do exist, and there are various software as well as hardware implementations.
There are currently two common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, obviously anyone who knows the spec can write an implementation. Some examples:
https://github.com/ramenhut/half
https://github.com/minhhn2910/cuda-half2
https://github.com/tianshilei1992/half_precision
https://github.com/acgessler/half_float
Or if you don't want to use them, you can also design a different 16-bit float format and implement it
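For a sense of what such an implementation involves, here is a minimal decoder sketch for IEEE-754 binary16 (an illustration of the bit layout only, not code from any of the libraries above):
#include <cstdint>
#include <cstring>
#include <cmath>

// decode IEEE-754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;
    if (exp == 0) {                       // zero or denormal: value = man * 2^-24
        float v = std::ldexp((float)man, -24);
        std::memcpy(&bits, &v, sizeof bits);
        bits |= sign;
    } else if (exp == 31) {               // infinity or NaN
        bits = sign | 0x7F800000u | (man << 13);
    } else {                              // normal number: re-bias exponent 15 -> 127
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}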
2-byte floats are generally not used, because even float's precision is not enough for normal operations; double should always be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when written without a suffix in C and C-like languages. See
Why are double preferred over float?
Should I use double or float?
When do you use float and when do you use double
However, less-than-32-bit floats do exist. They're mainly used for storage purposes, as in graphics, where 96 bits per pixel (32 bits per channel × 3 channels) is far too wasteful, and are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10, 11, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0, as well as some GPUs like the Radeon R300 and R420, has a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful, but due to their simplicity, they're often taught in computer science curricula. Besides, a small float is also used in ARM's instruction encoding for small floating-point immediates.
The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a 5-bit exponent and an 11-bit significand (10 bits stored explicitly plus one implicit bit).
Some compilers had support for IEEE-754 binary16, but mainly for conversion or vectorized operations and not for computation (because it's not precise enough). For example, ARM's toolchain has __fp16, which can be one of two variants, IEEE and alternative, depending on whether you want more range or NaN/inf representations. GCC and Clang also support __fp16, along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64
Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which is a simple truncation of the top 16 bits of IEEE-754 binary32, became common.
The motivation behind the reduced mantissa is derived from Google's experiments, which showed that it is fine to reduce the mantissa as long as it's still possible to represent tiny values close to zero as part of the summation of small differences during training. A smaller mantissa brings a number of other advantages, such as reducing the multiplier power and physical silicon area, which grow roughly with the square of the significand width:
float32: 24² = 576 (100%)
float16: 11² = 121 (21%)
bfloat16: 8² = 64 (11%)
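Since bfloat16 is literally the top half of a binary32, a conversion sketch is tiny. Truncation is shown here for clarity; real converters usually round to nearest even:
#include <cstdint>
#include <cstring>

uint16_t float_to_bfloat16(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // grab the binary32 bit pattern
    return (uint16_t)(u >> 16);     // keep sign, 8-bit exponent, top 7 mantissa bits
}

float bfloat16_to_float(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16; // widen; the dropped mantissa bits become zero
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}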
Many compilers, like GCC and ICC, have now also gained the ability to support bfloat16.
More information about bfloat16:
bfloat16 - Hardware Numerics Definition
Using bfloat16 with TensorFlow models
What is tf.bfloat16 "truncated 16-bit floating point"?
In cases where bfloat16 is not enough, there's also the rise of a new 19-bit type called TensorFloat.

Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c
Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).
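(The digit counts follow from decimal digits ≈ significand bits × log₁₀ 2 ≈ bits × 0.30103: for binary16's 11 bits, 11 × 0.30103 ≈ 3.31; for a single's 24 bits, 24 × 0.30103 ≈ 7.2; for a double's 53 bits, 53 × 0.30103 ≈ 15.95.)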

If you're low on memory, did you consider dropping the float concept? Floats use up a lot of bits just for saving where the decimal point is. You can work around this if you know where you need the decimal point. Say you want to store a dollar value; you could just store it in cents:
#include <cstdint>   // for uint16_t
#include <iostream>  // for std::cout
uint16_t cash = 50000;
std::cout << "Cash: $" << (cash / 100) << "." << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;
That is, of course, only an option if you can predetermine the position of the decimal point. But if you can, always prefer it, because this also speeds up all calculations!

There is an IEEE 754 standard for 16-bit floats.
It's a new format, having been standardized in 2008 based on a GPU released in 2002.

To go a bit further than Kiralein on switching to integers, we could define a range and let the integer values of a short represent equal divisions over that range, with some symmetry if straddling zero:
// map val in [-range, +range] onto the full span of a signed 16-bit integer
short mappedval = (short)(val / range * 32767);
Differences between these integer versions and using half precision floats:
Integers are equally spaced over the range, whereas floats are more densely packed near zero
Using integers will use integer math in the CPU rather than floating point. That is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions, etc., to retrieve the value at the end.
The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int. This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing where 32 bits are acted on in parallel, which is used in crypto.
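To illustrate the packing idea, here is a hedged sketch (the helper names are mine) of riding two 16-bit values in one 32-bit word so that a single bitwise operation acts on both at once:
#include <cstdint>

// pack two 16-bit values into one 32-bit word
uint32_t pack(uint16_t a, uint16_t b) { return (uint32_t)a | ((uint32_t)b << 16); }

// one AND on the packed word masks both halves in a single operation
uint32_t mask_both(uint32_t packed, uint16_t mask)
{
    uint32_t wide = (uint32_t)mask | ((uint32_t)mask << 16);
    return packed & wide;
}

// unpack
uint16_t low (uint32_t p) { return (uint16_t)(p & 0xFFFFu); }
uint16_t high(uint32_t p) { return (uint16_t)(p >> 16); }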

If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:
// needs to be compiled with -mf16c enabled
#include <immintrin.h>
#include <cstdint>
#include <istream>

struct float16
{
private:
    uint16_t _value;
public:
    inline float16() : _value(0) {}
    inline float16(const float16&) = default;
    inline float16(float16&&) = default;
    inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

    inline float16& operator = (const float16&) = default;
    inline float16& operator = (float16&&) = default;
    inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

    inline operator float () const { return _cvtsh_ss(_value); }

    inline friend std::istream& operator >> (std::istream& input, float16& h)
    {
        float f = 0;
        input >> f;
        h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
        return input;
    }
};
Maths is still performed using 32-bit floats (the F16C extensions only provide conversions between 16- and 32-bit floats; no instructions exist to compute arithmetic with 16-bit floats).
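A quick usage sketch of the struct above (hypothetical values):
float16 h = 3.14159f;        // narrowed to 16 bits by one F16C instruction
float   f = h;               // widened back to 32 bits by another
float16 sum = float(h) + f;  // arithmetic happens in 32-bit, result re-narrowed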

There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only true right now, though it probably won't get smaller. Terms like half and long mostly become meaningless with time. With 128 or 256-bit computers they could come to mean anything.
I'm working with images (1+1+1 bytes/pixel) and I want to express each pixel's value relative to the average. So floating point, or carefully fixed point, but please not 4 times as big as the raw data. A 16-bit float sounds about right.
GCC 7.3 doesn't know "half", at least not in a C++ context.

A 2-byte float is available in the Clang C compiler; the data type is represented as __fp16.

Various compilers now support three different half-precision formats:
__fp16 is mostly used as a storage format. It is promoted to float as soon as you do calculations on it, so calculations on __fp16 give a float result. __fp16 has a 5-bit exponent and a 10-bit mantissa.
_Float16 has the same layout as __fp16, but is used as an interchange and arithmetic format: calculations on _Float16 give a _Float16 result.
__bf16 is a storage format with less precision. It has an 8-bit exponent and a 7-bit mantissa.
All three types are supported by compilers for the ARM architecture and now also by compilers for x86 processors. The AVX512_FP16 instruction set extension will be supported by Intel's forthcoming Golden Cove processors, and it is supported by the latest Clang, GNU, and Intel compilers. Vectors of _Float16 are defined as __m128h, __m256h, and __m512h on compilers that support AVX512_FP16.
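A small sketch of the arithmetic behaviour (assuming a recent GCC or Clang where _Float16 is available as an arithmetic type, e.g. on x86-64 with SSE2):
_Float16 a = 1.5f16;   // f16 literal suffix (TS 18661-3 / C23)
_Float16 b = a * a;    // stays in _Float16; the result is rounded to half precision
float    c = a;        // widening to float is implicit and always exact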
References:
https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-data-types
https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point

Related

C++ support of _Float16 [duplicate]


Should I use bit manipulation on float point numbers

I'm writing an algorithm to round a floating-point number. The input will be a 64-bit IEEE 754 double very close to X.5, where X is an integer less than 32. The first solution that came to mind is to use a bit mask to mask off the least significant bits, as they represent very small fractions of 2^-n (given the exponent is not large).
But the problem is: should I do that? Is there any other way to accomplish the same thing? I feel that using bit operations on floating point is very controversial. Thanks!
The language I'm using is C++, by the way.
Edit:
Thanks guys, for your comments. I appreciate it! Let's say I have a float number, which can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest X.5 form, since that can be stored exactly in an IEEE 754 floating-point number.
If your compiler guarantees that you are using IEEE 754 floating point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract, a large constant, so as to send the value into the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double precision is 2251799813685248.0 (2^51).
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
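As a self-contained sketch of that trick (assuming IEEE 754 doubles, the default round-to-nearest-even mode, and no -ffast-math style reassociation):
// round x (with 0 <= x < 32) to the nearest multiple of 0.5
double round_half(double x)
{
    const double C = 2251799813685248.0; // 2^51: in [2^51, 2^52) the ULP is 0.5
    return (x + C) - C;                  // the rounding happens in the addition
}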
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As for trying to achieve rounding by bit manipulation, I would stay away from that: (a) it will likely be much slower than using the functions above (they generally use hardware facilities where possible), (b) it is reinventing the wheel with a lot of potential for bugs, and (c) the newer C standards don't like you doing bit manipulation on floating-point types: they use the so-called strict aliasing rules, which disallow you from just casting a double* to a uint64_t*. You would either need to do your bit manipulation by casting to an unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
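For reference, the memcpy() route mentioned above looks like this (a minimal sketch; the helper names are mine):
#include <cstdint>
#include <cstring>

// strict-aliasing-safe views of a double's bit pattern
uint64_t bits_of(double d)
{
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

double double_of(uint64_t u)
{
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}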
You want to round x to the nearest value of the form d.5. For a general number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
If you're looking for a bit trick and are guaranteed to have doubles in the range you describe, then you could do something like this (inline as you see fit):
#include <cstdint>
#include <cstring>

void RoundNearestHalf(double &d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);                               // strict-aliasing-safe
    unsigned const maskshift = (unsigned)((bits >> 52) & 0x7FF) - 1023u;
    uint64_t const setmask   = 0x0008000000000000ULL >> maskshift;     // the 2^-1 bit
    uint64_t const clearmask = ~(0x0007FFFFFFFFFFFFULL >> maskshift);  // keep bits >= 2^-1
    bits |= setmask;
    bits &= clearmask;
    std::memcpy(&d, &bits, sizeof bits);
}
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but only up to 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and a clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library, to determine whether or not it's actually faster.
I can't speak about C++ for sure, but in C99 the use of the IEEE 754 standard for floating point is optional (not required). In C99, if the __STDC_IEC_559__ macro is defined, then it declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
I think it should be pointed out that there are functions to handle many types of rounding for you.

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, and addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output is drastically affected.
Currently I am using double for each element, which I believe uses 8 bytes of memory and has an accuracy of 16 digits, irrespective of decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books and I could find long double.
But I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type, and there is (if memory serves) a compiler option to set long double to it.
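A minimal sketch of using __float128 with GCC (assuming libquadmath is available; link with -lquadmath):
#include <quadmath.h>
#include <cstdio>

int main()
{
    __float128 x = 1.0Q / 3.0Q;  // Q suffix: __float128 literal (GCC extension)
    char buf[128];
    quadmath_snprintf(buf, sizeof buf, "%.33Qg", x); // 33 significant digits
    std::printf("%s\n", buf);
    return 0;
}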
You might want to consider the sequence of operations, i.e., doing the additions in an ordered sequence starting with the smallest values first. This will increase the overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number makes them disappear, so the latter approach reduces the numerical error.
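A quick sketch of that idea, sorting by magnitude before accumulating (the helper name is mine):
#include <algorithm>
#include <cmath>
#include <vector>

double sum_small_first(std::vector<double> v)
{
    // ascending magnitude: small terms accumulate before meeting large ones
    std::sort(v.begin(), v.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}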
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures the precision of long double is 80 bits.
What kind of values do you want to represent? Maybe you are better off using fixed precision.

Why is there no 2-byte float and does an implementation already exist?


Converting float to double

How expensive is the conversion of a float to a double? Is it as trivial as an int to long conversion?
EDIT: I'm assuming a platform where float is 4 bytes and double is 8 bytes.
Platform considerations
This depends on the platform used for float computation. With the x87 FPU, the conversion is free, as the register content is the same; the only price you may sometimes pay is the memory traffic, but in many cases there is even no traffic, as you can simply use the value without any conversion. x87 is actually a strange beast in this respect: it is hard to properly distinguish between floats and doubles on it, as the instructions and registers used are the same; what differs is the load/store instructions, and the computation precision itself is controlled using status bits. Using mixed float/double computations may produce unexpected results (and there are compiler command-line options to control the exact behaviour and optimization strategies because of this).
When you use SSE (and sometimes Visual Studio uses SSE by default), it may be different, as you may need to transfer the value in the FPU registers or do something explicit to perform the conversion.
Memory savings performance
As a summary, and answering your comment elsewhere: if you want to store the results of float computations into 32-bit storage, the result will be the same speed or faster, because:
If you do this on x87, the conversion is free; the only difference is that fstp dword[] will be used instead of fstp qword[].
If you do this with SSE enabled, you may even see some performance gain, as some float computations can be done with SSE once the precision of the computation is only float instead of the default double.
In all cases the memory traffic is lower.
Float to double conversions happen for free on some platforms (PPC, x86 if your compiler/runtime uses the "to hell with what type you told me to use, i'm going to evaluate everything in long double anyway, nyah nyah" evaluation mode).
On an x86 environment where floating-point evaluation is actually done in the specified type using SSE registers, conversions between float and double are about as expensive as a floating-point add or multiply (i.e., unlikely to be a performance consideration unless you're doing a lot of them).
In an embedded environment that lacks hardware floating-point, they can be somewhat costly.
I can't imagine it'd be too much more complex. The big difference between converting int to long and converting float to double is that the int types have two components (sign and value) while floating point numbers have three components (sign, mantissa, and exponent).
"IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits."
-- David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic
So, converting from float to double keeps the same sign bit, copies the float's 23 mantissa bits into the most significant bits of the double's 52-bit mantissa (zero-filling the rest), and re-biases the float's 8-bit exponent (bias 127) into the double's 11-bit exponent field (bias 1023).
This behavior may even be guaranteed by IEEE 754... I haven't checked it, so I'm not sure.
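To make the mechanics concrete, here is a hand-rolled widening sketch (normal numbers only; zeros, denormals, infinities, and NaNs would need extra cases):
#include <cstdint>
#include <cstring>

double widen(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    uint64_t sign = (uint64_t)(u >> 31) << 63;
    uint64_t exp  = (uint64_t)(((u >> 23) & 0xFFu) + (1023 - 127)) << 52; // re-bias
    uint64_t man  = (uint64_t)(u & 0x7FFFFFu) << 29; // 23 bits into the top of 52
    uint64_t bits = sign | exp | man;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}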
This is specific to the C++ implementation you are using. In C++, the default floating-point type is double, and a compiler may issue a warning for the following code:
float a = 3.45;
because the double value 3.45 is being assigned to a float. If you need to use float specifically, suffix the value with f:
float a = 3.45f;
The point is, all floating-point literals are double by default. It's safe to stick to this default if you are not sure of the implementation details of your compiler and don't have a significant understanding of floating-point computation. Avoid the cast.
Also see section 4.5 of The C++ Programming Language.
Probably a bit slower than converting int to long, as the memory required is larger and the manipulation is more complex. A good reference about memory alignment issues.
Maybe this helps:
#include <cstdio>
#include <cstdlib>

double _ftod(float fValue)
{
    char czDummy[30];
    std::snprintf(czDummy, sizeof czDummy, "%9.5f", fValue); // format with 5 decimals
    return std::strtod(czDummy, NULL);                       // parse back as a double
}

int main()
{
    float fValue(250.84f);
    double dValue  = _ftod(fValue);   // round-trip through text
    double dValue2 = fValue;          // direct conversion
    std::printf("%f\n", dValue);      // 250.840000
    std::printf("%f\n", dValue2);     // 250.839996
    return 0;
}