How expensive is the conversion of a float to a double? Is it as trivial as an int to long conversion?
EDIT: I'm assuming a platform where float is 4 bytes and double is 8 bytes.
Platform considerations
This depends on the platform used for float computation. With the x87 FPU the conversion is free, as the register content is the same; the only price you may sometimes pay is memory traffic, and in many cases there is no traffic at all, since you can simply use the value without any conversion. x87 is actually a strange beast in this respect: it is hard to properly distinguish between floats and doubles on it, as the instructions and registers used are the same; what differs are the load/store instructions, and the computation precision itself is controlled by status bits. Mixing float/double computations may produce unexpected results (and there are compiler command-line options to control the exact behaviour and optimization strategies because of this).
When you use SSE (and sometimes Visual Studio uses SSE by default), it may be different, as you may need to move the value between registers or issue an explicit conversion instruction.
Memory savings performance
To summarize, and to answer your comment elsewhere: if you want to store the results of float computations in 32-bit storage, the result will be the same speed or faster, because:
If you do this on x87, the conversion is free: the only difference is that fstp dword[] is used instead of fstp qword[].
If you do this with SSE enabled, you may even see a performance gain, as some float computations can be done in single precision once the precision of the computation is float instead of the default double.
In all cases the memory traffic is lower.
Float to double conversions happen for free on some platforms (PPC, and x86 if your compiler/runtime uses the "to hell with what type you told me to use, I'm going to evaluate everything in long double anyway, nyah nyah" evaluation mode).
On an x86 environment where floating-point evaluation is actually done in the specified type using SSE registers, conversions between float and double are about as expensive as a floating-point add or multiply (i.e., unlikely to be a performance consideration unless you're doing a lot of them).
In an embedded environment that lacks hardware floating-point, they can be somewhat costly.
I can't imagine it'd be too much more complex. The big difference between converting int to long and converting float to double is that the int types have two components (sign and value) while floating point numbers have three components (sign, mantissa, and exponent).
IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.
-- David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic
So, converting from float to double keeps the same sign bit, copies the float's 23-bit mantissa into the top 23 bits of the double's 52-bit mantissa (padding the rest with zeros), and re-biases the exponent (the float's 8-bit exponent uses a bias of 127, the double's 11-bit exponent a bias of 1023, so the unbiased value is preserved).
This behavior is guaranteed by IEEE 754: every value representable as a float is exactly representable as a double, so the conversion is exact.
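To make the layout concrete, here is a minimal sketch of doing the widening by hand with bit operations. It assumes IEEE-754 binary32/binary64 and handles only normal numbers and zero; subnormals, infinities, and NaNs would need extra cases (the function name widen is made up for this example):

#include <cstdint>
#include <cstdio>
#include <cstring>

double widen(float f)  // normal numbers and zero only
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);

    uint64_t sign     = bits >> 31;
    uint64_t exponent = (bits >> 23) & 0xFF; // biased by 127
    uint64_t mantissa = bits & 0x7FFFFF;     // 23 stored bits

    uint64_t out;
    if (exponent == 0 && mantissa == 0)
        out = sign << 63;                         // +/- zero
    else
        out = (sign << 63)
            | ((exponent - 127 + 1023) << 52)     // re-bias the exponent
            | (mantissa << (52 - 23));            // left-align the mantissa

    double d;
    std::memcpy(&d, &out, sizeof d);
    return d;
}

int main()
{
    std::printf("%.17g\n", widen(250.84f)); // 250.83999633789062, same as (double)250.84f
}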
Whether you get a warning is specific to the C++ implementation you are using, but in C++ the type of an unsuffixed floating-point literal is double. A compiler may issue a warning for the following code:
float a = 3.45;
because the double value 3.45 is being assigned to a float. If you need to use float specifically, suffix the value with f:
float a = 3.45f;
The point is, floating-point literals are double by default. It's safe to stick to this default if you are not sure of the implementation details of your compiler and don't have a significant understanding of floating-point computation. Avoid the cast.
Also see section 4.5 of The C++ Programming Language.
Probably a bit slower than converting int to long, as more memory is involved and the manipulation is more complex; see a good reference about memory alignment issues.
Maybe this helps:
#include <stdlib.h>
#include <stdio.h>
#include <conio.h>

double _ftod(float fValue)
{
    char czDummy[30];
    sprintf(czDummy, "%9.5f", fValue); // was printf: format the float into the buffer
    double dValue = strtod(czDummy, NULL);
    return dValue;
}

int main(int argc, char* argv[])
{
    float fValue = 250.84f;
    double dValue = _ftod(fValue);  // round-trips through decimal text, re-rounding the value
    double dValue2 = fValue;        // direct conversion preserves the float's exact value
    printf("%f\n", dValue);         // 250.840000
    printf("%f\n", dValue2);        // 250.839996
    getch();
    return 0;
}
Related
Assuming I am really pressed for memory and want a smaller range (similar to short vs int). Shader languages already support half for a floating-point type with half the precision (not just convert back and forth for the value to be between -1 and 1, that is, return a float like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?
I am also interested to know any (historical?) reasons as to why there is no 2-byte float.
TL;DR: 16-bit floats do exist, and there are various software as well as hardware implementations.
There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, obviously anyone who knows the spec can write an implementation. Some examples:
https://github.com/ramenhut/half
https://github.com/minhhn2910/cuda-half2
https://github.com/tianshilei1992/half_precision
https://github.com/acgessler/half_float
Or, if you don't want to use any of them, you can design a different 16-bit float format and implement it yourself.
2-byte floats are generally not used, because even float's precision is often not enough for normal operations, and double should be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when written without a suffix in C and C-like languages. See
Why are double preferred over float?
Should I use double or float?
When do you use float and when do you use double
However, less-than-32-bit floats do exist. They're mainly used for storage, as in graphics where 96 bits per pixel (32 bits per channel * 3 channels) is far too wasteful, and are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0 as well as some GPUs like the Radeon R300 and R420 have a 24-bit float format. A 24-bit float is also supported by compilers for some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful, but due to their simplicity they're often taught in computer science curricula. Besides, a small float is also used in ARM's instruction encoding for small floating-point immediates.
The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half-precision, with a 5-bit exponent and an 11-bit significand (10 bits stored explicitly plus the hidden bit).
Some compilers had support for IEEE-754 binary16, but mainly for conversion or vectorized operations, not for computation (because half precision is rarely precise enough). For example, ARM's toolchain has __fp16, which comes in 2 variants, IEEE and alternative, depending on whether you want more range or NaN/Inf representations. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64
Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which is a simple truncation of IEEE-754 binary32 to its top 16 bits, has become common.
The motivation behind the reduced mantissa is derived from Google's experiments that showed that it is fine to reduce the mantissa so long as it's still possible to represent tiny values closer to zero as part of the summation of small differences during training. A smaller mantissa brings a number of other advantages, such as reducing the multiplier power and physical silicon area.
float32: 24² = 576 (100%)
float16: 11² = 121 (21%)
bfloat16: 8² = 64 (11%)
Many compilers like GCC and ICC have now also gained the ability to support bfloat16.
More information about bfloat16:
bfloat16 - Hardware Numerics Definition
Using bfloat16 with TensorFlow models
What is tf.bfloat16 "truncated 16-bit floating point"?
In cases where bfloat16 is not enough, there's also the rise of a new 19-bit type called TensorFloat (NVIDIA's TF32).
Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c
Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).
If you're low on memory, did you consider dropping the float concept? Floats use up a lot of bits just for saving where the decimal point is. You can work around this if you know where you need the decimal point. Say you want to store a dollar value; you could just store it in cents:
#include <cstdint>
#include <iostream>
uint16_t cash = 50000; // 50000 cents = $500.00
std::cout << "Cash: $" << (cash / 100) << "." << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;
That is of course only an option if you can predetermine the position of the decimal point. But if you can, always prefer it, because this also speeds up all calculations!
There is an IEEE 754 standard for 16-bit floats.
It's a relatively new format, standardized in 2008, based on a format already used in a GPU released in 2002.
To go a bit further than Kiralein on switching to integers, we could define a range and permit the integer values of a short to represent equal divisions over the range, with some symmetry if straddling zero:
short mappedval = (short)(val / range); // here range is the width of one division (the quantization step)
Differences between these integer versions and using half precision floats:
Integers are equally spaced over the range, whereas floats are more densely packed near zero
Using integers means integer math in the CPU rather than floating point. That is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions, etc., to retrieve the value at the end.
The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int. This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing, where 32 bits are acted on in parallel, which is used in crypto; a minimal sketch follows.
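To illustrate that last point, a minimal sketch of packing two 16-bit lanes into one 32-bit word (the helper names pack, hi, and lo are made up for this example). Bitwise operations then act on both lanes at once; addition would need the carries between lanes isolated first:

#include <cstdint>
#include <cstdio>

uint32_t pack(uint16_t h, uint16_t l) { return (uint32_t)h << 16 | l; }
uint16_t hi(uint32_t p) { return (uint16_t)(p >> 16); }
uint16_t lo(uint32_t p) { return (uint16_t)p; }

int main()
{
    uint32_t a = pack(0x0F0F, 0x1234);
    uint32_t b = pack(0x00FF, 0x00FF);
    uint32_t m = a & b;                       // one AND masks both lanes
    std::printf("%04X %04X\n", hi(m), lo(m)); // prints 000F 0034
}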
If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:
// needs to be compiled with -mf16c enabled
#include <immintrin.h>
#include <cstdint>
#include <istream>

struct float16
{
private:
    uint16_t _value;
public:
    inline float16() : _value(0) {}
    inline float16(const float16&) = default;
    inline float16(float16&&) = default;
    inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

    inline float16& operator = (const float16&) = default;
    inline float16& operator = (float16&&) = default;
    inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

    inline operator float () const
    { return _cvtsh_ss(_value); }

    inline friend std::istream& operator >> (std::istream& input, float16& h)
    {
        float f = 0;
        input >> f;
        h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
        return input;
    }
};
Maths is still performed using 32-bit floats (the F16C extension only provides conversions between 16-bit and 32-bit floats; no instructions exist to compute arithmetic with 16-bit floats).
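A quick usage sketch, assuming the float16 struct above and a CPU with F16C (compile with -mf16c):

#include <cstdio>

int main()
{
    float16 h = 3.14159f;   // rounded to the nearest binary16 value on store
    float f = h;            // widened back to 32 bits for arithmetic
    std::printf("%f\n", f); // prints 3.140625
}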
There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only true right now, but it probably won't get smaller. Terms like half and long mostly become meaningless with time. With 128- or 256-bit computers they could come to mean anything.
I'm working with images (1+1+1 byte/pixel) and I want to express each pixel's value relative to the average. So floating point or carefully fixed point, but not 4 times as big as the raw data please. A 16-bit float sounds about right.
GCC 7.3 doesn't know "half"; maybe it exists in a C++ context.
A 2-byte float is available in the Clang C compiler; the data type is represented as __fp16.
Various compilers now support three different half precision formats:
__fp16 is mostly used as a storage format. It is promoted to float as soon as you do calculations on it. Calculations on __fp16 will give a float result. __fp16 has a 5-bit exponent and a 10-bit mantissa.
_Float16 is the same as __fp16, but used as an interchange and arithmetic format. Calculations on _Float16 will give a _Float16 result.
__bf16 is a storage format with less precision. It has an 8-bit exponent and a 7-bit mantissa.
All three types are supported by compilers for the ARM architecture, and now also by compilers for x86 processors. The AVX512_FP16 instruction set extension will be supported by Intel's forthcoming Golden Cove processors, and it is supported by the latest Clang, GNU, and Intel compilers. Vectors of _Float16 are defined as __m128h, __m256h, and __m512h on compilers that support AVX512_FP16.
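A minimal sketch of the storage vs. arithmetic distinction, assuming a compiler and target where both types are enabled (for example, recent Clang targeting AArch64):

// __fp16 is storage-only: operands are promoted, so the divide is a float divide.
float storage_demo()
{
    __fp16 a = 1.0f, b = 3.0f;
    return a / b;
}

// _Float16 is an arithmetic type: the divide is computed and rounded in 16 bits.
_Float16 arithmetic_demo()
{
    _Float16 c = 1.0f, d = 3.0f;
    return c / d;
}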
References:
https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-data-types
https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether its expansion terminates or repeats (is cyclic).
Or, what happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float/double cannot represent exactly (at least not in our binary computation paradigm); see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa", or binary digits after the radix point (the binary analogue of the decimal point). Since the bit before the point is always one for normalized numbers, this amounts to a 24-bit significand. Dividing 24 by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.
Following the same process, double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes up to ~10^±38, double up to ~10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
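The digit counts above fall out of a one-line formula: p significand bits carry about p / log2(10) decimal digits. A quick sketch:

#include <cmath>
#include <cstdio>

int main()
{
    const int bits[] = {24, 53, 64}; // float, double, Intel 80-bit long double
    for (int p : bits)
        std::printf("%2d bits -> %.1f decimal digits\n", p, p / std::log2(10.0));
    // prints roughly 7.2, 16.0, and 19.3
}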
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store the value in both a float and a double variable and then compare the two? Converting the float back to double is exact, so if there is no difference, the float is sufficient:
float f = value;
double d = value;
if ((double)f == d)
{
// float is sufficient
}
You cannot represent real numbers exactly with float or double variables, only a subset of the rational numbers.
When you do floating-point computation, your CPU's floating-point unit will decide the best approximation for you.
I might be wrong, but I thought that float (4 bytes) and double (8 bytes) floating-point representations were actually specified independently of computer architectures.
C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float, which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later, when the IEEE standard made allowances for the Intel 8087 floating-point coprocessor, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
The questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.
You probably knew this, but you can make float and long double literals:
float f = 4.0f;
long double ld = 4.0L;
Double is the default because that's what most people use. Long doubles may be overkill, and floats have rather limited precision. Double works for almost every application.
Why the naming? One day all we had was 32-bit floating point numbers (well, really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language du jour, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the CPUs/FPUs of the day, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed a name for something twice the size of a float, hence we got a double. Then someone needed even more precision; we thought they were crazy. We added it anyway. The name "quadruple"(?) was overkill, "long double" was good enough, and nobody made a lot of noise.
Part of the confusion is that good ol' "int" seems to change with the times. It used to be that "int" meant a 16-bit integer. Float, however, is bound to the IEEE standard as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bits and made double and long double refer to the longer standards.
Literals
The problem is that the literal 4.0 is a double, not a float, which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select the smallest one large enough to hold the value, with some added complexity for signed/unsigned), with floats it is not that easy. Many values (including simple ones like 0.1) cannot be exactly represented by float numbers, so the choice of type affects not only performance but also the result value. It seems the C language designers preferred robustness over performance in this case, and therefore decided the default representation should be the more exact one.
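To see why the more exact default matters, compare the same literal stored as float and as double, printing enough digits to expose the rounding:

#include <cstdio>

int main()
{
    std::printf("%.17g\n", 0.1f); // 0.10000000149011612
    std::printf("%.17g\n", 0.1);  // 0.10000000000000001
}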
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.
First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.
The two most common floating point formats use 32 bits and 64 bits; the longer one is double the size of the first, so it was called a "double".
A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double precision float has a 52-bit mantissa, whereas a single precision float has a 23-bit mantissa (double that would be 46, not 52). More on floating point here: Floating Point - Wikipedia, including links at the bottom to articles on single and double precision floats.
The name long double is likely just down to the same tradition as long integer vs. short integer for integral types, except in this case they reversed it, since "int" is equivalent to "long int".
In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast to this to floating-point representations where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations, and double referred to floating-point representations with 64-bit binary representations, or double the size, whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly named, as they are represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.
They're called single-precision and double-precision because they're related to the natural word size of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that, 64 bits long. They just decided to call the single-precision type "float" in C.
double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.
Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks, which were called double words. 4-byte blocks were words and, sure enough, 2-byte blocks were halfwords.
The implementation of floating-point numbers was such that they fit into single words or double words. To remain consistent, the doubleword floating-point number was then called a "double".
It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be more precise.
Hence the %f for a float type, and %lf for a "long float", which is the same as double. (In printf, float arguments are promoted to double, so %f covers both; the distinction matters in scanf.)