Initializing double at compile-time - c++

I'm writing a compile-time implementation of floating-point arithmetic through template metaprogramming. My implementation has the following characteristics:
16-bit signed integer exponent.
32-bit unsigned integer mantissa, with no implicit most-significant 1 (that's done to simplify debugging).
The type is as follows:
template<bool S , std::int16_t E , std::uint32_t M>
struct number
{
static constexpr const bool sign = S;
static constexpr const std::int16_t exponent = E;
static constexpr const std::uint32_t mantissa = M;
};
The operations work well, but now I need a method to extract those values at compile time and get the corresponding double values. Since the goal of compile-time arithmetic is to speed up computation by injecting the results directly into the executable, I need a way to effectively initialize a double constant at compile time.
So simple solutions involving std::pow( 2.0 , E ) are not allowed.
As far as I know, double-precision IEEE 754 floats have a 10-bit signed exponent and a 53-bit-wide unsigned integer mantissa. My attempted solution was to use type punning via a union:
template<bool S , std::int16_t E , std::uint32_t M>
struct to_runtime<tml::floating::number<S,E,M>>
{
static constexpr const long unsigned int mantissa = M << (53 - 32);
static constexpr const int exponent = E + (53 - 32);
struct double_parts
{
unsigned int sign : 1;
int exponent : 10;
long unsigned int mantissa : 53;
};
union double_rep
{
double d;
double_parts parts;
};
static constexpr const double_parts parts = { .sign = ((bool)S) ? 0 : 1 , .exponent = exponent , .mantissa = mantissa };
static constexpr const double_rep rep = { .parts = parts };
static constexpr double execute()
{
return rep.d;
}
};
But this solution is not portable and invokes undefined behaviour (when type punning this way we read a union member other than the one last written), and I also have some issues with the conversion itself (this solution doesn't return the correct number).
Is there any other way to initialize a double at compile-time given my data (sign, exponent, mantissa)?

You may implement a constexpr pow2(std::int16_t), something like:
constexpr double pow2(std::int16_t e)
{
return e == 0 ? 1. :
e > 0 ? 2. * pow2(std::int16_t(e - 1)) :
0.5 * pow2(std::int16_t(e + 1));
}
or
constexpr double pow2(std::int16_t e)
{
return e == 0 ? 1. :
((e & 1) ? (e > 0 ? 2. : 0.5) : 1.)
* pow2(std::int16_t(e / 2))
* pow2(std::int16_t(e / 2));
}
And then
template<bool S , std::int16_t E , std::uint32_t M>
struct number
{
static constexpr const double value = (S ? -1. : 1.) * M * pow2(E);
};
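For reference, a couple of compile-time checks of the above (values chosen so the results are exact in binary floating point):
static_assert(number<false, -1, 3>::value == 1.5, "3 * 2^-1");
static_assert(number<true, 2, 5>::value == -20.0, "-(5 * 2^2)");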

Related

Create compile-time constants with first N bits set

In C++17, is there a way to generate, at compile time, a constant with the first N bits set?
In pseudocode, I am looking for something like:
constexpr uint32_t MY_CONSTANT = setBits<2>();
Would be equivalent to:
constexpr uint32_t MY_CONSTANT = 0b11;
In other words, given a compile-time constant N, return a compile-time constant M where bits 0 to (N-1) are 1 (set).
I don't think there's a ready-made function for it in the standard library (although std::bitset::set is constexpr since C++23). You could make your own though:
#include <climits> // CHAR_BIT
#include <cstddef> // std::size_t
#include <cstdint> // std::uint32_t

template<class T, std::size_t N>
constexpr T setBits() {
if constexpr (N == sizeof(unsigned long long) * CHAR_BIT) return ~T{};
else return static_cast<T>((1ull << N) - 1);
}
constexpr auto MY_CONSTANT = setBits<std::uint32_t, 2>();
Example for setBits<std::uint8_t, 2>():
0b00000001
<< 2
-------------
= 0b00000100
0b00000100
- 1
-------------
= 0b00000011
Or negate 0 to get all bits set and right shift away all but N bits:
template<class T, std::size_t N>
constexpr T setBits() {
if constexpr (N == 0) return 0;
else return ~T{} >> (sizeof(T) * CHAR_BIT - N);
}
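For illustration, a few compile-time checks that both versions are expected to satisfy (assuming unsigned long long is 64 bits wide):
static_assert(setBits<std::uint32_t, 0>() == 0u, "no bits set");
static_assert(setBits<std::uint32_t, 2>() == 0b11u, "two low bits set");
static_assert(setBits<std::uint64_t, 64>() == ~0ull, "all 64 bits set");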

Signed int from bitset<n>

How can I convert a given bitset of length N (where 0 < N < 64) to a signed int? For instance, given:
std::bitset<13> b("1111111101100");
I would like to get back the value -20, not 8172.
My approach:
int t = (static_cast<int>(b.to_ullong()));
if(t > pow(2, 13)/2)
t -= pow(2, 13);
Is there a more generic way to approach this?
Edit: Also the bitset is actually std::bitset<64> and the N can be run-time known value passed by other means.
We can write a function template to do this for us:
template <size_t N, class = std::enable_if_t<(N > 0 && N < 64)>>
int64_t as_signed(const std::bitset<N>& b)
{
int64_t v = b.to_ullong(); // safe since we know N < 64
return b[N-1] ? v - (1LL << N) : v;
}
Perhaps it is best to let the compiler sign-extend it itself:
struct S { int64_t x:N; } s;
int64_t result = s.x = b.to_ullong();
The compiler will likely optimize that s away.
It is safe since int64_t (where available) is required to use two's complement representation.
Edit: When the actual bit count to extend is only known at run time, the most portable algorithm uses a mask:
// Do this if bits at or above position N in b may be nonzero, to clear them.
int64_t x = b.to_ullong() & ((1ULL << N) - 1);
// Otherwise just
int64_t x = b.to_ullong();
int64_t const mask = 1ULL << (N - 1);
int64_t result = (x ^ mask) - mask;
A slightly faster but less portable method for dynamic bit counts uses bit shifts (it works when the architecture performs arithmetic right shift on signed values):
int const shift = 64 - N;
int64_t result = ((int64_t)b.to_ullong() << shift) >> shift;
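Putting the two run-time variants together on the question's example (the 13-bit pattern 1111111101100, i.e. 8172 unsigned), a minimal sketch:
#include <bitset>
#include <cassert>
#include <cstdint>

int main() {
    std::bitset<64> b(0x1FECULL); // 1111111101100, i.e. 8172 unsigned
    unsigned N = 13;              // bit count known only at run time

    // Mask method (portable).
    std::int64_t x = static_cast<std::int64_t>(b.to_ullong() & ((1ULL << N) - 1));
    std::int64_t const mask = 1LL << (N - 1);
    std::int64_t r1 = (x ^ mask) - mask;

    // Shift method (less portable: relies on arithmetic right shift of signed values).
    int const shift = 64 - static_cast<int>(N);
    std::int64_t r2 = (static_cast<std::int64_t>(b.to_ullong()) << shift) >> shift;

    assert(r1 == -20 && r2 == -20);
}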

Alternative to reinterpret_cast with constexpr functions

Below you will find a constexpr computation of CRC32 for a string literal.
I had to reinterpret the string literal characters from char to unsigned char. Because reinterpret_cast is not available in a constexpr function, the workaround is a small utility function that applies two's complement manually, but I am a little disappointed with it.
Is there a more elegant solution for that kind of manipulation?
#include <cstdint> // uint32_t
#include <iostream>
class Crc32Gen {
uint32_t m_[256] {};
static constexpr unsigned char reinterpret_cast_schar_to_uchar( char v ) {
return v>=0 ? v : ~(v-1);
}
public:
// algorithm from http://create.stephan-brumme.com/crc32/#sarwate
constexpr Crc32Gen() {
constexpr uint32_t polynomial = 0xEDB88320;
for (unsigned int i = 0; i <= 0xFF; i++) {
uint32_t crc = i;
for (unsigned int j = 0; j < 8; j++)
crc = (crc >> 1) ^ (-int(crc & 1) & polynomial);
m_[i] = crc;
}
}
constexpr uint32_t operator()( const char* data ) const {
uint32_t crc = ~0;
while (auto c = reinterpret_cast_schar_to_uchar(*data++))
crc = (crc >> 8) ^ m_[(crc & 0xFF) ^ c];
return ~crc;
}
};
constexpr Crc32Gen const crc32Gen_;
int main() {
constexpr auto const val = crc32Gen_( "The character code for É is greater than 127" );
std::cout << std::hex << val << std::endl;
}
Edit : in that case, static_cast<unsigned char>(*data++) is enough.
Two's complement is not guaranteed by the standard; in clause 3.9.1:
7 - [...] The representations of integral types shall define values by use of a pure binary numeration system. [Example: this International Standard permits 2's complement, 1's complement and signed magnitude representations for integral types. — end example]
So any code that assumes two's complement is going to have to perform the appropriate manipulations manually.
That said, your conversion function is unnecessary (and possibly incorrect); for signed-to-unsigned conversions you can just use the standard integral conversion (4.7):
2 - If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). — end note ]
Corrected code, using static_cast:
constexpr uint32_t operator()( const char* data ) const {
uint32_t crc = ~0;
while (auto c = static_cast<unsigned char>(*data++))
crc = (crc >> 8) ^ m_[(crc & 0xFF) ^ c];
return ~crc;
}
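For what it's worth, a minimal compile-time check of that modular conversion rule (assuming CHAR_BIT == 8, so unsigned char wraps modulo 256):
constexpr char c = char(-55); // what 'É' (0xC9 in Latin-1) looks like when char is a signed 8-bit type
static_assert(static_cast<unsigned char>(c) == 201, "least unsigned integer congruent to -55 modulo 256");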

std::ratio power of a std::ratio at compile-time?

I have a challenging question from a mathematical, algorithmic and metaprogramming recursion point of view. Consider the following declaration:
template<class R1, class R2>
using ratio_power = /* to be defined */;
based on the example of the std::ratio operations like std::ratio_add. Given two std::ratio types R1 and R2, this operation should compute R1^R2 if and only if R1^R2 is a rational number. If it is irrational, the implementation should fail, just as when one tries to multiply two very big ratios and the compiler reports an integer overflow.
Three questions:
Do you think this is possible without exploding the compilation time?
What algorithm to use?
How to implement this operation?
You need two building blocks for this calculation:
the n-th power of an integer at compile-time
the n-th root of an integer at compile-time
Note: I use int as the type for numerator and denominator to save some typing; I hope the main point comes across. I extracted the following code from a working implementation, but I cannot guarantee that I did not make a typo somewhere ;)
The first one is rather easy: You use x^(2n) = x^n * x^n or x^(2n+1) = x^n * x^n * x
That way, you instantiate the fewest templates; e.g. x^39 is calculated something like this:
x^39 = x^19 * x^19 * x
x^19 = x^9 * x^9 * x
x^9 = x^4 * x^4 * x
x^4 = x^2 * x^2
x^2 = x^1 * x^1
x^1 = x^0 * x
x^0 = 1
template <int Base, int Exponent>
struct static_pow
{
static const int temp = static_pow<Base, Exponent / 2>::value;
static const int value = temp * temp * (Exponent % 2 == 1 ? Base : 1);
};
template <int Base>
struct static_pow<Base, 0>
{
static const int value = 1;
};
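A few sanity checks that follow directly from the recursion above:
static_assert(static_pow<2, 10>::value == 1024, "even exponent path");
static_assert(static_pow<3, 5>::value == 243, "odd exponent path");
static_assert(static_pow<7, 0>::value == 1, "base case");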
The second one is a bit tricky and works with a bracketing algorithm:
Given x and N we want to find a number r so that r^N = x
set the interval [low, high] that contains the solution to [1, 1 + x / N]
calculate the midpoint mean = (low + high) / 2
determine, if mean^N >= x
if yes, set the interval to [low, mean]
if not, set the interval to [mean+1, high]
if the interval contains only one number, the calculation is finished
otherwise, iterate again
This algorithm gives the smallest integer s that fulfills s^N >= x.
So check whether s^N == x. If yes, the N-th root of x is integral; otherwise it is not.
Now let's write that as a compile-time program:
basic interface:
template <int x, int N>
struct static_root : static_root_helper<x, N, 1, 1 + x / N> { };
helper:
template <int x, int N, int low, int high>
struct static_root_helper
{
static const int mean = (low + high) / 2;
static const bool is_left = calculate_left<mean, N, x>::value;
static const int value = static_root_helper<x, N, (is_left ? low : mean + 1), (is_left ? mean : high)>::value;
};
endpoint of recursion where the interval consists of only one entry:
template <int x, int N, int mid>
struct static_root_helper<x, N, mid, mid>
{
static const int value = mid;
};
Helper to detect multiplication overflow (you can exchange the boost header for C++11 constexpr numeric_limits, I think). Returns true if the multiplication a * b would overflow.
#include "boost/integer_traits.hpp"
template <typename T, T a, T b>
struct mul_overflow
{
static_assert(std::is_integral<T>::value, "T must be integral");
static const bool value = (a > boost::integer_traits<T>::const_max / b);
};
Now we need to implement calculate_left, which calculates whether the N-th root of x lies at or to the left of mean (i.e. whether mean^N >= x). We want to be able to calculate arbitrary roots, so a naive implementation like static_pow<mean, N>::value >= x will overflow very quickly and give wrong results. Therefore we use the following scheme:
We want to calculate if x^N > B
set A = x and i = 1
if A >= B we are already finished -> A^N will surely be larger than B
will A * x overflow?
if yes -> A^N will surely be larger than B
if not -> A *= x and i += 1
if i == N, we are finished and we can do a simple comparison to B
Now let's write this as a metaprogram:
template <int A, int N, int B>
struct calculate_left : calculate_left_helper<A, 1, A, N, B, (A >= B)> { };
template <int x, int i, int A, int N, int B, bool short_circuit>
struct calculate_left_helper
{
static const bool overflow = mul_overflow<int, x, A>::value;
static const int next = calculate_next<x, A, overflow>::value;
static const bool value = calculate_left_helper<next, i + 1, A, N, B, (overflow || next >= B)>::value;
};
endpoint where i == N
template <int x, int A, int N, int B, bool short_circuit>
struct calculate_left_helper<x, N, A, N, B, short_circuit>
{
static const bool value = (x >= B);
};
endpoints for short-circuit
template <int x, int i, int A, int N, int B>
struct calculate_left_helper<x, i, A, N, B, true>
{
static const bool value = true;
};
template <int x, int A, int N, int B>
struct calculate_left_helper<x, N, A, N, B, true>
{
static const bool value = true;
};
Helper to calculate the next value x * A; it takes overflow into account to eliminate compiler warnings:
template <int a, int b, bool overflow>
struct calculate_next
{
static const int value = a * b;
};
template <int a, int b>
struct calculate_next<a, b, true>
{
static const int value = 0; // any value will do here, calculation will short-circuit anyway
};
So, that should be it. We need an additional helper
template <int x, int N>
struct has_integral_root
{
static const int root = static_root<x, N>::value;
static const bool value = (static_pow<root, N>::value == x);
};
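Assuming the helpers above compile as intended, a couple of checks illustrate the idea:
static_assert(static_root<64, 3>::value == 4, "cube root of 64");
static_assert(has_integral_root<64, 3>::value, "64 = 4^3");
static_assert(!has_integral_root<2, 2>::value, "sqrt(2) is not an integer");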
Now we can implement ratio_pow as follows:
template <typename, typename> struct ratio_pow;
template <int N1, int D1, int N2, int D2>
struct ratio_pow<std::ratio<N1, D1>, std::ratio<N2, D2>>
{
// ensure that all roots are integral
static_assert(has_integral_root<std::ratio<N1, D1>::num, std::ratio<N2, D2>::den>::value, "numerator has no integral root");
static_assert(has_integral_root<std::ratio<N1, D1>::den, std::ratio<N2, D2>::den>::value, "denominator has no integral root");
// calculate the "D2"-th root of (N1 / D1)
static const int num1 = static_root<std::ratio<N1, D1>::num, std::ratio<N2, D2>::den>::value;
static const int den1 = static_root<std::ratio<N1, D1>::den, std::ratio<N2, D2>::den>::value;
// exchange num1 and den1 if the exponent is negative and set the exp to the absolute value of the exponent
static const bool positive_exponent = std::ratio<N2, D2>::num >= 0;
static const int num2 = positive_exponent ? num1 : den1;
static const int den2 = positive_exponent ? den1 : num1;
static const int exp = positive_exponent ? std::ratio<N2, D2>::num : - std::ratio<N2, D2>::num;
//! calculate (num2 / den2) ^ "N2"
typedef std::ratio<static_pow<num2, exp>::value, static_pow<den2, exp>::value> type;
};
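Usage would then look something like this (a sketch, relying on the building blocks above):
#include <ratio>

static_assert(std::ratio_equal<ratio_pow<std::ratio<4, 9>, std::ratio<1, 2>>::type,
                               std::ratio<2, 3>>::value, "(4/9)^(1/2) == 2/3");
static_assert(std::ratio_equal<ratio_pow<std::ratio<8, 27>, std::ratio<-2, 3>>::type,
                               std::ratio<9, 4>>::value, "(8/27)^(-2/3) == 9/4");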
So, I hope at least the basic idea comes across.
Yes, it's possible.
Let's define R1 = P1/Q1, R2 = P2/Q2, and R1^R2 = R3 = P3/Q3. Assume further that each P and Q pair is coprime (the fractions are fully reduced).
R1^R2 = R1^(P2/Q2) = R3
R1 ^ P2 = R3 ^ Q2.
R1^P2 is known and has a unique factoring into primes 2^a * 3^b * 5^c * ... Note that a, b, c can be negative as R1 is P1/Q1. Now the first question is whether all of a,b,c are multiples of known factor Q2. If not, then you fail directly. If they are, then R3 = 2^(a/Q2) * 3^(b/Q2) * 5^(c/Q2) ....
All divisions are either exact or the result does not exist, so we can use pure integer math in our templates. Factoring a number is fairly straightforward in templates (partial specialization on x%y==0).
Example: 2^(1/2) = R3 -> a=1, b=0, c=0, ... and a%2 != 0 -> impossible. (1/9)^(1/2) -> a=0, b=-2, b%2 = 0, possible, result = 3^(-2/2).
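A minimal constexpr sketch of the prime-exponent extraction this approach relies on (factor_exponent is a hypothetical helper, not from any library):
// Exponent of the prime p in the factorisation of n (assumes n >= 1 and p >= 2).
constexpr int factor_exponent(long n, long p)
{
    return n % p == 0 ? 1 + factor_exponent(n / p, p) : 0;
}

static_assert(factor_exponent(72, 2) == 3, "72 = 2^3 * 3^2");
static_assert(factor_exponent(72, 3) == 2, "72 = 2^3 * 3^2");
For (1/9)^(1/2) this gives b = -2 for the prime 3, and -2 is divisible by Q2 = 2, so the root exists and equals 3^(-1) = 1/3.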

32-bit to 16-bit Floating Point Conversion

I need a cross-platform library/algorithm that will convert between 32-bit and 16-bit floating point numbers. I don't need to perform math with the 16-bit numbers; I just need to decrease the size of the 32-bit floats so they can be sent over the network. I am working in C++.
I understand how much precision I would be losing, but that's OK for my application.
The IEEE 16-bit format would be great.
Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it's branchless. It makes use of the fact that -true == ~0 to perform branchless selections (GCC converts if statements into an unholy mess of conditional jumps, while Clang just converts them to conditional moves.)
Update (2019-11-04): reworked to support single and double precision values with fully correct rounding. I also put a corresponding if statement above each branchless select as a comment for clarity. All incoming NaNs are converted to the base quiet NaN for speed and sanity, as there is no way to reliably convert an embedded NaN message between formats.
#include <cstdint> // uint32_t, uint64_t, etc.
#include <cstring> // memcpy
#include <climits> // CHAR_BIT
#include <limits> // numeric_limits
#include <type_traits> // is_integral_v, is_floating_point_v
#include <utility> // forward
namespace std
{
template< typename T , typename U >
T bit_cast( U&& u ) {
static_assert( sizeof( T ) == sizeof( U ) );
union { T t; }; // prevent construction
std::memcpy( &t, &u, sizeof( t ) );
return t;
}
} // namespace std
template< typename T > struct native_float_bits;
template<> struct native_float_bits< float >{ using type = std::uint32_t; };
template<> struct native_float_bits< double >{ using type = std::uint64_t; };
template< typename T > using native_float_bits_t = typename native_float_bits< T >::type;
static_assert( sizeof( float ) == sizeof( native_float_bits_t< float > ) );
static_assert( sizeof( double ) == sizeof( native_float_bits_t< double > ) );
template< typename T, int SIG_BITS, int EXP_BITS >
struct raw_float_type_info {
using raw_type = T;
static constexpr int sig_bits = SIG_BITS;
static constexpr int exp_bits = EXP_BITS;
static constexpr int bits = sig_bits + exp_bits + 1;
static_assert( std::is_integral_v< raw_type > );
static_assert( sig_bits >= 0 );
static_assert( exp_bits >= 0 );
static_assert( bits <= sizeof( raw_type ) * CHAR_BIT );
static constexpr int exp_max = ( 1 << exp_bits ) - 1;
static constexpr int exp_bias = exp_max >> 1;
static constexpr raw_type sign = raw_type( 1 ) << ( bits - 1 );
static constexpr raw_type inf = raw_type( exp_max ) << sig_bits;
static constexpr raw_type qnan = inf | ( inf >> 1 );
static constexpr auto abs( raw_type v ) { return raw_type( v & ( sign - 1 ) ); }
static constexpr bool is_nan( raw_type v ) { return abs( v ) > inf; }
static constexpr bool is_inf( raw_type v ) { return abs( v ) == inf; }
static constexpr bool is_zero( raw_type v ) { return abs( v ) == 0; }
};
using raw_flt16_type_info = raw_float_type_info< std::uint16_t, 10, 5 >;
using raw_flt32_type_info = raw_float_type_info< std::uint32_t, 23, 8 >;
using raw_flt64_type_info = raw_float_type_info< std::uint64_t, 52, 11 >;
//using raw_flt128_type_info = raw_float_type_info< uint128_t, 112, 15 >;
template< typename T, int SIG_BITS = std::numeric_limits< T >::digits - 1,
int EXP_BITS = sizeof( T ) * CHAR_BIT - SIG_BITS - 1 >
struct float_type_info
: raw_float_type_info< native_float_bits_t< T >, SIG_BITS, EXP_BITS > {
using flt_type = T;
static_assert( std::is_floating_point_v< flt_type > );
};
template< typename E >
struct raw_float_encoder
{
using enc = E;
using enc_type = typename enc::raw_type;
template< bool DO_ROUNDING, typename F >
static auto encode( F value )
{
using flt = float_type_info< F >;
using raw_type = typename flt::raw_type;
static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
static constexpr auto bit_diff = flt::bits - enc::bits;
static constexpr auto do_rounding = DO_ROUNDING && sig_diff > 0;
static constexpr auto bias_mul = raw_type( enc::exp_bias ) << flt::sig_bits;
if constexpr( !do_rounding ) { // fix exp bias
// when not rounding, fix exp first to avoid mixing float and binary ops
value *= std::bit_cast< F >( bias_mul );
}
auto bits = std::bit_cast< raw_type >( value );
auto sign = bits & flt::sign; // save sign
bits ^= sign; // clear sign
auto is_nan = flt::inf < bits; // compare before rounding!!
if constexpr( do_rounding ) {
static constexpr auto min_norm = raw_type( flt::exp_bias - enc::exp_bias + 1 ) << flt::sig_bits;
static constexpr auto sub_rnd = enc::exp_bias < sig_diff
? raw_type( 1 ) << ( flt::sig_bits - 1 + enc::exp_bias - sig_diff )
: raw_type( enc::exp_bias - sig_diff ) << flt::sig_bits;
static constexpr auto sub_mul = raw_type( flt::exp_bias + sig_diff ) << flt::sig_bits;
bool is_sub = bits < min_norm;
auto norm = std::bit_cast< F >( bits );
auto subn = norm;
subn *= std::bit_cast< F >( sub_rnd ); // round subnormals
subn *= std::bit_cast< F >( sub_mul ); // correct subnormal exp
norm *= std::bit_cast< F >( bias_mul ); // fix exp bias
bits = std::bit_cast< raw_type >( norm );
bits += ( bits >> sig_diff ) & 1; // add tie breaking bias
bits += ( raw_type( 1 ) << ( sig_diff - 1 ) ) - 1; // round up to half
//if( is_sub ) bits = std::bit_cast< raw_type >( subn );
bits ^= -is_sub & ( std::bit_cast< raw_type >( subn ) ^ bits );
}
bits >>= sig_diff; // truncate
//if( enc::inf < bits ) bits = enc::inf; // fix overflow
bits ^= -( enc::inf < bits ) & ( enc::inf ^ bits );
//if( is_nan ) bits = enc::qnan;
bits ^= -is_nan & ( enc::qnan ^ bits );
bits |= sign >> bit_diff; // restore sign
return enc_type( bits );
}
template< typename F >
static F decode( enc_type value )
{
using flt = float_type_info< F >;
using raw_type = typename flt::raw_type;
static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
static constexpr auto bit_diff = flt::bits - enc::bits;
static constexpr auto bias_mul = raw_type( 2 * flt::exp_bias - enc::exp_bias ) << flt::sig_bits;
raw_type bits = value;
auto sign = bits & enc::sign; // save sign
bits ^= sign; // clear sign
auto is_norm = bits < enc::inf;
bits = ( sign << bit_diff ) | ( bits << sig_diff );
auto val = std::bit_cast< F >( bits ) * std::bit_cast< F >( bias_mul );
bits = std::bit_cast< raw_type >( val );
//if( !is_norm ) bits |= flt::inf;
bits |= -!is_norm & flt::inf;
return std::bit_cast< F >( bits );
}
};
using flt16_encoder = raw_float_encoder< raw_flt16_type_info >;
template< typename F >
auto quick_encode_flt16( F && value )
{ return flt16_encoder::encode< false >( std::forward< F >( value ) ); }
template< typename F >
auto encode_flt16( F && value )
{ return flt16_encoder::encode< true >( std::forward< F >( value ) ); }
template< typename F = float, typename X >
auto decode_flt16( X && value )
{ return flt16_encoder::decode< F >( std::forward< X >( value ) ); }
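For reference, a minimal usage sketch of the convenience wrappers above (C++17, assuming the code compiles as shown):
#include <iostream>

int main()
{
    float f = 3.14159f;
    std::uint16_t h = encode_flt16( f );   // round-to-nearest half-precision bits
    float back = decode_flt16( h );        // widen back to float
    std::cout << f << " -> 0x" << std::hex << h << std::dec << " -> " << back << "\n";
}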
Of course full IEEE support isn't always needed. If your values don't require logarithmic resolution approaching zero, then linearizing them to a fixed point format is much faster, as was already mentioned.
Half to float:
uint32_t f_bits = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);
float f = *((float*)&f_bits); // reinterpret the bit pattern as a float, matching the cast used below
Float to half:
uint32_t x = *((uint32_t*)&f);
uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);
std::frexp extracts the significand and exponent from normal floats or doubles -- then you need to decide what to do with exponents that are too large to fit in a half-precision float (saturate...?), adjust accordingly, and put the half-precision number together. This article has C source code to show you how to perform the conversion.
Given your needs (-1000, 1000), perhaps it would be better to use a fixed-point representation.
//change 20000 to SHRT_MAX if you don't mind whole numbers
//being turned into fractional ones
const int compact_range = 20000;
short compactFloat(double input) {
return round(input * compact_range / 1000);
}
double expandToFloat(short input) {
return ((double)input) * 1000 / compact_range;
}
This will give you accuracy to the nearest 0.05. If you change 20000 to SHRT_MAX you'll get a bit more accuracy, but some whole numbers will end up as decimals on the other end.
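For illustration, a round trip with the functions above (round needs <cmath>):
#include <cassert>
#include <cmath>

int main() {
    double v = 123.45;
    short packed = compactFloat(v);           // round(123.45 * 20000 / 1000) == 2469
    double restored = expandToFloat(packed);  // 2469 * 1000 / 20000 == 123.45
    assert(std::fabs(restored - v) <= 0.05);  // within the stated 0.05 accuracy
    return 0;
}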
Why so over-complicated? My implementation does not need any additional library, complies with the IEEE-754 FP16 format, manages both normalized and denormalized numbers, is branch-less, takes about 40-ish clock cycles for a back and forth conversion and ditches NaN or Inf for an extended range. That's the magical power of bit operations.
typedef unsigned short ushort;
typedef unsigned int uint;
uint as_uint(const float x) {
return *(uint*)&x;
}
float as_float(const uint x) {
return *(float*)&x;
}
float half_to_float(const ushort x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
const uint e = (x&0x7C00)>>10; // exponent
const uint m = (x&0x03FF)<<13; // mantissa
const uint v = as_uint((float)m)>>23; // evil log2 bit hack to count leading zeros in denormalized format
return as_float((x&0x8000)<<16 | (e!=0)*((e+112)<<23|m) | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000))); // sign : normalized : denormalized
}
ushort float_to_half(const float x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
const uint b = as_uint(x)+0x00001000; // round-to-nearest-even: add last bit after truncated mantissa
const uint e = (b&0x7F800000)>>23; // exponent
const uint m = b&0x007FFFFF; // mantissa; in line below: 0x007FF000 = 0x00800000-0x00001000 = decimal indicator flag - initial rounding
return (b&0x80000000)>>16 | (e>112)*((((e-112)<<10)&0x7C00)|m>>13) | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1) | (e>143)*0x7FFF; // sign : normalized : denormalized : saturate
}
Example for how to use it and to check that the conversion is correct:
#include <iostream>
using namespace std; // for the unqualified cout/endl used below
void print_bits(const ushort x) {
for(int i=15; i>=0; i--) {
cout << ((x>>i)&1);
if(i==15||i==10) cout << " ";
if(i==10) cout << " ";
}
cout << endl;
}
void print_bits(const float x) {
uint b = *(uint*)&x;
for(int i=31; i>=0; i--) {
cout << ((b>>i)&1);
if(i==31||i==23) cout << " ";
if(i==23) cout << " ";
}
cout << endl;
}
int main() {
const float x = 1.0f;
const ushort x_compressed = float_to_half(x);
const float x_decompressed = half_to_float(x_compressed);
print_bits(x);
print_bits(x_compressed);
print_bits(x_decompressed);
return 0;
}
Output:
0 01111111 00000000000000000000000
0 01111 0000000000
0 01111111 00000000000000000000000
I have published an adapted version of this FP32<->FP16 conversion algorithm in this paper, with a detailed description of how the bit-manipulation magic works. In the paper I also provide several ultra-fast conversion algorithms for different 16-bit Posit formats.
If you're sending a stream of information across, you could probably do better than this, especially if everything is in a consistent range, as your application seems to have.
Send a small header that just consists of a float32 minimum and maximum; then you can send your information across as a 16-bit interpolation value between the two. Since you also say that precision isn't much of an issue, you could even send 8 bits at a time.
Your value would be something like, at reconstruction time:
float t = _t / numeric_limits<unsigned short>::max(); // With casting, naturally ;)
float val = h.min + t * (h.max - h.min);
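The encoding side is the mirror image; a rough sketch (Header, encode and decode are hypothetical names, and values are assumed to stay within [min, max]):
#include <cmath>
#include <cstdint>
#include <limits>

struct Header { float min, max; };   // sent once per stream, as two float32 values

std::uint16_t encode(float val, const Header& h) {
    float t = (val - h.min) / (h.max - h.min);   // normalize to 0..1
    return (std::uint16_t)std::lround(t * std::numeric_limits<std::uint16_t>::max());
}

float decode(std::uint16_t t16, const Header& h) {
    float t = (float)t16 / std::numeric_limits<std::uint16_t>::max();
    return h.min + t * (h.max - h.min);
}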
Hope that helps.
-Tom
This question is already a bit old, but for the sake of completeness, you might also take a look at this paper for half-to-float and float-to-half conversion.
They use a branchless table-driven approach with relatively small look-up tables. It is completely IEEE-conformant and even beats Phernost's IEEE-conformant branchless conversion routines in performance (at least on my machine). But of course his code is much better suited to SSE and is not that prone to memory latency effects.
This conversion for 16-to-32-bit floating point is quite fast for cases where you do not have to account for infinities or NaNs, and can accept denormals-as-zero (DAZ). I.e. it is suitable for performance-sensitive calculations, but you should beware of division by zero if you expect to encounter denormals.
Note that this is most suitable for x86 or other platforms that have conditional moves or "set if" equivalents.
Strip the sign bit off the input
Align the most significant bit of the mantissa to the 22nd bit
Adjust the exponent bias
Set bits to all-zero if the input exponent is zero
Re-insert sign bit
The reverse applies for single-to-half-precision, with some additions.
void float32(float* __restrict out, const uint16_t in) {
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = in & 0x7fff; // Non-sign bits
t2 = in & 0x8000; // Sign bit
t3 = in & 0x7c00; // Exponent
t1 <<= 13; // Align mantissa on MSB
t2 <<= 16; // Shift sign bit into position
t1 += 0x38000000; // Adjust bias
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint32_t*)out) = t1;
};
void float16(uint16_t* __restrict out, const float in) {
uint32_t inu = *((uint32_t*)&in);
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = inu & 0x7fffffff; // Non-sign bits
t2 = inu & 0x80000000; // Sign bit
t3 = inu & 0x7f800000; // Exponent
t1 >>= 13; // Align mantissa on MSB
t2 >>= 16; // Shift sign bit into position
t1 -= 0x1c000; // Adjust bias
t1 = (t3 < 0x38800000) ? 0 : t1; // Flush-to-zero
t1 = (t3 > 0x47000000) ? 0x7bff : t1; // Clamp-to-max
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint16_t*)out) = t1;
};
Note that you can change the constant 0x7bff to 0x7c00 for it to overflow to infinity.
See GitHub for source code.
Most of the approaches described in the other answers here either do not round correctly on conversion from float to half, throw away subnormals which is a problem since 2**-14 becomes your smallest non-zero number, or do unfortunate things with Inf / NaN. Inf is also a problem because the largest finite number in half is a bit less than 2^16. OpenEXR was unnecessarily slow and complicated, last I looked at it. A fast correct approach will use the FPU to do the conversion, either as a direct instruction, or using the FPU rounding hardware to make the right thing happen. Any half to float conversion should be no slower than a 2^16 element lookup table.
The following are hard to beat:
On OS X / iOS, you can use vImageConvert_PlanarFtoPlanar16F and vImageConvert_Planar16FtoPlanarF. See Accelerate.framework.
Intel Ivy Bridge added the F16C instructions for this. See f16cintrin.h.
Similar instructions were added to the ARM ISA for Neon. See vcvt_f32_f16 and vcvt_f16_f32 in arm_neon.h. On iOS you will need to use the arm64 or armv7s arch to get access to them.
This code converts a 32-bit floating point number to 16-bits and back.
#include <x86intrin.h>
#include <iostream>
int main()
{
float f32;
unsigned short f16;
f32 = 3.14159265358979323846;
f16 = _cvtss_sh(f32, 0);
std::cout << f32 << std::endl;
f32 = _cvtsh_ss(f16);
std::cout << f32 << std::endl;
return 0;
}
I tested with the Intel icpc 16.0.2:
$ icpc a.cpp
g++ 7.3.0:
$ g++ -march=native a.cpp
and clang++ 6.0.0:
$ clang++ -march=native a.cpp
It prints:
$ ./a.out
3.14159
3.14062
Documentation about these intrinsics is available at:
https://software.intel.com/en-us/node/524287
https://clang.llvm.org/doxygen/f16cintrin_8h.html
The question is old and has already been answered, but I figured it would be worth mentioning an open source C++ library that can create 16bit IEEE compliant half precision floats and has a class that acts pretty much identically to the built in float type, but with 16 bits instead of 32. It is the "half" class of the OpenEXR library. The code is under a permissive BSD style license. I don't believe it has any dependencies outside of the standard library.
I had this exact same problem and found this link very helpful. Just import the file "ieeehalfprecision.c" into your project and use it like this:
float myFloat = 1.24;
uint16_t resultInHalf;
singles2halfp(&resultInHalf, &myFloat, 1); // it accepts a series of floats, so use 1 to input 1 float
// an example to revert the half float back
float resultInSingle;
halfp2singles(&resultInSingle, &resultInHalf, 1);
I also changed some code (see the comment by the author, James Tursa, in the link):
#define INT16_TYPE int16_t
#define UINT16_TYPE uint16_t
#define INT32_TYPE int32_t
#define UINT32_TYPE uint32_t
I have found an implementation of conversion from half-float to single-float format and back using AVX2. These are much faster than software implementations of these algorithms. I hope it will be useful.
32-bit float to 16-bit float conversion:
#include <immintrin.h"
inline void Float32ToFloat16(const float * src, uint16_t * dst)
{
_mm_storeu_si128((__m128i*)dst, _mm256_cvtps_ph(_mm256_loadu_ps(src), 0));
}
void Float32ToFloat16(const float * src, size_t size, uint16_t * dst)
{
assert(size >= 8);
size_t fullAlignedSize = size&~(32-1);
size_t partialAlignedSize = size&~(8-1);
size_t i = 0;
for (; i < fullAlignedSize; i += 32)
{
Float32ToFloat16(src + i + 0, dst + i + 0);
Float32ToFloat16(src + i + 8, dst + i + 8);
Float32ToFloat16(src + i + 16, dst + i + 16);
Float32ToFloat16(src + i + 24, dst + i + 24);
}
for (; i < partialAlignedSize; i += 8)
Float32ToFloat16(src + i, dst + i);
if(partialAlignedSize != size)
Float32ToFloat16(src + size - 8, dst + size - 8);
}
16-bit float to 32-bit float conversion:
#include <immintrin.h"
inline void Float16ToFloat32(const uint16_t * src, float * dst)
{
_mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)src)));
}
void Float16ToFloat32(const uint16_t * src, size_t size, float * dst)
{
assert(size >= 8);
size_t fullAlignedSize = size&~(32-1);
size_t partialAlignedSize = size&~(8-1);
size_t i = 0;
for (; i < fullAlignedSize; i += 32)
{
Float16ToFloat32(src + i + 0, dst + i + 0);
Float16ToFloat32(src + i + 8, dst + i + 8);
Float16ToFloat32(src + i + 16, dst + i + 16);
Float16ToFloat32(src + i + 24, dst + i + 24);
}
for (; i < partialAlignedSize; i += 8)
Float16ToFloat32(src + i, dst + i);
if (partialAlignedSize != size)
Float16ToFloat32(src + size - 8, dst + size - 8);
}
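A minimal driver for the routines above (a sketch; it assumes a CPU with F16C support and compilation with the matching flags, e.g. -mavx -mf16c):
#include <cstdint>
#include <cstdio>

int main()
{
    float src[16];
    uint16_t half[16];
    float dst[16];
    for (int i = 0; i < 16; ++i) src[i] = 0.5f * i;   // values exactly representable in half precision

    Float32ToFloat16(src, 16, half);   // 32-bit -> 16-bit
    Float16ToFloat32(half, 16, dst);   // 16-bit -> 32-bit
    for (int i = 0; i < 16; ++i) printf("%g -> %g\n", src[i], dst[i]);
    return 0;
}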
Thanks to the code for decimal to single-precision conversion; we can actually try to adapt the same code to half precision. However, it is not possible with the gcc C compiler, so do the following:
sudo apt install clang
Then try the following code:
// A C code to convert Decimal value to IEEE 16-bit floating point Half precision
#include <stdio.h>
void printBinary(int n, int i)
{
int k;
for (k = i - 1; k >= 0; k--) {
if ((n >> k) & 1)
printf("1");
else
printf("0");
}
}
typedef union {
__fp16 f;
struct
{
unsigned int mantissa : 10;
unsigned int exponent : 5;
unsigned int sign : 1;
} raw;
} myfloat;
// Driver Code
int main()
{
myfloat var;
var.f = 11;
printf("%d | ", var.raw.sign);
printBinary(var.raw.exponent, 5);
printf(" | ");
printBinary(var.raw.mantissa, 10);
printf("\n");
return 0;
}
Compile the code in your terminal
clang code_name.c -o code_name
./code_name
Here, __fp16 is a 2-byte floating-point data type supported by the Clang C compiler.