Modifying a function to use SSE intrinsics

I am trying to calculate the approximate value of the radical: sqrt(i + sqrt(i + sqrt(i + ...))) using SSE in order to get a speedup from vectorization (I also read that the SIMD square-root function runs approximately 4.7x faster than the innate FPU square-root function). However, I am having problems getting the same functionality in the vectorized version; I am getting the incorrect value and I'm not sure why.
My original function is this:
template <typename T>
T CalculateRadical( T tValue, T tEps = std::numeric_limits<T>::epsilon() )
{
static std::unordered_map<T,T> setResults;
auto it = setResults.find( tValue );
if( it != setResults.end() )
{
return it->second;
}
T tPrev = std::sqrt(tValue + std::sqrt(tValue)), tCurr = std::sqrt(tValue + tPrev);
// Keep iterating until we get convergence:
while( std::abs( tPrev - tCurr ) > tEps )
{
tPrev = tCurr;
tCurr = std::sqrt(tValue + tPrev);
}
setResults.insert( std::make_pair( tValue, tCurr ) );
return tCurr;
}
And the SIMD equivalent (when this template function is instantiated with T = float and given tEps = 0.0005f) I have written is:
// SSE intrinsics hard-coded function:
__m128 CalculateRadicals( __m128 values )
{
static std::unordered_map<float, __m128> setResults;
// Store our epsilon as a vector for quick comparison:
__declspec(align(16)) float flEps[4] = { 0.0005f, 0.0005f, 0.0005f, 0.0005f };
__m128 eps = _mm_load_ps( flEps );
union U {
__m128 vec;
float flArray[4];
};
U u;
u.vec = values;
float flFirstVal = u.flArray[0];
auto it = setResults.find( flFirstVal );
if( it != setResults.end( ) )
{
return it->second;
}
__m128 prev = _mm_sqrt_ps( _mm_add_ps( values, _mm_sqrt_ps( values ) ) );
__m128 curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );
while( _mm_movemask_ps( _mm_cmplt_ps( _mm_sub_ps( curr, prev ), eps ) ) != 0xF )
{
prev = curr;
curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );
}
setResults.insert( std::make_pair( flFirstVal, curr ) );
return curr;
}
I am calling the function in a loop using the following code:
long long N;
std::cin >> N;
float flExpectation = 0.0f;
long long iMultipleOf4 = (N / 4LL) * 4LL;
for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )
{
__declspec(align(16)) float flArray[4] = { static_cast<float>(i - 3), static_cast<float>(i - 2), static_cast<float>(i - 1), static_cast<float>(i) };
__m128 arg = _mm_load_ps( flArray );
__m128 vec = CalculateRadicals( arg );
float flSum = Sum( vec );
flExpectation += flSum;
}
for( long long i = iMultipleOf4; i < N; ++i )
{
flExpectation += CalculateRadical( static_cast<float>(i), 0.0005f );
}
flExpectation /= N;
I get the following outputs for input 5:
With SSE version: 2.20873
With FPU version: 1.69647
Where does the discrepancy come from, what am I doing wrong in the SIMD equivalent?
EDIT: I've realized that the Sum function is relevant here:
float Sum( __m128 vec1 )
{
float flTemp[4];
_mm_storeu_ps( flTemp, vec1 );
return flTemp[0] + flTemp[1] + flTemp[2] + flTemp[3];
}

SSE intrinsics can be pretty tedious sometimes...
But not here. You just got your loops wrong:
for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )
I doubt it's doing what you expected. If iMultipleOf4 is 4, then your function will compute with 4, 3, 2, 1 but not 0. And then your second loop redoes the computation with 4.
The two functions give the same results for me, and the loops give the same flExpectation after correction. There is still a small difference, though, probably because the scalar FPU and the SSE unit differ slightly in how they compute.
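One way to make the two loops cover every index exactly once is sketched below. It assumes the intended range is 0..N-1, which is consistent with the FPU output quoted in the question, and it uses a scalar stand-in for the four-lane call so that the index bookkeeping can be checked on its own. Radical, MeanScalar and MeanBlocked are hypothetical names, not part of the original code.

```cpp
#include <cassert>
#include <cmath>

// Scalar reference for sqrt(v + sqrt(v + ...)): iterate until two
// successive values differ by no more than eps.
float Radical(float v, float eps = 0.0005f) {
    float prev = std::sqrt(v + std::sqrt(v));
    float curr = std::sqrt(v + prev);
    while (std::fabs(prev - curr) > eps) {
        prev = curr;
        curr = std::sqrt(v + prev);
    }
    return curr;
}

// Mean over the indices 0..n-1 (assumed range), one value at a time.
float MeanScalar(long long n) {
    assert(n > 0);
    float sum = 0.0f;
    for (long long i = 0; i < n; ++i)
        sum += Radical(static_cast<float>(i));
    return sum / static_cast<float>(n);
}

// Same mean, split into blocks of four plus a scalar tail.
float MeanBlocked(long long n) {
    assert(n > 0);
    const long long multipleOf4 = (n / 4) * 4;
    float sum = 0.0f;
    // Blocks cover [0, multipleOf4): lanes i, i+1, i+2, i+3.
    for (long long i = 0; i < multipleOf4; i += 4)
        for (int lane = 0; lane < 4; ++lane)  // stand-in for one SIMD call
            sum += Radical(static_cast<float>(i + lane));
    // The tail covers [multipleOf4, n), so nothing is skipped or repeated.
    for (long long i = multipleOf4; i < n; ++i)
        sum += Radical(static_cast<float>(i));
    return sum / static_cast<float>(n);
}
```

With N = 5 both functions visit 0, 1, 2, 3, 4 and give roughly 1.696. In the SSE version the inner lane loop would become one CalculateRadicals call on { i, i+1, i+2, i+3 }.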

Related

Parallelization doubles the execution time

I'm trying to parallelize a loop with OpenMP but the result is that the user time spent (reported by /usr/bin/time) becomes twice as large compared to the unparallelized code. So the wall clock execution times of the parallelized and unparallelized code are approximately the same. Here is the code:
Real ComputeEPPMRMatrixElement7Opt4(
UniformGrid3d const & ugLeft, ColumnVector const & vLeft,
UniformGrid3d const & ugRight, ColumnVector const & vRight
)
{
assert( ugLeft.si == ugRight.si );
UniformGrid3d const ugInt = GridIntersection( ugLeft, ugRight );
if( ! EmptyGrid( ugInt ) )
{
Real rSum = 0.0;
Real const * const arLeft = vLeft.data( );
Real const * const arRight = vRight.data( );
#pragma omp parallel for reduction(+:rSum)
for( Integer nX = ugInt.tiMinX; nX <= ugInt.tiMaxX; nX++ )
{
for( Integer nY = ugInt.tiMinY; nY <= ugInt.tiMaxY; nY++ )
{
// We may do the following optimization because ugInt is the
// intersection of the left and the right grids.
// Actually it should be safe to remove the range checks from
// the two following function calls.
Integer const iLeft =
ugLeft.GetVectorIndexWithCheck( nX, nY, ugInt.tiMinZ );
Integer const iRight =
ugRight.GetVectorIndexWithCheck( nX, nY, ugInt.tiMinZ );
Real const * arLeft1 = &arLeft[ iLeft ];
Real const * arRight1 = &arRight[ iRight ];
for( Integer nZ = ugInt.tiMinZ; nZ <= ugInt.tiMaxZ; nZ++ )
{
Real const rLeft = *arLeft1++;
Real const rRight = *arRight1++;
rSum += rLeft * rRight;
}
}
}
Real const rScale = exp2( -3 * ugInt.si );
return rScale * rSum;
}
else
{
return 0.0;
}
}
Note that Real is an alias for double. What is wrong?

ifft results are different from original signal

The FFT works fine, but when I take the IFFT I always see the same graph. The results are complex, and the graph is always the same regardless of the original signal:
the real part is a -sin with period = frame size
the imaginary part is a -cos with the same period
Where could the problem be?
original signal:
IFFT real value (the pictures show only half of the frame):
The FFT algorithm that I use:
double** FFT(double** f, int s, bool inverse) {
if (s == 1) return f;
int sH = s / 2;
double** fOdd = new double*[sH];
double** fEven = new double*[sH];
for (int i = 0; i < sH; i++) {
int j = 2 * i;
fOdd[i] = f[j];
fEven[i] = f[j + 1];
}
double** sOdd = FFT(fOdd, sH, inverse);
double** sEven = FFT(fEven, sH, inverse);
double**spectr = new double*[s];
double arg = inverse ? DoublePI / s : -DoublePI / s;
double*oBase = new double[2]{ cos(arg),sin(arg) };
double*o = new double[2]{ 1,0 };
for (int i = 0; i < sH; i++) {
double* sO1 = Mul(o, sOdd[i]);
spectr[i] = Sum(sEven[i], sO1);
spectr[i + sH] = Dif(sEven[i], sO1);
o = Mul(o, oBase);
}
return spectr;
}

The "butterfly" portion is applying the coefficients incorrectly:
for (int i = 0; i < sH; i++) {
double* sO1 = sOdd[i];
double* sE1 = Mul(o, sEven[i]);
spectr[i] = Sum(sO1, sE1);
spectr[i + sH] = Dif(sO1, sE1);
o = Mul(o, oBase);
}
Side Note:
I kept your notation but it makes things confusing:
fOdd has indexes 0, 2, 4, 6, ... so it should be fEven
fEven has indexes 1, 3, 5, 7, ... so it should be fOdd
really sOdd should be sLower and sEven should be sUpper since they correspond to the 0:s/2 and s/2:s-1 elements of the spectrum respectively:
sLower = FFT(fEven, sH, inverse); // fEven is 0, 2, 4, ...
sUpper = FFT(fOdd, sH, inverse); // fOdd is 1, 3, 5, ...
Then the butterfly becomes:
for (int i = 0; i < sH; i++) {
double* sL1 = sLower[i];
double* sU1 = Mul(o, sUpper[i]);
spectr[i] = Sum(sL1, sU1);
spectr[i + sH] = Dif(sL1, sU1);
o = Mul(o, oBase);
}
When written like this it is easier to compare to this pseudocode example on Wikipedia.
And @Dai is correct: you are going to leak a lot of memory.

Regarding the memory, you can use std::vector to encapsulate dynamically-allocated arrays and to ensure they're deallocated when execution leaves scope. You could use unique_ptr<double[]>, but the performance gains are not worth it IMO and you lose the safety of the at() method.
(Based on @Robb's answer)
A few other tips:
Avoid cryptic identifiers - programs should be readable, and names like "f" and "s" make your program harder to read and maintain.
Type-based Hungarian notation is frowned upon as modern editors show type information automatically so it adds unnecessary complication to identifier names.
Use size_t for indexes, not int
The STL is your friend, use it!
Preemptively prevent bugs by using const to prevent accidental mutation of read-only data.
Like so:
#include <cmath> // cos, sin
#include <memory> // unique_ptr, make_unique
#include <vector>
using namespace std;
vector<double> fastFourierTransform(const vector<double> signal, const bool inverse) {
if( signal.size() < 2 ) return signal;
const size_t half = signal.size() / 2;
vector<double> lower; lower.reserve( half );
vector<double> upper; upper.reserve( half );
bool isEven = true;
for( size_t i = 0; i < signal.size(); i++ ) {
if( isEven ) lower.push_back( signal.at( i ) );
else upper.push_back( signal.at( i ) );
isEven = !isEven;
}
vector<double> lowerFft = fastFourierTransform( lower, inverse );
vector<double> upperFft = fastFourierTransform( upper, inverse );
vector<double> result;
result.reserve( signal.size() );
double arg = ( inverse ? 1 : -1 ) * ( DoublePI / signal.size() );
// Ideally these should be local `double` values passed directly into `Mul`.
unique_ptr<double[]> oBase = make_unique<double[]>( 2 );
oBase[0] = cos(arg);
oBase[1] = sin(arg);
unique_ptr<double[]> o = make_unique<double[]>( 2 );
o[0] = 1; // the running twiddle factor starts at 1 + 0i, as in the question's code
o[1] = 0;
for( size_t i = 0; i < half; i++ ) {
double* lower1 = lower.at( i );
double* upper1 = Mul( o, upper.at( i ) );
result.at( i ) = Sum( lower1, upper1 );
result.at( i + half ) = Dif( lower1, upper1 );
o = Mul( o, oBase );
}
// My knowledge of move-semantics of STL containers is a bit rusty - so there's probably a better way to return the output 'result' vector.
return result;
}
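The listing above stores the samples in a vector<double> but then treats them as double* pairs, so it will not compile as written. A minimal sketch of the same recursive structure with std::complex is given here for comparison. It assumes the input length is a power of two, that DoublePI in the question means 2π, and that the caller divides by N after the inverse transform. Fft and Complex are names chosen for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 FFT; the input length is assumed to be a power of two.
// The inverse transform is left unscaled, so the caller divides by N.
std::vector<Complex> Fft(const std::vector<Complex>& signal, bool inverse) {
    const std::size_t n = signal.size();
    if (n < 2) return signal;
    assert(n % 2 == 0);
    const std::size_t half = n / 2;

    // Split into even-indexed and odd-indexed samples.
    std::vector<Complex> even, odd;
    even.reserve(half);
    odd.reserve(half);
    for (std::size_t i = 0; i < half; ++i) {
        even.push_back(signal[2 * i]);
        odd.push_back(signal[2 * i + 1]);
    }
    const std::vector<Complex> lower = Fft(even, inverse);
    const std::vector<Complex> upper = Fft(odd, inverse);

    const double twoPi = 2.0 * std::acos(-1.0);
    const double arg = (inverse ? twoPi : -twoPi) / static_cast<double>(n);
    const Complex step(std::cos(arg), std::sin(arg));
    Complex twiddle(1.0, 0.0);  // starts at 1 + 0i

    std::vector<Complex> result(n);
    for (std::size_t i = 0; i < half; ++i) {
        const Complex t = twiddle * upper[i];  // only the odd half is rotated
        result[i] = lower[i] + t;
        result[i + half] = lower[i] - t;
        twiddle *= step;
    }
    return result;
}
```

A round trip is a quick sanity check: Fft(Fft(x, false), true) divided element-wise by x.size() should reproduce x up to rounding error.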

Lambda function in accumulate

I'm trying to learn how to use lambda functions, and want to do something like:
Given a vector = {1,2,3,4,5}
I want the sum of pairwise sums = (1+2)+(2+3)+...
Below is my attempt, which is not working properly.
#include <vector>
#include <algorithm>
using namespace std;
vector <double> data = {1,10,100};
double mean = accumulate(data.begin(),data.end(),0.0);
double foo()
{
auto bar = accumulate(data.begin(),data.end(),0.0,[&](int k, int l){return (k+l);});
return bar
}
I tried changing the return statement to return (data.at(k)+data.at(l)), which didn't quite work.

Adding pairwise sums is the same as summing over everything twice except the first and last elements. No need for a fancy lambda.
auto result = std::accumulate(std::begin(data), std::end(data), 0.0)
* 2.0 - data.front() - data.back();
Or a little safer:
auto result = std::accumulate(std::begin(data), std::end(data), 0.0)
* 2.0 - (!data.empty() ? data.front() : 0) - (data.size() > 1 ? data.back() : 0);
If you insist on a lambda, you can move the doubling inside:
result = std::accumulate(std::begin(data), std::end(data), 0.0,
[](double lhs, double rhs){return lhs + 2.0*rhs;})
- data.front() - data.back();
Note that lhs within the lambda is the current sum, not the next two numbers in the sequence.
If you insist on doing all the work within the lambda, you can track an index by using generalized capture:
result = std::accumulate(std::begin(data), std::end(data), 0.0,
[currIndex = 0U, lastIndex = data.size()-1] (double lhs, double rhs) mutable
{
double result = lhs + rhs;
if (currIndex != 0 && currIndex != lastIndex)
result += rhs;
++currIndex;
return result;
});
Demo of all approaches
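As a cross-check on the closed form, here is a small sketch that computes the sum of adjacent-pair sums directly with std::inner_product and compares it with the "twice everything minus the ends" formula. PairwiseSum and PairwiseSumClosedForm are hypothetical names, and treating inputs with fewer than two elements as 0 is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sum of adjacent-pair sums: (d0 + d1) + (d1 + d2) + ...
// Fewer than two elements means there are no pairs, so the result is 0.
double PairwiseSum(const std::vector<double>& data) {
    if (data.size() < 2) return 0.0;
    // Walk data[0..n-2] and data[1..n-1] in lockstep: each pair is added
    // together (second functor) and the pair sums are accumulated (first).
    return std::inner_product(data.begin(), data.end() - 1, data.begin() + 1,
                              0.0, std::plus<double>(), std::plus<double>());
}

// Closed form: every element counts twice except the first and the last.
double PairwiseSumClosedForm(const std::vector<double>& data) {
    if (data.size() < 2) return 0.0;
    return 2.0 * std::accumulate(data.begin(), data.end(), 0.0)
           - data.front() - data.back();
}
```

For {1, 2, 3, 4, 5} both give 3 + 5 + 7 + 9 = 24, and for the question's {1, 10, 100} both give 121.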

You misunderstand how std::accumulate works. Let's say you have int array[], then accumulate does:
int value = initial_val;
value = lambda( value, array[0] );
value = lambda( value, array[1] );
...
return value;
this is pseudo code, but it should be pretty easy to understand how it works. So in your case std::accumulate does not seem to be applicable. You may write a loop, or create your own special accumulate function:
auto lambda = []( double a, double b ) { return a + b; }; // double, so the vector's values are not truncated
auto sum = 0.0;
for( auto it = data.begin(); it != data.end(); ++it ) {
auto itn = std::next( it );
if( itn == data.end() ) break;
sum += lambda( *it, *itn );
}

You could capture a variable in the lambda to keep the last value:
#include <vector>
#include <algorithm>
#include <numeric>
std::vector<double> data = {1,10,100};
double mean = accumulate(data.begin(), data.end(), 0.0);
double foo()
{
double last{0};
auto bar = accumulate(data.begin(), data.end(), 0.0, [&](auto k, auto l)
{
auto total = l + last;
last = l;
return total+k;
});
return bar;
}
int main()
{
auto val = foo();
}

You could use some sort of index, and add the next number.
size_t index = 1;
auto bar = accumulate(data.begin(), data.end(), 0.0, [&index, &data](double a, double b) {
if (index < data.size())
return a + b + data[index++];
else
return a + b;
});
Note you have a vector of doubles but are using ints to sum.

Evaluating the arithmetic intensity of a for loop

I am parallelizing the execution of the following loop on a CUDA GPU:
// define m, lp, N
for(int i=0; i<N; ++i){
float p, s;
int q;
s = m + sqrt( ARR1[ ARR2[i] ] )*ARR3[i];
if ( ARR4[2*i] <= ARR10[i] ){
if ( s > 0){
p = lp*s;
q = floor( ARR4[2*i+1]*ARR5[i]/p );
} else{
p = -lp/s;
q = -floor( ARR4[2*i+1]*ARR6[i] );
}
} else{
if ( s > 0){
p = lp/s;
q = -floor( ARR4[2*i+1]*ARR6[i] );
} else{
p = -lp*s;
q = floor( ARR4[2*i+1]*ARR5[i]/p );
}
}
if ( q != 0){
ARR7[i] = p;
ARR8[i] = q;
} else{
ARR7[i] = 0;
ARR8[i] = 0;
}
ARR9[i] = i;
}
I would like to evaluate its arithmetic intensity. m and lp are defined outside of the loop.
I count 11 memory operations: ARR2[i], ARR1[ARR2[i]], ARR3[i], ARR4[2*i], ARR4[2*i+1], ARR5[i], ARR6[i], ARR7[i], ARR8[i], ARR9[i], ARR10[i],
... and 9 floating-point operations (counting floor and sqrt as one FLOP each): m + sqrt( ARR1[ ARR2[i] ] )*ARR3[i] (3), p = lp*s or variations (1), q = floor( ARR4[2*i+1]*ARR5[i]/p ) or variations (5, including 2 for index calculation).
Since all array elements are 4 bytes long, this gives me an arithmetic intensity of 9/(4*11) = 0.2045 FLOP/byte. Is this correct? Am I counting memory and arithmetic operations correctly? In particular, I'm unsure whether the index calculation 2*i+1 should count towards the FLOP count, and whether the scalar values m and lp should count towards the data movement count (or are they kept in registers and therefore do not count? See the AXPY example on p. 16 here).
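The arithmetic behind that figure can be written out as a small helper. This only restates the question's own counts (9 FLOP, 11 memory operations, 4 bytes per element); whether those are the right counts is exactly what is being asked. ArithmeticIntensity is a hypothetical name.

```cpp
#include <cassert>

// Arithmetic intensity in FLOP per byte: floating-point operations divided
// by the bytes moved to and from memory per loop iteration.
constexpr double ArithmeticIntensity(int flops, int memoryOps, int bytesPerElement) {
    return static_cast<double>(flops) /
           (static_cast<double>(memoryOps) * bytesPerElement);
}
```

ArithmeticIntensity(9, 11, 4) evaluates 9 / 44. Dropping the two index-calculation operations from the FLOP count would give ArithmeticIntensity(7, 11, 4) instead.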

32-bit to 16-bit Floating Point Conversion

I need a cross-platform library/algorithm that will convert between 32-bit and 16-bit floating point numbers. I don't need to perform math with the 16-bit numbers; I just need to decrease the size of the 32-bit floats so they can be sent over the network. I am working in C++.
I understand how much precision I would be losing, but that's OK for my application.
The IEEE 16-bit format would be great.

Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it's branch-less. It makes use of the fact that -true == ~0 to perform branchless selections (GCC converts if statements into an unholy mess of conditional jumps, while Clang just converts them to conditional moves).
Update (2019-11-04): reworked to support single and double precision values with fully correct rounding. I also put a corresponding if statement above each branchless select as a comment for clarity. All incoming NaNs are converted to the base quiet NaN for speed and sanity, as there is no way to reliably convert an embedded NaN message between formats.
#include <cstdint> // uint32_t, uint64_t, etc.
#include <cstring> // memcpy
#include <climits> // CHAR_BIT
#include <limits> // numeric_limits
#include <type_traits> // is_integral_v, is_floating_point_v
#include <utility> // forward
namespace std
{
template< typename T , typename U >
T bit_cast( U&& u ) {
static_assert( sizeof( T ) == sizeof( U ) );
union { T t; }; // prevent construction
std::memcpy( &t, &u, sizeof( t ) );
return t;
}
} // namespace std
template< typename T > struct native_float_bits;
template<> struct native_float_bits< float >{ using type = std::uint32_t; };
template<> struct native_float_bits< double >{ using type = std::uint64_t; };
template< typename T > using native_float_bits_t = typename native_float_bits< T >::type;
static_assert( sizeof( float ) == sizeof( native_float_bits_t< float > ) );
static_assert( sizeof( double ) == sizeof( native_float_bits_t< double > ) );
template< typename T, int SIG_BITS, int EXP_BITS >
struct raw_float_type_info {
using raw_type = T;
static constexpr int sig_bits = SIG_BITS;
static constexpr int exp_bits = EXP_BITS;
static constexpr int bits = sig_bits + exp_bits + 1;
static_assert( std::is_integral_v< raw_type > );
static_assert( sig_bits >= 0 );
static_assert( exp_bits >= 0 );
static_assert( bits <= sizeof( raw_type ) * CHAR_BIT );
static constexpr int exp_max = ( 1 << exp_bits ) - 1;
static constexpr int exp_bias = exp_max >> 1;
static constexpr raw_type sign = raw_type( 1 ) << ( bits - 1 );
static constexpr raw_type inf = raw_type( exp_max ) << sig_bits;
static constexpr raw_type qnan = inf | ( inf >> 1 );
static constexpr auto abs( raw_type v ) { return raw_type( v & ( sign - 1 ) ); }
static constexpr bool is_nan( raw_type v ) { return abs( v ) > inf; }
static constexpr bool is_inf( raw_type v ) { return abs( v ) == inf; }
static constexpr bool is_zero( raw_type v ) { return abs( v ) == 0; }
};
using raw_flt16_type_info = raw_float_type_info< std::uint16_t, 10, 5 >;
using raw_flt32_type_info = raw_float_type_info< std::uint32_t, 23, 8 >;
using raw_flt64_type_info = raw_float_type_info< std::uint64_t, 52, 11 >;
//using raw_flt128_type_info = raw_float_type_info< uint128_t, 112, 15 >;
template< typename T, int SIG_BITS = std::numeric_limits< T >::digits - 1,
int EXP_BITS = sizeof( T ) * CHAR_BIT - SIG_BITS - 1 >
struct float_type_info
: raw_float_type_info< native_float_bits_t< T >, SIG_BITS, EXP_BITS > {
using flt_type = T;
static_assert( std::is_floating_point_v< flt_type > );
};
template< typename E >
struct raw_float_encoder
{
using enc = E;
using enc_type = typename enc::raw_type;
template< bool DO_ROUNDING, typename F >
static auto encode( F value )
{
using flt = float_type_info< F >;
using raw_type = typename flt::raw_type;
static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
static constexpr auto bit_diff = flt::bits - enc::bits;
static constexpr auto do_rounding = DO_ROUNDING && sig_diff > 0;
static constexpr auto bias_mul = raw_type( enc::exp_bias ) << flt::sig_bits;
if constexpr( !do_rounding ) { // fix exp bias
// when not rounding, fix exp first to avoid mixing float and binary ops
value *= std::bit_cast< F >( bias_mul );
}
auto bits = std::bit_cast< raw_type >( value );
auto sign = bits & flt::sign; // save sign
bits ^= sign; // clear sign
auto is_nan = flt::inf < bits; // compare before rounding!!
if constexpr( do_rounding ) {
static constexpr auto min_norm = raw_type( flt::exp_bias - enc::exp_bias + 1 ) << flt::sig_bits;
static constexpr auto sub_rnd = enc::exp_bias < sig_diff
? raw_type( 1 ) << ( flt::sig_bits - 1 + enc::exp_bias - sig_diff )
: raw_type( enc::exp_bias - sig_diff ) << flt::sig_bits;
static constexpr auto sub_mul = raw_type( flt::exp_bias + sig_diff ) << flt::sig_bits;
bool is_sub = bits < min_norm;
auto norm = std::bit_cast< F >( bits );
auto subn = norm;
subn *= std::bit_cast< F >( sub_rnd ); // round subnormals
subn *= std::bit_cast< F >( sub_mul ); // correct subnormal exp
norm *= std::bit_cast< F >( bias_mul ); // fix exp bias
bits = std::bit_cast< raw_type >( norm );
bits += ( bits >> sig_diff ) & 1; // add tie breaking bias
bits += ( raw_type( 1 ) << ( sig_diff - 1 ) ) - 1; // round up to half
//if( is_sub ) bits = std::bit_cast< raw_type >( subn );
bits ^= -is_sub & ( std::bit_cast< raw_type >( subn ) ^ bits );
}
bits >>= sig_diff; // truncate
//if( enc::inf < bits ) bits = enc::inf; // fix overflow
bits ^= -( enc::inf < bits ) & ( enc::inf ^ bits );
//if( is_nan ) bits = enc::qnan;
bits ^= -is_nan & ( enc::qnan ^ bits );
bits |= sign >> bit_diff; // restore sign
return enc_type( bits );
}
template< typename F >
static F decode( enc_type value )
{
using flt = float_type_info< F >;
using raw_type = typename flt::raw_type;
static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
static constexpr auto bit_diff = flt::bits - enc::bits;
static constexpr auto bias_mul = raw_type( 2 * flt::exp_bias - enc::exp_bias ) << flt::sig_bits;
raw_type bits = value;
auto sign = bits & enc::sign; // save sign
bits ^= sign; // clear sign
auto is_norm = bits < enc::inf;
bits = ( sign << bit_diff ) | ( bits << sig_diff );
auto val = std::bit_cast< F >( bits ) * std::bit_cast< F >( bias_mul );
bits = std::bit_cast< raw_type >( val );
//if( !is_norm ) bits |= flt::inf;
bits |= -!is_norm & flt::inf;
return std::bit_cast< F >( bits );
}
};
using flt16_encoder = raw_float_encoder< raw_flt16_type_info >;
template< typename F >
auto quick_encode_flt16( F && value )
{ return flt16_encoder::encode< false >( std::forward< F >( value ) ); }
template< typename F >
auto encode_flt16( F && value )
{ return flt16_encoder::encode< true >( std::forward< F >( value ) ); }
template< typename F = float, typename X >
auto decode_flt16( X && value )
{ return flt16_encoder::decode< F >( std::forward< X >( value ) ); }
Of course full IEEE support isn't always needed. If your values don't require logarithmic resolution approaching zero, then linearizing them to a fixed point format is much faster, as was already mentioned.

Half to float:
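uint32_t bits = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);
float f = *((float*)&bits); // reinterpret the bit pattern; assigning the integer directly would convert its value instead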
Float to half:
uint32_t x = *((uint32_t*)&f);
uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);
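The two one-liners can be wrapped into functions as sketched here, using memcpy instead of pointer casts to move the bit pattern. This is only a sketch under strong assumptions: the input must be a normal value whose magnitude lies within the normal half-precision range. Zero, subnormals, Inf and NaN are not handled, and the mantissa is truncated rather than rounded. FloatToHalfNaive and HalfToFloatNaive are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Float -> half by bit shifting. Assumes a normal value within the normal
// half range; zero, subnormals, Inf and NaN are NOT handled, and the low
// 13 mantissa bits are truncated rather than rounded.
std::uint16_t FloatToHalfNaive(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);  // copy the bit pattern, no pointer cast
    return static_cast<std::uint16_t>(
        ((x >> 16) & 0x8000u) |                                  // sign
        ((((x & 0x7f800000u) - 0x38000000u) >> 13) & 0x7c00u) |  // rebias exponent 127 -> 15
        ((x >> 13) & 0x03ffu));                                  // top 10 mantissa bits
}

// Half -> float under the same assumptions.
float HalfToFloatNaive(std::uint16_t h) {
    const std::uint32_t w = h;
    const std::uint32_t bits = ((w & 0x8000u) << 16) |             // sign
                               (((w & 0x7c00u) + 0x1C000u) << 13) |  // rebias exponent 15 -> 127
                               ((w & 0x03ffu) << 13);              // mantissa
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, 1.0f maps to 0x3C00 and -2.5f maps to 0xC100, and both convert back exactly because they need no more than 10 mantissa bits.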

std::frexp extracts the significand and exponent from normal floats or doubles -- then you need to decide what to do with exponents that are too large to fit in a half-precision float (saturate...?), adjust accordingly, and put the half-precision number together. This article has C source code to show you how to perform the conversion.

Given your needs (-1000, 1000), perhaps it would be better to use a fixed-point representation.
//change 20000 to SHORT_MAX if you don't mind whole numbers
//being turned into fractional ones
const int compact_range = 20000;
short compactFloat(double input) {
return round(input * compact_range / 1000);
}
double expandToFloat(short input) {
return ((double)input) * 1000 / compact_range;
}
This will give you accuracy to the nearest 0.05. If you change 20000 to SHORT_MAX you'll get a bit more accuracy but some whole numbers will end up as decimals on the other end.

Why so over-complicated? My implementation does not need any additional library, complies with the IEEE-754 FP16 format, manages both normalized and denormalized numbers, is branch-less, takes about 40-ish clock cycles for a back and forth conversion and ditches NaN or Inf for an extended range. That's the magical power of bit operations.
typedef unsigned short ushort;
typedef unsigned int uint;
uint as_uint(const float x) {
return *(uint*)&x;
}
float as_float(const uint x) {
return *(float*)&x;
}
float half_to_float(const ushort x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
const uint e = (x&0x7C00)>>10; // exponent
const uint m = (x&0x03FF)<<13; // mantissa
const uint v = as_uint((float)m)>>23; // evil log2 bit hack to count leading zeros in denormalized format
return as_float((x&0x8000)<<16 | (e!=0)*((e+112)<<23|m) | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000))); // sign : normalized : denormalized
}
ushort float_to_half(const float x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
const uint b = as_uint(x)+0x00001000; // round-to-nearest-even: add last bit after truncated mantissa
const uint e = (b&0x7F800000)>>23; // exponent
const uint m = b&0x007FFFFF; // mantissa; in line below: 0x007FF000 = 0x00800000-0x00001000 = decimal indicator flag - initial rounding
return (b&0x80000000)>>16 | (e>112)*((((e-112)<<10)&0x7C00)|m>>13) | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1) | (e>143)*0x7FFF; // sign : normalized : denormalized : saturate
}
Example for how to use it and to check that the conversion is correct:
#include <iostream>
using namespace std; // cout and endl are used unqualified below
void print_bits(const ushort x) {
for(int i=15; i>=0; i--) {
cout << ((x>>i)&1);
if(i==15||i==10) cout << " ";
if(i==10) cout << " ";
}
cout << endl;
}
void print_bits(const float x) {
uint b = *(uint*)&x;
for(int i=31; i>=0; i--) {
cout << ((b>>i)&1);
if(i==31||i==23) cout << " ";
if(i==23) cout << " ";
}
cout << endl;
}
int main() {
const float x = 1.0f;
const ushort x_compressed = float_to_half(x);
const float x_decompressed = half_to_float(x_compressed);
print_bits(x);
print_bits(x_compressed);
print_bits(x_decompressed);
return 0;
}
Output:
0 01111111 00000000000000000000000
0 01111 0000000000
0 01111111 00000000000000000000000
I have published an adapted version of this FP32<->FP16 conversion algorithm in this paper with detailed description on how the bit manipulation magic works. In this paper I also provide several ultra-fast conversion algorithms for different 16-bit Posit formats.

If you're sending a stream of information across, you could probably do better than this, especially if everything is in a consistent range, as your application seems to have.
Send a small header, that just consists of a float32 minimum and maximum, then you can send across your information as a 16 bit interpolation value between the two. As you also say that precision isn't much of an issue, you could even send 8bits at a time.
Your value would be something like, at reconstruction time:
float t = _t / numeric_limits<unsigned short>::max(); // With casting, naturally ;)
float val = h.min + t * (h.max - h.min);
Hope that helps.
-Tom
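A minimal sketch of the min/max header idea follows, assuming one header per batch and 16-bit codes. RangeHeader, Packed, Pack and Unpack are hypothetical names, and the struct layout is only an illustration, not a wire format.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Header sent once per batch (hypothetical layout).
struct RangeHeader {
    float min;
    float max;
};

struct Packed {
    RangeHeader header;
    std::vector<std::uint16_t> values;
};

// Map each float to a 16-bit position between the batch minimum and maximum.
Packed Pack(const std::vector<float>& input) {
    assert(!input.empty());
    const auto bounds = std::minmax_element(input.begin(), input.end());
    Packed out{{*bounds.first, *bounds.second}, {}};
    const float span = out.header.max - out.header.min;
    const float scale = std::numeric_limits<std::uint16_t>::max();
    out.values.reserve(input.size());
    for (float v : input) {
        // A zero span means all inputs are equal; any code decodes to min.
        const float t = span > 0.0f ? (v - out.header.min) / span : 0.0f;
        out.values.push_back(static_cast<std::uint16_t>(std::lround(t * scale)));
    }
    return out;
}

// Reconstruction, as in the two lines above but with the casts spelled out.
float Unpack(const RangeHeader& h, std::uint16_t code) {
    const float t = static_cast<float>(code) /
                    static_cast<float>(std::numeric_limits<std::uint16_t>::max());
    return h.min + t * (h.max - h.min);
}
```

The reconstruction error is bounded by about half a step, (max - min) / 65535 / 2, so a narrower range gives better precision for the same 16 bits.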

This question is already a bit old, but for the sake of completeness, you might also take a look at this paper for half-to-float and float-to-half conversion.
They use a branchless table-driven approach with relatively small look-up tables. It is completely IEEE-conformant and even beats Phernost's IEEE-conformant branchless conversion routines in performance (at least on my machine). But of course his code is much better suited to SSE and is not that prone to memory latency effects.

This conversion for 16-to-32-bit floating point is quite fast for cases where you do not have to account for infinities or NaNs, and can accept denormals-as-zero (DAZ). I.e. it is suitable for performance-sensitive calculations, but you should beware of division by zero if you expect to encounter denormals.
Note that this is most suitable for x86 or other platforms that have conditional moves or "set if" equivalents.
Strip the sign bit off the input
Align the most significant bit of the mantissa to the 22nd bit
Adjust the exponent bias
Set bits to all-zero if the input exponent is zero
Re-insert sign bit
The reverse applies for single-to-half-precision, with some additions.
void float32(float* __restrict out, const uint16_t in) {
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = in & 0x7fff; // Non-sign bits
t2 = in & 0x8000; // Sign bit
t3 = in & 0x7c00; // Exponent
t1 <<= 13; // Align mantissa on MSB
t2 <<= 16; // Shift sign bit into position
t1 += 0x38000000; // Adjust bias
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint32_t*)out) = t1;
};
void float16(uint16_t* __restrict out, const float in) {
uint32_t inu = *((uint32_t*)&in);
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = inu & 0x7fffffff; // Non-sign bits
t2 = inu & 0x80000000; // Sign bit
t3 = inu & 0x7f800000; // Exponent
t1 >>= 13; // Align mantissa on MSB
t2 >>= 16; // Shift sign bit into position
t1 -= 0x1c000; // Adjust bias
t1 = (t3 < 0x38800000) ? 0 : t1; // Flush-to-zero: below 2^-14, the smallest normal half
t1 = (t3 > 0x47000000) ? 0x7bff : t1; // Clamp-to-max: above 2^15, the largest half exponent (0x8e << 23)
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint16_t*)out) = t1;
};
Note that you can change the constant 0x7bff to 0x7c00 for it to overflow to infinity.
See GitHub for source code.

Most of the approaches described in the other answers here either do not round correctly on conversion from float to half, throw away subnormals which is a problem since 2**-14 becomes your smallest non-zero number, or do unfortunate things with Inf / NaN. Inf is also a problem because the largest finite number in half is a bit less than 2^16. OpenEXR was unnecessarily slow and complicated, last I looked at it. A fast correct approach will use the FPU to do the conversion, either as a direct instruction, or using the FPU rounding hardware to make the right thing happen. Any half to float conversion should be no slower than a 2^16 element lookup table.
The following are hard to beat:
On OS X / iOS, you can use vImageConvert_PlanarFtoPlanar16F and vImageConvert_Planar16FtoPlanarF. See Accelerate.framework.
Intel Ivy Bridge added instructions for this (F16C). See f16cintrin.h.
Similar instructions were added to the ARM ISA for Neon. See vcvt_f32_f16 and vcvt_f16_f32 in arm_neon.h. On iOS you will need to use the arm64 or armv7s arch to get access to them.

This code converts a 32-bit floating point number to 16-bits and back.
#include <x86intrin.h>
#include <iostream>
int main()
{
float f32;
unsigned short f16;
f32 = 3.14159265358979323846;
f16 = _cvtss_sh(f32, 0);
std::cout << f32 << std::endl;
f32 = _cvtsh_ss(f16);
std::cout << f32 << std::endl;
return 0;
}
I tested with the Intel icpc 16.0.2:
$ icpc a.cpp
g++ 7.3.0:
$ g++ -march=native a.cpp
and clang++ 6.0.0:
$ clang++ -march=native a.cpp
It prints:
$ ./a.out
3.14159
3.14062
Documentation about these intrinsics is available at:
https://software.intel.com/en-us/node/524287
https://clang.llvm.org/doxygen/f16cintrin_8h.html

The question is old and has already been answered, but I figured it would be worth mentioning an open source C++ library that can create 16bit IEEE compliant half precision floats and has a class that acts pretty much identically to the built in float type, but with 16 bits instead of 32. It is the "half" class of the OpenEXR library. The code is under a permissive BSD style license. I don't believe it has any dependencies outside of the standard library.

I had this same exact problem, and found this link very helpful. Just import the file "ieeehalfprecision.c" into your project and use it like this :
float myFloat = 1.24;
uint16_t resultInHalf;
singles2halfp(&resultInHalf, &myFloat, 1); // it accepts a series of floats, so use 1 to input 1 float
// an example to revert the half float back
float resultInSingle;
halfp2singles(&resultInSingle, &resultInHalf, 1);
I also change some code (See the comment by the author (James Tursa) in the link) :
#define INT16_TYPE int16_t
#define UINT16_TYPE uint16_t
#define INT32_TYPE int32_t
#define UINT32_TYPE uint32_t

I have found an implementation of the conversion from half-float to single-float format and back using AVX2 (the conversion intrinsics themselves belong to the F16C extension). It is much faster than a software implementation of these algorithms. I hope it will be useful.
32-bit float to 16-bit float conversion:
#include <immintrin.h>
#include <cassert> // assert
#include <cstddef> // size_t
#include <cstdint> // uint16_t
inline void Float32ToFloat16(const float * src, uint16_t * dst)
{
_mm_storeu_si128((__m128i*)dst, _mm256_cvtps_ph(_mm256_loadu_ps(src), 0));
}
void Float32ToFloat16(const float * src, size_t size, uint16_t * dst)
{
assert(size >= 8);
size_t fullAlignedSize = size&~(32-1);
size_t partialAlignedSize = size&~(8-1);
size_t i = 0;
for (; i < fullAlignedSize; i += 32)
{
Float32ToFloat16(src + i + 0, dst + i + 0);
Float32ToFloat16(src + i + 8, dst + i + 8);
Float32ToFloat16(src + i + 16, dst + i + 16);
Float32ToFloat16(src + i + 24, dst + i + 24);
}
for (; i < partialAlignedSize; i += 8)
Float32ToFloat16(src + i, dst + i);
if(partialAlignedSize != size)
Float32ToFloat16(src + size - 8, dst + size - 8);
}
16-bit float to 32-bit float conversion:
#include <immintrin.h>
inline void Float16ToFloat32(const uint16_t * src, float * dst)
{
_mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)src)));
}
void Float16ToFloat32(const uint16_t * src, size_t size, float * dst)
{
assert(size >= 8);
size_t fullAlignedSize = size&~(32-1);
size_t partialAlignedSize = size&~(8-1);
size_t i = 0;
for (; i < fullAlignedSize; i += 32)
{
Float16ToFloat32(src + i + 0, dst + i + 0);
Float16ToFloat32(src + i + 8, dst + i + 8);
Float16ToFloat32(src + i + 16, dst + i + 16);
Float16ToFloat32(src + i + 24, dst + i + 24);
}
for (; i < partialAlignedSize; i += 8)
Float16ToFloat32(src + i, dst + i);
if (partialAlignedSize != size)
Float16ToFloat32(src + size - 8, dst + size - 8);
}

Thanks to the code for decimal to single precision.
We can try to adapt the same code to half precision. However, it is not possible with the gcc C compiler, so do the following:
sudo apt install clang
Then try the following code
// A C code to convert Decimal value to IEEE 16-bit floating point Half precision
#include <stdio.h>
void printBinary(int n, int i)
{
int k;
for (k = i - 1; k >= 0; k--) {
if ((n >> k) & 1)
printf("1");
else
printf("0");
}
}
typedef union {
__fp16 f;
struct
{
unsigned int mantissa : 10;
unsigned int exponent : 5;
unsigned int sign : 1;
} raw;
} myfloat;
// Driver Code
int main()
{
myfloat var;
var.f = 11;
printf("%d | ", var.raw.sign);
printBinary(var.raw.exponent, 5);
printf(" | ");
printBinary(var.raw.mantissa, 10);
printf("\n");
return 0;
}
Compile the code in your terminal
clang code_name.c -o code_name
./code_name
Here __fp16 is a 2-byte float data type supported by the clang C compiler.
