Implementing a half precision floating point number in C++

I am trying to implement a simple half precision floating point type, purely for storage purposes (no arithmetic, converts to double implicitly), but I get weird behavior. I get completely wrong values for Half between -0.5 and 0.5. I also get a nasty "offset" in the values; for example, 0.8 is decoded as 0.7998.
I am very new to C++, so it would be great if you could point out my mistake and help me improve the accuracy a bit. I am also curious how portable this solution is. Thanks!
Here is the output - double value and actual decoded value from the half:
-1 -1
-0.9 -0.899902
-0.8 -0.799805
-0.7 -0.699951
-0.6 -0.599854
-0.5 -0.5
-0.4 -26208
-0.3 -19656
-0.2 -13104
-0.1 -6552
-1.38778e-16 -2560
0.1 6552
0.2 13104
0.3 19656
0.4 26208
0.5 32760
0.6 0.599854
0.7 0.699951
0.8 0.799805
0.9 0.899902
Here is the code so far:
#include <stdint.h>
#include <cmath>
#include <iostream>
using namespace std;
#define EXP 4
#define SIG 11
double normalizeS(uint v) {
return (0.5f * v / 2048 + 0.5f);
}
uint normalizeP(double v) {
return (uint)(2048 * (v - 0.5f) / 0.5f);
}
class Half {
struct Data {
unsigned short sign : 1;
unsigned short exponent : EXP;
unsigned short significant : SIG;
};
public:
Half() {}
Half(double d) { loadFromFloat(d); }
Half & operator = (long double d) {
loadFromFloat(d);
return *this;
}
operator double() {
long double sig = normalizeS(_d.significant);
if (_d.sign) sig = -sig;
return ldexp(sig, _d.exponent /*+ 1*/);
}
private:
void loadFromFloat(long double f) {
long double v;
int exp;
v = frexp(f, &exp);
v < 0 ? _d.sign = 1 : _d.sign = 0;
_d.exponent = exp/* - 1*/;
_d.significant = normalizeP(fabs(v));
}
Data _d;
};
int main() {
Half a[255];
double d = -1;
for (int i = 0; i < 20; ++i) {
a[i] = d;
cout << d << " " << a[i] << endl;
d += 0.1;
}
}

I ended up with a very simple (naive really) solution, capable of representing every value in the range I need: 0 - 64 with precision of 0.001.
Since the idea is to use it for storage, this is actually better because it allows conversion from and to double without any resolution loss. It is also faster. It actually loses some resolution (less than 16 bit) in the name of having a nicer minimum step so it can represent any of the input values without approximation - so in this case LESS is MORE. Using the full 2^10 resolution for the floating component would result in an odd step that cannot represent decimal values accurately.
class Half {
public:
Half() {}
Half(const double d) { load(d); }
operator double() const { return _d.i + ((double)_d.f / 1000); }
private:
struct Data {
unsigned short i : 6;
unsigned short f : 10;
};
void load(const double d) {
int i = d;
_d.i = i;
_d.f = round((d - i) * 1000);
}
Data _d;
};
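
A possible variation on the same idea, as a sketch only: instead of splitting the 16 bits into a 6-bit integer part and a 10-bit fractional part, the whole value could be stored as a count of thousandths in a single 16-bit integer. The class name Fixed16 and the clamping of out-of-range input are assumptions made for illustration and are not part of the code above.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical fixed-point storage type: the value is kept as a whole
// number of thousandths, so the representable range is 0 .. 65.535.
class Fixed16 {
public:
    Fixed16() : raw_(0) {}
    Fixed16(double d) : raw_(encode(d)) {}
    operator double() const { return raw_ / 1000.0; }

private:
    static std::uint16_t encode(double d) {
        double scaled = std::round(d * 1000.0);
        if (!(scaled > 0.0)) scaled = 0.0;        // clamps negatives and NaN (assumption)
        if (scaled > 65535.0) scaled = 65535.0;   // clamps values above the range (assumption)
        return static_cast<std::uint16_t>(scaled);
    }
    std::uint16_t raw_;
};
```

On platforms with IEEE-754 doubles, the division by 1000.0 is correctly rounded, so an in-range value with at most three decimals should convert back to the same double it was created from.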

My previous solution was wrong, sorry.
Try changing the exponent to signed; it worked here.
The problem is that frexp returns a negative exponent when abs(val) < 0.5, but you store it in an unsigned field, so it is read back as a large positive number. That is why the decoded values are huge when abs(val) < 0.5.
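
A minimal sketch of the signed-exponent fix, keeping the question's 1/4/11 bit layout. The class name HalfSketch, the value() accessor and the handling of zero and out-of-range exponents are assumptions for illustration; only finite inputs are considered.

```cpp
#include <cmath>

// Sketch: 1 sign bit, 4-bit SIGNED exponent, 11-bit significand.
class HalfSketch {
public:
    explicit HalfSketch(double d) {
        int e = 0;
        double m = std::frexp(d, &e);        // d == m * 2^e with 0.5 <= |m| < 1, or m == 0
        if (m == 0.0) { m = 0.5; e = -8; }   // zero has no encoding in this layout; use the smallest magnitude (assumption)
        if (e < -8) e = -8;                  // crude clamp to the 4-bit signed range (assumption)
        if (e > 7) e = 7;
        sign_ = std::signbit(d) ? 1 : 0;
        exponent_ = static_cast<short>(e);
        // Map |m| in [0.5, 1) onto 0 .. 2047, truncating the extra bits.
        significand_ = static_cast<unsigned short>((std::fabs(m) - 0.5) * 4096.0);
    }
    double value() const {
        const double m = 0.5 + significand_ / 4096.0;
        return std::ldexp(sign_ ? -m : m, exponent_);
    }

private:
    unsigned short sign_ : 1;
    signed short exponent_ : 4;       // signed, so negative exponents survive the round trip
    unsigned short significand_ : 11;
};
```

With this layout, HalfSketch(0.4).value() comes back close to 0.4 instead of a huge number. The small offset mentioned in the question remains (0.8 decodes as 0.7998046875), because only 11 bits of the significand are kept and the conversion truncates.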

Related

Calculating bit_Delta(double p1, double p2) in C++

I am interested in computing the function int bit_Delta(double p1, double p2) for two doubles in the range [0,1). The function returns the index where the two doubles deviate in binary after the dot.
For example, 1/2 = 0.10 in binary, and 3/4=0.11 in binary. So bit_delta(0.5, 0.75) should return 2 because their first digit (1) is the same, but the second is the first digit where they differ.
I've thought about calculating the mantissa and exponent separately for each double. If the exponents are different, I think I can do it, but if the exponents are the same, I don't know how to use the mantissa. Any ideas?

One way is to check whether both values are at or above 0.5 (both have the first bit set) or both are below 0.5 (both have the first bit clear).
If both are at or above 0.5, subtract 0.5 from each. Then halve the threshold and repeat until you reach a threshold where the two values are no longer on the same side of it.
#include <iostream>
int bit_delta(double a, double b)
{
if (a == b) return -1;
double treshold = 0.5;
for (int digit = 1; digit < 20; digit++, treshold /= 2)
{
if (a < treshold && b < treshold)
{
}
else if (a >= treshold && b >= treshold)
{
a -= treshold;
b -= treshold;
}
else
return digit;
}
return 20; //compare more than 20 digits does not make sense for a double
}
int main()
{
std::cout << bit_delta(0.25, 0.75) << std::endl;
std::cout << bit_delta(0.5, 0.75) << std::endl;
std::cout << bit_delta(0.7632, 0.751) << std::endl;
}
This code returns 1 2 7.

The following idea is based on conversion of the double values to fixed-point arithmetic, comparing the integers with XOR and counting the equal most significant bits.
#include <bit>
int bit_delta(double p1, double p2)
{
unsigned int i1 = static_cast<unsigned int>(p1 * 0x80000000U);
unsigned int i2 = static_cast<unsigned int>(p2 * 0x80000000U);
return std::countl_zero(i1 ^ i2);
}
It returns results between 1 .. 32.
With non-negative inputs p1 and p2 below 1, the MSB of i1 and i2 is always zero, which is needed to get the counting right.
By using unsigned long long int instead of unsigned int you could increase the precision to 53 (i.e. the precision of double numbers).
The function countl_zero - included with the bit header - was introduced in C++20.
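
The 64-bit variant mentioned above could look roughly like this. It is a sketch under the assumption that both inputs lie in [0, 1); the names bit_delta64 and leading_zeros64 are made up for this example, and a hand-written loop stands in for std::countl_zero so that the code does not depend on C++20.

```cpp
#include <cmath>
#include <cstdint>

// Counts the leading zero bits of a 64-bit value; returns 64 for zero.
// (C++20's std::countl_zero does the same job.)
int leading_zeros64(std::uint64_t v) {
    int n = 0;
    for (std::uint64_t mask = std::uint64_t{1} << 63;
         mask != 0 && (v & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// Assumes 0 <= p1 < 1 and 0 <= p2 < 1.
int bit_delta64(double p1, double p2) {
    // Scaling by 2^63 is exact and keeps the top bit of the integer clear.
    const auto i1 = static_cast<std::uint64_t>(std::ldexp(p1, 63));
    const auto i2 = static_cast<std::uint64_t>(std::ldexp(p2, 63));
    return leading_zeros64(i1 ^ i2);
}
```

Results range from 1 to 64; identical inputs, or inputs that only differ beyond the 63rd binary digit, give 64.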

how to improve the precision of computing float numbers?

I write a code snippet in Microsoft Visual Studio Community 2019 in C++ like this:
int m = 11;
int p = 3;
float step = 1.0 / (m - 2 * p);
The variable step ends up as 0.200003, but 0.2 is what I wanted. Is there any way to improve the precision?
This problem comes from a uniform knot vector, a concept in NURBS. You can think of it as just an array of numbers like this: U[] = {0, 0.2, 0.4, 0.6, 0.8, 1.0}; the span between two adjacent numbers is constant. The size of the knot vector can change according to some condition, but its values stay in the range [0, 1].
the whole function is:
typedef float NURBS_FLOAT;
void CreateKnotVector(int m, int p, bool clamped, NURBS_FLOAT* U)
{
if (clamped)
{
for (int i = 0; i <= p; i++)
{
U[i] = 0;
}
NURBS_FLOAT step = 1.0 / (m - 2 * p);
for (int i = p+1; i < m-p; i++)
{
U[i] = U[i - 1] + step;
}
for (int i = m-p; i <= m; i++)
{
U[i] = 1;
}
}
else
{
U[0] = 0;
NURBS_FLOAT step = 1.0 / m;
for (int i = 1; i <= m; i++)
{
U[i] = U[i - 1] + step;
}
}
}

Let's follow what's going on in your code:
The expression 1.0 / (m - 2 * p) yields 0.2, to which the closest representable double value is 0.200000000000000011102230246251565404236316680908203125. Notice how precise it is – to 16 significant decimal digits. It's because, due to 1.0 being a double literal, the denominator is promoted to double, and the whole calculation is done in double precision, thus yielding a double value.
The value obtained in the previous step is written to step, which has type float. So the value has to be rounded to the closest representable value, which happens to be 0.20000000298023223876953125.
So your cited result of 0.200003 is not what you should get. Instead, it should be closer to 0.200000003.
Is there any suggestion to improve the precision?
Yes. Store the value in a higher-precision variable. E.g., instead of float step, use double step. In this case the value you've calculated won't be rounded once more, so precision will be higher.
Can you get the exact 0.2 value to work with it in the subsequent calculations? With binary floating-point arithmetic, unfortunately, no. In binary, the number 0.2 is a periodic fraction:
$0.2_{10} = 0.\overline{0011}_2 = 0.0011\,0011\,0011\ldots_2$
See Is floating point math broken? question and its answers for more details.
If you really need decimal calculations, you should use a library solution, e.g. Boost's cpp_dec_float. Or, if you need arbitrary-precision calculations, you can use e.g. cpp_bin_float from the same library. Note that both variants will be orders of magnitude slower than using built-in C++ binary floating-point types.
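
To make the two rounding steps described above concrete, here is a small sketch. The helper names step_double and step_float are hypothetical; both evaluate the same expression as the question and differ only in the type that stores the result.

```cpp
// Both helpers evaluate 1.0 / (m - 2 * p) in double precision;
// only the type used to store the result differs.
double step_double(int m, int p) {
    return 1.0 / (m - 2 * p);
}

float step_float(int m, int p) {
    // The double result is rounded a second time, to the nearest float.
    return static_cast<float>(1.0 / (m - 2 * p));
}
```

For m = 11 and p = 3, step_float returns the same value as the literal 0.2f, which is about 3e-9 away from 0.2, while step_double returns the closest double to 0.2.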

When dealing with floating point math, a certain amount of rounding error is to be expected.
For starters, values like 0.2 aren't exactly represented by a float, or even a double:
std::cout << std::setprecision(60) << 0.2 << '\n';
// ^^^ It outputs something like: 0.200000000000000011102230246251565404236316680908203125
Besides, the errors may accumulate when a sequence of operations is performed on imprecise values. Some operations, like summation and subtraction, are more sensitive to this kind of error than others, so it is better to avoid them if possible.
That seems to be the case, here, where we can rewrite OP's function into something like the following
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>
#include <cassert>
#include <type_traits>
template <typename T = double>
auto make_knots(int m, int p = 0) // <- Note that I've changed the signature.
{
static_assert(std::is_floating_point_v<T>);
std::vector<T> knots(m + 1);
int range = m - 2 * p;
assert(range > 0);
for (int i = 1; i < range; i++) // interior knots only; the trailing ones are filled below
{
knots[i + p] = T(i) / range; // <- Less prone to accumulate rounding errors
}
std::fill(knots.begin() + m - p, knots.end(), 1.0);
return knots;
}
template <typename T>
void verify(std::vector<T> const& v)
{
bool sum_is_one = true;
for (int i = 0, j = v.size() - 1; i <= j; ++i, --j)
{
if (v[i] + v[j] != 1.0) // <- That's a bold request for a floating point type
{
sum_is_one = false;
break;
}
}
std::cout << (sum_is_one ? "\n" : "Rounding errors.\n");
}
int main()
{
// For presentation purposes only
std::cout << std::setprecision(60) << 0.2 << '\n';
std::cout << std::setprecision(60) << 0.4 << '\n';
std::cout << std::setprecision(60) << 0.6 << '\n';
std::cout << std::setprecision(60) << 0.8 << "\n\n";
auto k1 = make_knots(11, 3);
for (auto i : k1)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k1);
auto k2 = make_knots<float>(10);
for (auto i : k2)
{
std::cout << std::setprecision(60) << i << '\n';
}
verify(k2);
}

One solution to avoid drift (which I guess is your worry?) is to manually use rational numbers, for example in this case you might have:
// your input values for determining step
int m = 11;
int p = 3;
// pre-calculate any intermediate values, which won't have rounding issues
int divider = (m - 2 * p); // could be float or double instead of int
// input
int stepnumber = 1234; // could also be float or double instead of int
// output
float stepped_value = stepnumber * 1.0f / divider;
In other words, formulate your problem so that the step of your original code is always 1 (or whatever rational number you can represent exactly using 2 integers) internally, so there is no rounding issue. If you need to display the value to the user, compute it just for display: 1.0 / divider, rounded to a suitable number of digits.

Round a double to the closest and greater float

I want to round a big double number (>1e6) up to the closest float that is bigger than it, using C/C++.
I tried this, but I'm not sure it is always correct, and maybe there is a faster way to do it:
int main() {
// x is the double we want to round
double x = 100000000005.0;
double y = log10(x) - 7.0;
float a = pow(10.0, y);
float b = (float)x;
//c the closest round up float
float c = a + b;
printf("%.12f %.12f %.12f\n", c, b, x);
return 0;
}
Thank you.

Simply assigning the double to a float and converting back tells you whether the float is larger. If it is not, increment the float by one unit in the last place (for positive floats). If that still does not produce the expected result, the double is larger than anything a float supports, in which case the float should be set to Inf.
#include <cmath>
#include <limits>
float next(double a) {
    float b = a;
    if ((double)b > a) return b;
    return std::nextafter(b, std::numeric_limits<float>::infinity());
}
[Hack] A C version of next_after (on selected architectures) would be:
float next_after(float a) {
*(int*)&a += a < 0 ? -1 : 1;
return a;
}
Better way to do it is:
float next_after(float a) {
union { float a; int b; } c = { .a = a };
c.b += a < 0 ? -1 : 1;
return c.a;
}
Both of these self-made hacks ignore Infs and NaNs (and work on non-negative floats only). The math is based on the fact that the binary representations of floats are ordered. To get to the next representable float, one simply increments the binary representation by one.
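
In C++, the same bit-increment idea can be sketched with std::memcpy, which avoids both the pointer cast and the union. This assumes a 32-bit IEEE-754 float and a finite, non-negative input; the function name next_up_bits is made up for this example, and std::nextafter remains the portable choice.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Returns the next representable float above a.
// Assumes a is finite and non-negative (no NaN, no infinity, no negative values).
float next_up_bits(float a) {
    std::uint32_t bits;
    std::memcpy(&bits, &a, sizeof bits);   // copy the float's bit pattern out
    bits += 1;                             // representations of non-negative floats are ordered
    std::memcpy(&a, &bits, sizeof a);      // copy the incremented pattern back
    return a;
}
```

For example, the double 100000000005.0 from the question converts to the float 99999997952, and next_up_bits(99999997952.0f) yields 100000006144, the next float above it.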

If you use C99, you can use the nextafterf function.
#include <stdio.h>
#include <math.h>
#include <float.h>
int main(){
// x is the double we want to round
double x=100000000005.0;
float c = x;
if ((double)c <= x)
c = nextafterf(c, INFINITY); /* INFINITY rather than FLT_MAX, so that FLT_MAX itself can step up */
//c the closest round up float
printf("%.12f %.12f\n",c,x);
return 0;
}

C has a nice nextafter function which will help here;
float toBiggerFloat( const double a ) {
const float test = (float) a;
return ((double) test < a) ? nextafterf( test, INFINITY ) : test;
}
Here's a test script which shows it on all classes of number (positive/negative, normal/subnormal, infinite, NaN, -0): http://codepad.org/BQ3aqbae (the result is that it works fine on all of them).

A C routine to round a float to n significant digits?

Suppose I have a float. I would like to round it to a certain number of significant digits.
In my case n=6.
So say float was f=1.23456999;
round(f,6) would give 1.23457
f=123456.0001 would give 123456
Does anybody know of such a routine?
Here is a website that does it: http://ostermiller.org/calc/significant_figures.html

Multiply the number by a suitable scaling factor to move all significant digits to the left of the decimal point. Then round and finally reverse the operation:
#include <math.h>
double round_to_digits(double value, int digits)
{
if (value == 0.0) // otherwise it will return 'nan' due to the log10() of zero
return 0.0;
double factor = pow(10.0, digits - ceil(log10(fabs(value))));
return round(value * factor) / factor;
}
Tested: http://ideone.com/fH5ebt
But as @PascalCuoq pointed out, the rounded value may not be exactly representable as a floating point value.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *Round(float f, int d)
{
char buf[16];
sprintf(buf, "%.*g", d, f);
return strdup(buf);
}
int main(void)
{
char *r = Round(1.23456999, 6);
printf("%s\n", r);
free(r);
}
Output is:
1.23457

Something like this should work:
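double round_to_n_digits(double x, int n)
{
    if (x == 0.0) // log10(0) is -inf, which would turn the result into NaN
        return 0.0;
    // Scale so that exactly n significant digits sit left of the decimal point.
    double scale = pow(10.0, n - ceil(log10(fabs(x))));
    return round(x * scale) / scale;
}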
Alternatively you could just use sprintf/atof to convert to a string and back again:
double round_to_n_digits(double x, int n)
{
char buff[32];
sprintf(buff, "%.*g", n, x);
return atof(buff);
}
Test code for both of the above functions: http://ideone.com/oMzQZZ
Note that in some cases incorrect rounding may be observed, e.g. as pointed out by @clearScreen in the comments below, 13127.15 is rounded to 13127.1 instead of 13127.2.

This should work (except the noise given by floating point precision):
#include <stdio.h>
#include <math.h>
double dround(double a, int ndigits);
double dround(double a, int ndigits) {
int exp_base10 = (int)floor(log10(a)); /* floor, not round: keeps the mantissa in [1, 10) */
double man_base10 = a*pow(10.0,-exp_base10);
double factor = pow(10.0,-ndigits+1);
double truncated_man_base10 = man_base10 - fmod(man_base10,factor);
double rounded_remainder = fmod(man_base10,factor)/factor;
rounded_remainder = rounded_remainder > 0.5 ? 1.0*factor : 0.0;
return (truncated_man_base10 + rounded_remainder)*pow(10.0,exp_base10) ;
}
int main() {
double a = 1.23456999;
double b = 123456.0001;
printf("%12.12f\n",dround(a,6));
printf("%12.12f\n",dround(b,6));
return 0;
}

If you want to print a float to a string use simple sprintf(). For outputting it just to the console you can use printf():
printf("My float is %.6f", myfloat);
This will output your float with 6 decimal places. Note that this is not the same as 6 significant digits; for that, use the %.6g format instead.
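
The difference between the two printf conversions can be seen with a pair of small helpers. This is only a sketch; the names format_fixed and format_significant are hypothetical wrappers around std::snprintf.

```cpp
#include <cstdio>
#include <string>

// Formats value with a fixed number of digits after the decimal point ("%.*f").
std::string format_fixed(double value, int decimals) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", decimals, value);
    return std::string(buf);
}

// Formats value with a given number of significant digits ("%.*g").
std::string format_significant(double value, int digits) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", digits, value);
    return std::string(buf);
}
```

With the values from the question, format_significant(1.23456999, 6) gives "1.23457" and format_significant(123456.0001, 6) gives "123456", whereas format_fixed(123456.0001, 6) gives "123456.000100".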

Print to 16 significant digits.
double x = -1932970.8299999994;
char buff[100];
snprintf(buff, sizeof(buff), "%.16g", x);
std::string buffAsStdStr = buff;
std::cout << std::endl << buffAsStdStr ;

modf returns 1 as the fractional:

I have this static method; it receives a double and "cuts" its fractional tail, leaving two digits after the dot. It works almost all the time. I have noticed that when it receives 2.3 it turns it into 2.29. This does not happen for 0.3, 1.3, 3.3, 4.3 and 102.3.
The code basically multiplies the number by 100, uses modf, divides the integer value by 100 and returns it.
Here the code catches this one specific number and prints out:
static double dRound(double number) {
bool c = false;
if (number == 2.3)
c = true;
int factor = pow(10, 2);
number *= factor;
if (c) {
cout << " number *= factor : " << number << endl;
//number = 230;// When this is not marked as comment the code works well.
}
double returnVal;
if (c){
cout << " fractional : " << modf(number, &returnVal) << endl;
cout << " integer : " <<returnVal << endl;
}
modf(number, &returnVal);
return returnVal / factor;
}
it prints out:
number *= factor : 230
fractional : 1
integer : 229
Does anybody know why this is happening and how I can fix it?
Thank you, and have a great weekend.

Remember that floating point numbers cannot represent decimal numbers exactly. 2.3 * 100 actually gives 229.99999999999997. Thus modf returns 229 and 0.9999999999999716.
However, cout's default format only displays floating point numbers to 6 significant digits. So the 0.9999999999999716 is shown as 1, and the 229.99999999999997 is shown as 230.
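
This can be checked with a short sketch that formats the same values at different precisions. The helper names format_with_precision and fractional_part are made up for this example.

```cpp
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>

// Formats x the way an ostream would with the given precision
// (std::cout uses 6 significant digits by default).
std::string format_with_precision(double x, int precision) {
    std::ostringstream os;
    os << std::setprecision(precision) << x;
    return os.str();
}

// Splits x into integer and fractional parts with std::modf.
double fractional_part(double x, double& integer_part) {
    return std::modf(x, &integer_part);
}
```

Calling fractional_part(2.3 * 100, ip) sets ip to 229 and returns a value just below 1. At precision 6 that value formats as "1", while at precision 17 the digits that were hidden become visible.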
You could use (roughly) the upper error limit that a value represents in floating point to avoid the 2.3 error:
#include <cmath>
#include <limits>
static double dRound(double d) {
    // Distance from d to the next representable value away from zero.
    double inf = std::copysign(std::numeric_limits<double>::infinity(), d);
    double theNumberAfter = std::nextafter(d, inf);
    double epsilon = theNumberAfter - d;
    int factor = 100;
    d *= factor;
    epsilon *= factor / 2;
    d += epsilon; // nudge away from zero by roughly the scaled error bound
    double returnVal;
    std::modf(d, &returnVal);
    return returnVal / factor;
}
Result: http://www.ideone.com/ywmua

Here is a way without rounding:
double double_cut(double d)
{
long long x = d * 100;
return x/100.0;
}
If you want rounding according to the 3rd digit after the decimal point, here is a solution:
double double_cut_round(double d)
{
    long long x = d * 1000;
    if (x > 0)
        x += 5;
    else
        x -= 5;
    return (x / 10) / 100.0; // drop the 3rd digit after adding the rounding offset
}
