How to bound a floating-point arithmetic result?


Floating-point operations like x = a/b are usually not exactly representable, so the CPU has to round. Is it possible to get the two floats x_low and x_up that are, respectively, the largest float less than or equal to the exact value of a/b and the smallest float greater than or equal to a/b?
Some of the conditions are :
a, b, x_low, x_up and x are float
a and b are positive, integers (1.0f, 2.0f, etc)

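This will give you bounds that may be wider than necessary:
#include <cmath>
#include <limits>
#include <utility>
template<typename T>
std::pair<T, T> bounds(int a, int b) {
T ta = a, tb = b;
const T inf = std::numeric_limits<T>::infinity();
// std::nextafter needs a direction argument: step each operand one ulp down and one ulp up
T ta_prev = std::nextafter(ta, -inf), ta_next = std::nextafter(ta, inf);
T tb_prev = std::nextafter(tb, -inf), tb_next = std::nextafter(tb, inf);
return std::make_pair(ta_prev / tb_next, ta_next / tb_prev);
}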

An easy way to do it is to do the division in higher precision and derive the lower/upper bound when converting back to float:
#include <cmath>
#include <limits>
struct float_range {
float lower;
float upper;
};
float_range to_float_range(double d) {
float as_float = static_cast<float>(d);
double rounded = double{as_float};
if (std::isnan(as_float) || rounded == d) {
// No rounding done
return { as_float, as_float };
}
if (rounded < d) {
// rounded down
return { as_float, std::nextafter(as_float, std::numeric_limits<float>::infinity()) };
}
// rounded up
return { std::nextafter(as_float, -std::numeric_limits<float>::infinity()), as_float };
}
float_range precise_divide(float a, float b) {
return to_float_range(double{a}/double{b});
}
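As an alternative that stays in float, the direction of the rounding can be read from the sign of the residual a - q*b, computed with a fused multiply-add. The sketch below is not taken from either answer; the names FloatBounds and divide_bounds are made up for illustration, and it assumes finite inputs, b > 0, and no overflow or underflow.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical result type: the tightest floats enclosing the exact quotient.
struct FloatBounds {
    float lower;
    float upper;
};

// Assumes a and b are finite, b > 0, and nothing overflows or underflows.
FloatBounds divide_bounds(float a, float b) {
    const float inf = std::numeric_limits<float>::infinity();
    float q = a / b;               // quotient rounded to nearest
    float r = std::fma(-q, b, a);  // a - q*b with a single rounding, so its sign is reliable
    if (r == 0.0f) {
        return {q, q};                       // the division was exact
    }
    if (r > 0.0f) {
        return {q, std::nextafter(q, inf)};  // q*b < a, so q lies below a/b
    }
    return {std::nextafter(q, -inf), q};     // q*b > a, so q lies above a/b
}
```

For example, divide_bounds(1.0f, 3.0f) returns two adjacent floats that enclose 1/3, while divide_bounds(1.0f, 2.0f) returns 0.5f for both bounds.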

Related

Trying to understand simple big number calculations

I am trying to better understand how 'big numbers' libraries work, (like GMP for example).
I want to write my own function to Add() / Subtract() / Multiply() / Divide()
The class is traditionally defined ...
std::vector<unsigned char> _numbers; // all the numbers
bool _neg; // positive or negative number
long _decimalPos; // where the decimal point is located
// so 10.5 would be 1
// 10.25 would be 2
// 10 would be 0 for example
First I need to normalise the numbers so I can do
Using 2 numbers
10(x) + 10.25(y) = 20.25
For simplicity, I would make them the same length,
For x:
_numbers = (1,0,0,0) decimal = 2
For y:
_numbers = (1,0,2,5) decimal = 2
And I can then reverse add x to y in a loop
...
// where x is 10.00 and y is 10.25
...
unsigned char carryOver = 0;
size_t totalLen = x._numbers.size();
for (size_t i = totalLen; i > 0 ; --i ) // i > 0 so the most significant digit is included
{
unsigned char sum = x._numbers[i-1] + y._numbers[i-1] + carryOver;
carryOver = 0;
if (sum >= _base) // a sum equal to the base must carry as well
{
sum -= _base;
carryOver = 1;
}
numbers.insert( numbers.begin(), sum);
}
// any left over?
if (carryOver > 0)
{
numbers.insert( numbers.begin(), 1 );
}
// decimal pos is the same for this number as x and y
...
The example above will work for adding two positive numbers, but will soon fall over once I need to add a negative number to a positive number.
And this gets more complicated when it comes to subtracting numbers, then even worse for multiplications and divisions.
Can someone suggest some simple functions to Add() / Subtract() / Multiply() / Divide()
I am not trying to re-write / improve libraries, I just want to understand how they work with numbers.

Addition and subtraction are pretty straightforward
You need to inspect signs and magnitudes of operands and if needed convert the operation to/from +/-. Typical C++ implementation of mine for this is like this:
//---------------------------------------------------------------------------
arbnum arbnum::operator + (const arbnum &x)
{
arbnum c;
// you can skip this if you do not have NaN or Inf support
// this just handles cases like adding inf or NaN or zero
if ( isnan() ) return *this;
if (x.isnan() ) { c.nan(); return c; }
if ( iszero()) { c=x; return c; }
if (x.iszero()) return *this;
if ( isinf() ) { if (x.isinf()) { if (sig==x.sig) return *this;
c.nan(); return c; } return *this; }
if (x.isinf()) { c.inf(); return c; }
// this compares the sign bits if both signs are the same it is addition
if (sig*x.sig>0) { c.add(x,this[0]); c.sig=sig; }
// if not
else{
// compare absolute values (magnitudes)
if (c.geq(this[0],x)) // |this| >= |x| ... return (this-x)
{
c.sub(this[0],x);
c.sig=sig; // use sign of the abs greater operand
}
else { // else return (x-this)
c.sub(x,this[0]);
c.sig=x.sig;
}
}
return c;
}
//---------------------------------------------------------------------------
arbnum arbnum::operator - (const arbnum &x)
{
arbnum c;
if ( isnan() ) return *this;
if (x.isnan() ) { c.nan(); return c; }
if ( iszero()) { c=x; c.sig=-x.sig; return c; }
if (x.iszero()) return *this;
if ( isinf() ) { if (x.isinf()) { if (sig!=x.sig) return *this;
c.nan(); return c; } return *this; }
if (x.isinf()) { c.inf(); c.sig=-x.sig; return c; }
if (x.sig*sig<0) { c.add(x,this[0]); c.sig=sig; }
else{
if (c.geq(this[0],x))
{
c.sub(this[0],x);
c.sig=sig;
}
else {
c.sub(x,this[0]);
c.sig=-x.sig;
}
}
return c;
}
//---------------------------------------------------------------------------
where:
geq is unsigned comparison greater or equal
add is unsigned +
sub is unsigned -
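The three unsigned helpers named above can be sketched on plain digit vectors. This is not the arbnum implementation from the answer: the Digits alias, base 10, and most-significant-digit-first order are assumptions chosen to match the question's layout, and the decimal point is ignored (both operands are assumed to be aligned already).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: base-10 digits, most significant first, no sign.
using Digits = std::vector<unsigned char>;

// Drop leading zeros but always keep at least one digit.
static void trim(Digits &d) {
    std::size_t k = 0;
    while (k + 1 < d.size() && d[k] == 0) ++k;
    d.erase(d.begin(), d.begin() + static_cast<std::ptrdiff_t>(k));
}

// |a| >= |b|, assuming neither operand has leading zeros.
bool geq(const Digits &a, const Digits &b) {
    if (a.size() != b.size()) return a.size() > b.size();
    return a >= b;  // lexicographic order equals numeric order for equal lengths
}

// |a| + |b|
Digits add(const Digits &a, const Digits &b) {
    Digits r;  // built least significant digit first, reversed at the end
    unsigned carry = 0;
    std::size_t i = a.size(), j = b.size();
    while (i > 0 || j > 0 || carry != 0) {
        unsigned s = carry;
        if (i > 0) s += a[--i];
        if (j > 0) s += b[--j];
        r.push_back(static_cast<unsigned char>(s % 10));
        carry = s / 10;
    }
    if (r.empty()) r.push_back(0);
    return Digits(r.rbegin(), r.rend());
}

// |a| - |b|, requires geq(a, b)
Digits sub(const Digits &a, const Digits &b) {
    Digits r;
    int borrow = 0;
    std::size_t i = a.size(), j = b.size();
    while (i > 0) {
        int d = a[--i] - borrow;
        if (j > 0) d -= b[--j];
        borrow = d < 0 ? 1 : 0;
        if (d < 0) d += 10;
        r.push_back(static_cast<unsigned char>(d));
    }
    Digits out(r.rbegin(), r.rend());
    trim(out);
    return out;
}
```

With these in place, the signed operators only have to pick add or sub based on the signs and on geq, which is what the operator + and operator - code above does.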
division is a bit more complicated
see:
bignum divisions
approximational bignum divider
For divisions you need to have already implemented things like +,-,*,<<,>> and for some more advanced approaches you need even things like: absolute comparison (you need them for +/- anyway) , sqr, number of used bits usually separate for fractional and integer part.
The most important is the multiplication see Fast bignum square computation because it is core for most division algorithms.
performance
for some hints see BigInteger numbers implementation and performance
text conversion
If your number is in ASCII or in BASE=10^n digits then this is easy, but if you use BASE=2^n instead for performance reasons then you need fast functions for converting between dec and hex strings, so you can actually load and print numbers to/from your class. See:
How do I convert a very long binary number to decimal?
How to convert a gi-normous integer (in string format) to hex format?

The most accurate way to calculate numerator and denominator of a double

I have implemented class NaturalNum for representing a natural number of "infinite" size (up to 4GB).
I have also implemented class RationalNum for representing a rational number with infinite accuracy. It stores the numerator and the denominator of the rational number, both of which are NaturalNum instances, and relies on them when performing any arithmetic operation issued by the user.
The only place where precision is "dropped by a certain degree", is upon printing, since there's a limit (provided by the user) to the number of digits that appear after the decimal (or non-decimal) point.
My question concerns one of the constructors of class RationalNum. Namely, the constructor that takes a double value, and computes the corresponding numerator and denominator.
My code is given below, and I would like to know if anyone sees a more accurate way for computing them:
RationalNum::RationalNum(double value)
{
if (value == value+1)
throw "Infinite Value";
if (value != value)
throw "Undefined Value";
m_sign = false;
m_numerator = 0;
m_denominator = 1;
if (value < 0)
{
m_sign = true;
value = -value;
}
// Here is the actual computation
while (value > 0)
{
unsigned int floor = (unsigned int)value;
value -= floor;
m_numerator += floor;
value *= 2;
m_numerator *= 2;
m_denominator *= 2;
}
NaturalNum gcd = GCD(m_numerator,m_denominator);
m_numerator /= gcd;
m_denominator /= gcd;
}
Note: variables starting with 'm_' are member variables.
Thanks

The standard library contains a function for obtaining the significand and exponent, frexp.
Multiply the significand by 2^53 (for example with ldexp) so that all of its bits lie before the binary point, and set the denominator accordingly. Keep in mind that frexp normalizes the significand to the range [0.5, 1) (I would consider [1, 2) more natural, but whatever) and that an IEEE double has 53 significant bits (practically no platform in use has a different double format).
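A minimal sketch of that idea follows. The Decomposed struct and decompose function are hypothetical names, a 64-bit integer stands in for the NaturalNum class from the question, and the input is assumed to be finite.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical result: value == (negative ? -1 : +1) * mantissa * 2^exponent
struct Decomposed {
    bool negative;
    std::uint64_t mantissa;  // integer of at most 53 bits, odd unless the value is zero
    int exponent;
};

// Assumes value is finite (no NaN, no infinity).
Decomposed decompose(double value) {
    Decomposed r{std::signbit(value), 0, 0};
    int e = 0;
    double m = std::frexp(std::fabs(value), &e);  // 0.5 <= m < 1, or m == 0
    // Scaling by 2^53 is exact and turns the significand into an integer.
    r.mantissa = static_cast<std::uint64_t>(std::ldexp(m, 53));
    r.exponent = e - 53;
    // Strip trailing zero bits; this is the only reduction a power-of-two denominator needs.
    while (r.mantissa != 0 && (r.mantissa & 1u) == 0) {
        r.mantissa >>= 1;
        ++r.exponent;
    }
    if (r.mantissa == 0) r.exponent = 0;
    return r;
}
```

If exponent is negative, the numerator is mantissa and the denominator is 2^(-exponent); otherwise the numerator is mantissa * 2^exponent and the denominator is 1. Because mantissa is odd, the fraction is already in lowest terms and no GCD is required.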

I'm not 100% confident in the math that you have for the actual computation only because I haven't really examined it, but I think the below method removes the need to use the GCD function which could bring in some unnecessary running time.
Here is the class I came up with. I haven't fully tested it, but I produced a couple billion random doubles and the asserts never fired, so I'm reasonably confident in its usability, but I would still test the edge cases around INT64_MAX a little more.
If I'm not mistaken, the running time complexity of this algorithm is linear with respect to the size in bits of the input.
#include <iostream>
#include <cmath>
#include <cassert>
#include <limits>
class Real;
namespace std {
inline bool isnan(const Real& r);
inline bool isinf(const Real& r);
}
class Real {
public:
Real(double val)
: _val(val)
{
if (std::isnan(val)) { return; }
if (std::isinf(val)) { return; }
double d;
if (modf(val, &d) == 0) {
// already a whole number
_num = val;
_den = 1.0;
return;
}
int exponent;
double significand = frexp(val, &exponent); // val = significand * 2^exponent
double numerator = val;
double denominator = 1;
// 0.5 <= significand < 1.0
// significand is a fraction, multiply it by two until it's a whole number
// subtract exponent appropriately to maintain val = significand * 2^exponent
do {
significand *= 2;
--exponent;
assert(std::ldexp(significand, exponent) == val);
} while (modf(significand, &d) != 0);
assert(exponent <= 0);
// significand is now a whole number
_num = significand;
_den = 1.0 / std::ldexp(1.0, exponent);
assert(_val == _num / _den);
}
friend std::ostream& operator<<(std::ostream &os, const Real& rhs);
friend bool std::isnan(const Real& r);
friend bool std::isinf(const Real& r);
private:
double _val = 0;
double _num = 0;
double _den = 0;
};
std::ostream& operator<<(std::ostream &os, const Real& rhs) {
if (std::isnan(rhs) || std::isinf(rhs)) {
return os << rhs._val;
}
if (rhs._den == 1.0) {
return os << rhs._num;
}
return os << rhs._num << " / " << rhs._den;
}
namespace std {
inline bool isnan(const Real& r) { return std::isnan(r._val); }
inline bool isinf(const Real& r) { return std::isinf(r._val); }
}
#include <iomanip>
int main () {
#define PRINT_REAL(num) \
std::cout << std::setprecision(100) << #num << " = " << num << " = " << Real(num) << std::endl
PRINT_REAL(1.5);
PRINT_REAL(123.875);
PRINT_REAL(0.125);
// double precision issues
PRINT_REAL(-10000000000000023.219238745);
PRINT_REAL(-100000000000000000000000000000000000000000.5);
return 0;
}
Upon looking at your code a little bit more, there's at least a problem with your testing for infinite values. Note the following program:
#include <limits>
#include <cassert>
#include <cmath>
int main() {
{
double d = std::numeric_limits<double>::max(); // about 1.7976931348623e+308
assert(!std::isnan(d));
assert(!std::isinf(d));
// assert(d != d + 1); // fires
}
{
double d = std::ldexp(1.0, 500); // 2^500
assert(!std::isnan(d));
assert(!std::isinf(d));
// assert(d != d + 1); // fires
}
}
In addition to that, if your GCD function doesn't support doubles, then you'll be limiting yourself in terms of values you can import as doubles. Try any number > INT64_MAX and the GCD function may not work.
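A more reliable check than value == value + 1 is to use the classification functions from <cmath>. The sketch below uses hypothetical names (Kind, classify) and only shows the test itself, not the rest of the constructor.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: the three cases the constructor has to tell apart.
enum class Kind { Finite, Infinite, NotANumber };

Kind classify(double value) {
    if (std::isnan(value)) return Kind::NotANumber;
    if (std::isinf(value)) return Kind::Infinite;
    return Kind::Finite;  // includes huge values for which value == value + 1 holds
}
```

With this, large finite values such as std::numeric_limits<double>::max() or 2^500 are accepted instead of being rejected as infinite.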

Implementing a half precision floating point number in C++

I am trying to implement a simple half precision floating point type, entirely for storage purposes (no arithmetic, converts to double implicitly), but I get weird behavior. I get completely wrong values for Half between -0.5 and 0.5. I also get a nasty "offset" for values, for example 0.8 is decoded as 0.7998.
I am very new to C++, so it would be great if you could point out my mistake and help me improve the accuracy a bit. I am also curious how portable this solution is. Thanks!
Here is the output - double value and actual decoded value from the half:
-1 -1
-0.9 -0.899902
-0.8 -0.799805
-0.7 -0.699951
-0.6 -0.599854
-0.5 -0.5
-0.4 -26208
-0.3 -19656
-0.2 -13104
-0.1 -6552
-1.38778e-16 -2560
0.1 6552
0.2 13104
0.3 19656
0.4 26208
0.5 32760
0.6 0.599854
0.7 0.699951
0.8 0.799805
0.9 0.899902
Here is the code so far:
#include <stdint.h>
#include <cmath>
#include <iostream>
using namespace std;
#define EXP 4
#define SIG 11
double normalizeS(uint v) {
return (0.5f * v / 2048 + 0.5f);
}
uint normalizeP(double v) {
return (uint)(2048 * (v - 0.5f) / 0.5f);
}
class Half {
struct Data {
unsigned short sign : 1;
unsigned short exponent : EXP;
unsigned short significant : SIG;
};
public:
Half() {}
Half(double d) { loadFromFloat(d); }
Half & operator = (long double d) {
loadFromFloat(d);
return *this;
}
operator double() {
long double sig = normalizeS(_d.significant);
if (_d.sign) sig = -sig;
return ldexp(sig, _d.exponent /*+ 1*/);
}
private:
void loadFromFloat(long double f) {
long double v;
int exp;
v = frexp(f, &exp);
v < 0 ? _d.sign = 1 : _d.sign = 0;
_d.exponent = exp/* - 1*/;
_d.significant = normalizeP(fabs(v));
}
Data _d;
};
int main() {
Half a[255];
double d = -1;
for (int i = 0; i < 20; ++i) {
a[i] = d;
cout << d << " " << a[i] << endl;
d += 0.1;
}
}

I ended up with a very simple (naive really) solution, capable of representing every value in the range I need: 0 - 64 with precision of 0.001.
Since the idea is to use it for storage, this is actually better because it allows conversion from and to double without any resolution loss. It is also faster. It actually loses some resolution (less than 16 bit) in the name of having a nicer minimum step so it can represent any of the input values without approximation - so in this case LESS is MORE. Using the full 2^10 resolution for the floating component would result in an odd step that cannot represent decimal values accurately.
class Half {
public:
Half() {}
Half(const double d) { load(d); }
operator double() const { return _d.i + ((double)_d.f / 1000); }
private:
struct Data {
unsigned short i : 6;
unsigned short f : 10;
};
void load(const double d) {
int i = d;
_d.i = i;
_d.f = round((d - i) * 1000);
}
Data _d;
};

The last solution was wrong, sorry.
Try changing the exponent to a signed type; it worked here.
The problem is that frexp returns a negative exponent when abs(val) < 0.5, but you store it in an unsigned field, so it is read back as a large positive number. That is why the decoded values are huge when abs(val) < 0.5.
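One way to keep negative exponents without relying on a signed bit-field is to store the exponent with a bias. The sketch below swaps in that technique; it is not the asker's class. The layout (1 sign bit, 4 exponent bits biased by 8, 11 fraction bits, exponent field 0 reserved for zero) and the names pack_half and unpack_half are assumptions made for illustration, and the input is assumed to be finite.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical layout: [sign:1][exponent + 8 : 4][fraction : 11].
// Exponent field 0 is reserved to encode zero.
std::uint16_t pack_half(double d) {
    int exp = 0;
    double m = std::frexp(std::fabs(d), &exp);       // 0.5 <= m < 1, or m == 0 for d == 0
    if (m == 0.0 || exp < -7) return 0;              // zero and too-small values become 0
    if (exp > 7) { exp = 7; m = 1.0 - 1.0 / 4096; }  // clamp too-large values to the maximum
    unsigned frac = static_cast<unsigned>((m - 0.5) * 4096.0);  // 0..2047, truncated
    unsigned sign = std::signbit(d) ? 1u : 0u;
    return static_cast<std::uint16_t>(
        (sign << 15) | (static_cast<unsigned>(exp + 8) << 11) | frac);
}

double unpack_half(std::uint16_t bits) {
    unsigned field = (bits >> 11) & 0xFu;
    if (field == 0) return 0.0;
    double m = 0.5 + (bits & 0x7FFu) / 4096.0;       // restore the significand in [0.5, 1)
    if (bits & 0x8000u) m = -m;
    return std::ldexp(m, static_cast<int>(field) - 8);  // remove the bias
}
```

Values below 0.5 in magnitude now round-trip with a small error instead of exploding. The remaining "offset" (0.4 decodes to roughly 0.3999) comes from truncating the significand to 11 bits.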

Round a double to the closest and greater float

I want to round a big double (> 1e6) to the closest float that is greater than it, using C/C++.
I tried this, but I'm not sure it is always correct, and there may be a faster way to do it:
int main() {
// x is the double we want to round
double x = 100000000005.0;
double y = log10(x) - 7.0;
float a = pow(10.0, y);
float b = (float)x;
//c the closest round up float
float c = a + b;
printf("%.12f %.12f %.12f\n", c, b, x);
return 0;
}
Thank you.

Simply assigning a double to float and back should tell, if the float is larger. If it's not, one should simply increment the float by one unit. (for positive floats). If this doesn't still produce expected result, then the double is larger than supported by a float, in which case float should be assigned to Inf.
float next(double a) {
float b=a;
if ((double)b > a) return b;
return std::nextafter(b, std::numeric_limits<float>::infinity());
}
[Hack] C-version of next_after (on selected architectures would be)
float next_after(float a) {
*(int*)&a += a < 0 ? -1 : 1;
return a;
}
Better way to do it is:
float next_after(float a) {
union { float a; int b; } c = { .a = a };
c.b += a < 0 ? -1 : 1;
return c.a;
}
Both of these self-made hacks ignore Infs and NaNs (and work on non-negative floats only). They rely on the fact that the binary representations of floats are ordered: to get to the next representable float, one simply increments the binary representation by one.
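In C++ the same bit increment can be written with std::memcpy, which avoids the pointer cast and the union. This is a sketch under the same restrictions as the hacks above: it assumes a 32-bit IEEE-754 float and a finite, non-negative input, and the name next_up_nonnegative is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Assumes a is finite and non-negative (+0.0f included); no Inf/NaN handling.
float next_up_nonnegative(float a) {
    std::uint32_t bits;
    std::memcpy(&bits, &a, sizeof bits);  // read the representation without aliasing issues
    ++bits;                               // the next larger float has the next larger bit pattern
    std::memcpy(&a, &bits, sizeof a);
    return a;
}
```

For non-negative finite inputs this agrees with std::nextafter(a, infinity); std::nextafter remains the portable choice.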

If you use C99, you can use the nextafterf function.
#include <stdio.h>
#include <math.h>
#include <float.h>
int main(){
// x is the double we want to round
double x=100000000005.0;
float c = x;
if ((double)c <= x)
c = nextafterf(c, FLT_MAX);
//c the closest round up float
printf("%.12f %.12f\n",c,x);
return 0;
}

C has a nice nextafter function which will help here;
float toBiggerFloat( const double a ) {
const float test = (float) a;
return ((double) test < a) ? nextafterf( test, INFINITY ) : test;
}
Here's a test script which runs it on all classes of numbers (positive/negative, normal/subnormal, infinite, NaN, -0): http://codepad.org/BQ3aqbae (the result: it works fine on all of them)

A C routine to round a float to n significant digits?

Suppose I have a float. I would like to round it to a certain number of significant digits.
In my case n=6.
So say float was f=1.23456999;
round(f,6) would give 1.23457
f=123456.0001 would give 123456
Does anybody know of such a routine?
This website does it: http://ostermiller.org/calc/significant_figures.html

Multiply the number by a suitable scaling factor to move all significant digits to the left of the decimal point. Then round and finally reverse the operation:
#include <math.h>
double round_to_digits(double value, int digits)
{
if (value == 0.0) // otherwise it will return 'nan' due to the log10() of zero
return 0.0;
double factor = pow(10.0, digits - ceil(log10(fabs(value))));
return round(value * factor) / factor;
}
Tested: http://ideone.com/fH5ebt
But as @PascalCuoq pointed out, the rounded value may not be exactly representable as a floating-point value.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *Round(float f, int d)
{
char buf[32];
snprintf(buf, sizeof buf, "%.*g", d, f); // bounded write, so a large d cannot overflow buf
return strdup(buf);
}
int main(void)
{
char *r = Round(1.23456999, 6);
printf("%s\n", r);
free(r);
}
Output is:
1.23457

Something like this should work:
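double round_to_n_digits(double x, int n)
{
if (x == 0.0) return 0.0; // log10(0) is -inf
// scale so that exactly n significant digits sit left of the decimal point
double scale = pow(10.0, n - ceil(log10(fabs(x))));
return round(x * scale) / scale;
}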
Alternatively you could just use sprintf/atof to convert to a string and back again:
double round_to_n_digits(double x, int n)
{
char buff[32];
sprintf(buff, "%.*g", n, x);
return atof(buff);
}
Test code for both of the above functions: http://ideone.com/oMzQZZ
Note that in some cases incorrect rounding may be observed, e.g. as pointed out by @clearScreen in the comments, 13127.15 is rounded to 13127.1 instead of 13127.2.

This should work (except the noise given by floating point precision):
#include <stdio.h>
#include <math.h>
double dround(double a, int ndigits);
double dround(double a, int ndigits) {
int exp_base10 = round(log10(a));
double man_base10 = a*pow(10.0,-exp_base10);
double factor = pow(10.0,-ndigits+1);
double truncated_man_base10 = man_base10 - fmod(man_base10,factor);
double rounded_remainder = fmod(man_base10,factor)/factor;
rounded_remainder = rounded_remainder > 0.5 ? 1.0*factor : 0.0;
return (truncated_man_base10 + rounded_remainder)*pow(10.0,exp_base10) ;
}
int main() {
double a = 1.23456999;
double b = 123456.0001;
printf("%12.12f\n",dround(a,6));
printf("%12.12f\n",dround(b,6));
return 0;
}

If you want to print a float to a string use simple sprintf(). For outputting it just to the console you can use printf():
printf("My float is %.6f", myfloat);
This will output your float with 6 decimal places.

Print to 16 significant digits:
double x = -1932970.8299999994;
char buff[100];
snprintf(buff, sizeof(buff), "%.16g", x);
std::string buffAsStdStr = buff;
std::cout << std::endl << buffAsStdStr ;
