Round a double to the closest and greater float - c++

I want to round a large double (> 1e6) up to the closest float that is greater than it, using C/C++.
I tried this, but I'm not sure it is always correct, and there may be a faster way to do it:
int main() {
// x is the double we want to round
double x = 100000000005.0;
double y = log10(x) - 7.0;
float a = pow(10.0, y);
float b = (float)x;
//c the closest round up float
float c = a + b;
printf("%.12f %.12f %.12f\n", c, b, x);
return 0;
}
Thank you.

Simply assigning the double to a float and converting it back tells you whether the float is larger. If it is not, increment the float by one unit (for positive floats). If this still does not produce the expected result, the double is larger than a float can represent, in which case the float should be set to Inf.
#include <cmath>
#include <limits>
float next(double a) {
float b = static_cast<float>(a); // rounds to the nearest float
if ((double)b > a) return b;
return std::nextafter(b, std::numeric_limits<float>::infinity());
}
[Hack] A C version of next_after (on selected architectures) would be:
float next_after(float a) {
*(int*)&a += a < 0 ? -1 : 1;
return a;
}
A better way to do it is:
float next_after(float a) {
union { float a; int b; } c = { .a = a };
c.b += a < 0 ? -1 : 1;
return c.a;
}
Both of these homemade hacks ignore Infs and NaNs (and work on non-negative floats only); the first one also breaks strict-aliasing rules. The math relies on the fact that the binary representations of floats are ordered. To get to the next representable float, one simply increments the binary representation by one.

If you use C99, you can use the nextafterf function.
#include <stdio.h>
#include <math.h>
#include <float.h>
int main(){
// x is the double we want to round
double x=100000000005.0;
float c = x;
if ((double)c <= x)
c = nextafterf(c, INFINITY); // step toward +Inf so that FLT_MAX can still move up
//c the closest round up float
printf("%.12f %.12f\n",c,x);
return 0;
}

C has a nice nextafterf function which will help here:
float toBiggerFloat( const double a ) {
const float test = (float) a;
return ((double) test < a) ? nextafterf( test, INFINITY ) : test;
}
Here's a test script which runs it on all classes of number (positive/negative, normal/subnormal, infinite, NaN, -0): http://codepad.org/BQ3aqbae (the result is that it works fine on all of them)
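The property such a function is meant to satisfy can be stated as code. This is a sketch under the assumption of IEEE 754 floats; `round_up_to_float` restates the idea of toBiggerFloat, and `is_smallest_float_not_below` is a hypothetical checker, not part of the linked test script.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: smallest float whose value is >= x.
float round_up_to_float(double x) {
    const float f = static_cast<float>(x);  // round to nearest
    return (static_cast<double>(f) < x)
        ? std::nextafter(f, std::numeric_limits<float>::infinity())
        : f;
}

// Hypothetical check: f is at or above x, and the float just below f is below x.
bool is_smallest_float_not_below(double x, float f) {
    const float below = std::nextafter(f, -std::numeric_limits<float>::infinity());
    return static_cast<double>(f) >= x && static_cast<double>(below) < x;
}
```

Near 1e11 adjacent floats are 8192 apart, so the value from the question, 100000000005.0, should come out as 100000006144.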

Related

How to bound a floating-point arithmetic result?

Floating-point operations like x=a/b are usually not exactly representable, so the CPU has to round. Is it possible to get the two floats x_low and x_up that are, respectively, the highest float less than or equal to the exact value of a/b and the lowest float greater than or equal to a/b?
Some of the conditions are:
a, b, x_low, x_up and x are float
a and b are positive, integers (1.0f, 2.0f, etc)
This will give you bounds that might be wider than necessary:
#include <cmath>
#include <limits>
#include <utility>
template<typename T>
std::pair<T, T> bounds(int a, int b) {
const T inf = std::numeric_limits<T>::infinity();
T ta = a, tb = b;
// neighbours just below and just above each operand
T ta_prev = std::nextafter(ta, -inf), ta_next = std::nextafter(ta, inf);
T tb_prev = std::nextafter(tb, -inf), tb_next = std::nextafter(tb, inf);
return std::make_pair(ta_prev / tb_next, ta_next / tb_prev);
}
An easy way is to do the division in higher precision and derive the lower/upper bound when converting to float:
#include <cmath>
#include <limits>
struct float_range {
float lower;
float upper;
};
float_range to_float_range(double d) {
float as_float = static_cast<float>(d);
double rounded = double{as_float};
if (std::isnan(as_float) || rounded == d) {
// No rounding done
return { as_float, as_float };
}
if (rounded < d) {
// rounded down
return { as_float, std::nextafter(as_float, std::numeric_limits<float>::infinity()) };
}
// rounded up
return { std::nextafter(as_float, -std::numeric_limits<float>::infinity()), as_float };
}
float_range precise_divide(float a, float b) {
return to_float_range(double{a}/double{b});
}
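To see what the higher-precision approach yields in practice, here is a compact restatement meant only for checking results. It is a sketch: `divide_bounds` is a hypothetical name, and it assumes IEEE 754 float and double with round-to-nearest conversions.

```cpp
#include <cmath>
#include <limits>
#include <utility>

// Hypothetical helper: (lower, upper) floats enclosing a / b, computed via double.
std::pair<float, float> divide_bounds(float a, float b) {
    const float inf = std::numeric_limits<float>::infinity();
    const double q = static_cast<double>(a) / static_cast<double>(b);
    const float f = static_cast<float>(q);       // nearest float to q
    const double back = static_cast<double>(f);
    if (std::isnan(f) || back == q) return {f, f};        // no rounding happened
    if (back < q) return {f, std::nextafter(f, inf)};     // f was rounded down
    return {std::nextafter(f, -inf), f};                  // f was rounded up
}
```

For 1.0f / 3.0f the two bounds should be adjacent floats on either side of 1/3, while for 6.0f / 2.0f both bounds should be exactly 3.0f.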

Changing the whole part of a number with the decimal part [duplicate]

I have a program in C++ (compiled using g++). I'm trying to apply two doubles as operands to the modulus function, but I get the following error:
error: invalid operands of types 'double' and 'double' to binary 'operator%'
Here's the code:
int main() {
double x = 6.3;
double y = 2;
double z = x % y;
}
The % operator is for integers. You're looking for the fmod() function.
#include <cmath>
int main()
{
double x = 6.3;
double y = 2.0;
double z = std::fmod(x,y);
}
fmod(x, y) is the function you use.
You can implement your own modulus function to do that for you:
double dmod(double x, double y) {
return x - (int)(x/y) * y;
}
Then you can simply use dmod(6.3, 2) to get the remainder, approximately 0.3.
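One caveat with the hand-rolled version is that the (int) cast overflows once x/y leaves the range of int. A possible variant, offered as a sketch rather than a replacement for std::fmod, uses std::trunc instead; `trunc_mod` is a hypothetical name.

```cpp
#include <cmath>

// Hypothetical variant: remainder with the sign of x, like std::fmod,
// using std::trunc so that large quotients do not overflow an integer type.
double trunc_mod(double x, double y) {
    return x - std::trunc(x / y) * y;
}
```

Both trunc_mod(6.3, 2.0) and std::fmod(6.3, 2.0) give a value very close to, but not exactly, 0.3, because 6.3 itself is not exactly representable as a double.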
Use fmod() from <cmath>. If you do not want to include the C header file:
template<typename T, typename U>
constexpr double dmod (T x, U mod)
{
return !mod ? x : x - mod * static_cast<long long>(x / mod);
}
// Usage:
double z1 = dmod<double, unsigned int>(14.3, 4);
double z2 = dmod<long, float>(14, 4.6);
// This also works:
double z3 = dmod(14.7, 0.3);
double z4 = dmod(14.7, 0);
double z5 = dmod(0, 0.3f);
double z6 = dmod(myFirstVariable, someOtherVariable);

Save a float into an integer without losing floating point precision

I want to save the value of a float variable named f in the third element of an array named i in a way that the floating point part isn't wiped (i.e. I don't want to save 1 instead of 1.5). After that, complete the last line in a way that we see 1.5 in the output (don't use cout<<1.5; or cout<<f; or some similar tricks!)
float f=1.5;
int i[3];
i[2] = ... ;
cout<<... ;
Does anybody have any idea?
Use type punning through a union, provided int and float have the same size in your compilation environment:
static_assert(sizeof(int) == sizeof(float), "int and float must have the same size");
int castFloatToInt(float f) {
union { float f; int i; } u;
u.f = f;
return u.i;
}
float castIntToFloat(int i) {
union { float f; int i; } u;
u.i = i;
return u.f;
}
// ...
float f=1.5;
int i[3];
i[2] = castFloatToInt(f);
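cout << castIntToFloat(i[2]);
Going through a union avoids the strict-aliasing problem of casting pointers, which can otherwise let the compiler generate incorrect results when optimizing. Note that reading the inactive union member is well defined in C but formally undefined in C++, although major compilers such as GCC support it as an extension.
This is a common technique for manipulating the bits of a float directly, although normally uint32_t is used instead of int.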
Generally speaking, you cannot store a float in an int without loss of precision.
You could multiply your number with a factor, store it and after that divide again to get some decimal places out of it.
Note that this will not work for all numbers and you have to choose your factor carefully.
float f = 1.5f;
const float factor = 10.0f;
int i[3];
i[2] = static_cast<int>(f * factor);
std::cout << static_cast<float>(i[2]) / factor;
If we can assume that int is 32 bits then you can do it with type-punning:
float f = 1.5;
int i[3];
i[2] = *(int *)&f;
cout << *(float *)&i[2];
but this is getting into Undefined Behaviour territory (breaking aliasing rules), since it accesses a type via a pointer to a different (incompatible) type.
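A way to copy the bits that stays within the rules of both C and C++ is std::memcpy. The following is a minimal sketch, assuming float is 32 bits wide; `float_to_bits` and `bits_to_float` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits");

// Hypothetical helper: copy the bit pattern of a float into a 32-bit integer.
std::int32_t float_to_bits(float f) {
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

// Hypothetical helper: copy a 32-bit pattern back into a float.
float bits_to_float(std::int32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}
```

With these, `i[2] = float_to_bits(f);` followed by `cout << bits_to_float(i[2]);` prints 1.5, and compilers typically optimize the memcpy calls down to a plain register move.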

Implementing a half precision floating point number in C++

I am trying to implement a simple half precision floating point type, entirely for storage purposes (no arithmetic, converts to double implicitly), but I get weird behavior. I get completely wrong values for Half between -0.5 and 0.5. Also I get a nasty "offset" for values, for example 0.8 is decoded as 0.7998.
I am very new to C++, so it would be great if you could point out my mistake and help me improve the accuracy a bit. I am also curious how portable this solution is. Thanks!
Here is the output - double value and actual decoded value from the half:
-1 -1
-0.9 -0.899902
-0.8 -0.799805
-0.7 -0.699951
-0.6 -0.599854
-0.5 -0.5
-0.4 -26208
-0.3 -19656
-0.2 -13104
-0.1 -6552
-1.38778e-16 -2560
0.1 6552
0.2 13104
0.3 19656
0.4 26208
0.5 32760
0.6 0.599854
0.7 0.699951
0.8 0.799805
0.9 0.899902
Here is the code so far:
#include <stdint.h>
#include <cmath>
#include <iostream>
using namespace std;
#define EXP 4
#define SIG 11
double normalizeS(uint v) {
return (0.5f * v / 2048 + 0.5f);
}
uint normalizeP(double v) {
return (uint)(2048 * (v - 0.5f) / 0.5f);
}
class Half {
struct Data {
unsigned short sign : 1;
unsigned short exponent : EXP;
unsigned short significant : SIG;
};
public:
Half() {}
Half(double d) { loadFromFloat(d); }
Half & operator = (long double d) {
loadFromFloat(d);
return *this;
}
operator double() {
long double sig = normalizeS(_d.significant);
if (_d.sign) sig = -sig;
return ldexp(sig, _d.exponent /*+ 1*/);
}
private:
void loadFromFloat(long double f) {
long double v;
int exp;
v = frexp(f, &exp);
v < 0 ? _d.sign = 1 : _d.sign = 0;
_d.exponent = exp/* - 1*/;
_d.significant = normalizeP(fabs(v));
}
Data _d;
};
int main() {
Half a[255];
double d = -1;
for (int i = 0; i < 20; ++i) {
a[i] = d;
cout << d << " " << a[i] << endl;
d += 0.1;
}
}
I ended up with a very simple (naive really) solution, capable of representing every value in the range I need: 0 - 64 with precision of 0.001.
Since the idea is to use it for storage, this is actually better because it allows conversion from and to double without any resolution loss. It is also faster. It actually loses some resolution (less than 16 bit) in the name of having a nicer minimum step so it can represent any of the input values without approximation - so in this case LESS is MORE. Using the full 2^10 resolution for the floating component would result in an odd step that cannot represent decimal values accurately.
class Half {
public:
Half() {}
Half(const double d) { load(d); }
operator double() const { return _d.i + ((double)_d.f / 1000); }
private:
struct Data {
unsigned short i : 6;
unsigned short f : 10;
};
void load(const double d) {
int i = d;
_d.i = i;
_d.f = round((d - i) * 1000);
}
Data _d;
};
My previous solution was wrong, sorry.
Try changing the exponent to signed; it worked here.
The problem is that the exponent becomes negative when abs(val) < 0.5, but you store it as an unsigned number, and that is what makes the decoded value so large when abs(val) < 0.5.
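The effect can be reproduced in isolation. This sketch assumes a typical two's complement platform, where an out-of-range value stored in a 4-bit unsigned bit-field keeps only its low 4 bits; the struct and function names are hypothetical.

```cpp
#include <cmath>

struct UnsignedExp { unsigned short exponent : 4; };  // as in the question
struct SignedExp   { signed short exponent : 4; };    // suggested fix, range -8..7

// Exponent from frexp() as read back from the unsigned 4-bit field.
int exponent_via_unsigned(double v) {
    int e = 0;
    std::frexp(v, &e);
    UnsignedExp u{};
    u.exponent = e;   // a negative e wraps around, e.g. -1 becomes 15
    return u.exponent;
}

// Exponent from frexp() as read back from the signed 4-bit field.
int exponent_via_signed(double v) {
    int e = 0;
    std::frexp(v, &e);
    SignedExp s{};
    s.exponent = e;   // -1 is preserved
    return s.exponent;
}
```

For 0.4, frexp() returns a mantissa of 0.8 and an exponent of -1. Stored unsigned, the exponent reads back as 15, and ldexp(0.8, 15) is roughly 26214, which matches the magnitude of the wrong values in the output above.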

A C routine to round a float to n significant digits?

Suppose I have a float. I would like to round it to a certain number of significant digits.
In my case n=6.
So say float was f=1.23456999;
round(f,6) would give 1.23457
f=123456.0001 would give 123456
Does anybody know of such a routine?
This website does it: http://ostermiller.org/calc/significant_figures.html
Multiply the number by a suitable scaling factor to move all significant digits to the left of the decimal point. Then round and finally reverse the operation:
#include <math.h>
double round_to_digits(double value, int digits)
{
if (value == 0.0) // otherwise it will return 'nan' due to the log10() of zero
return 0.0;
double factor = pow(10.0, digits - ceil(log10(fabs(value))));
return round(value * factor) / factor;
}
Tested: http://ideone.com/fH5ebt
But as @PascalCuoq pointed out, the rounded value may not be exactly representable as a floating-point value.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *Round(float f, int d)
{
char buf[32];
snprintf(buf, sizeof buf, "%.*g", d, f); // bounded write, never overflows buf
return strdup(buf);
}
int main(void)
{
char *r = Round(1.23456999, 6);
printf("%s\n", r);
free(r);
}
Output is:
1.23457
Something like this should work:
double round_to_n_digits(double x, int n)
{
double scale = pow(10.0, n - ceil(log10(fabs(x)))); // x must be nonzero
return round(x * scale) / scale;
}
Alternatively you could just use sprintf/atof to convert to a string and back again:
double round_to_n_digits(double x, int n)
{
char buff[32];
sprintf(buff, "%.*g", n, x);
return atof(buff);
}
Test code for both of the above functions: http://ideone.com/oMzQZZ
Note that in some cases incorrect rounding may be observed, e.g. as pointed out by @clearScreen in the comments below, 13127.15 is rounded to 13127.1 instead of 13127.2.
This should work (except the noise given by floating point precision):
#include <stdio.h>
#include <math.h>
double dround(double a, int ndigits);
double dround(double a, int ndigits) {
int exp_base10 = (int)floor(log10(fabs(a))); // decimal exponent; a must be nonzero
double man_base10 = a*pow(10.0,-exp_base10);
double factor = pow(10.0,-ndigits+1);
double truncated_man_base10 = man_base10 - fmod(man_base10,factor);
double rounded_remainder = fmod(man_base10,factor)/factor;
rounded_remainder = rounded_remainder > 0.5 ? 1.0*factor : 0.0;
return (truncated_man_base10 + rounded_remainder)*pow(10.0,exp_base10) ;
}
int main() {
double a = 1.23456999;
double b = 123456.0001;
printf("%12.12f\n",dround(a,6));
printf("%12.12f\n",dround(b,6));
return 0;
}
If you want to print a float to a string, use sprintf(). To output it just to the console, you can use printf():
printf("My float is %.6f", myfloat);
This will output your float with 6 decimal places (not 6 significant digits; use %.6g for those).
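Decimal places and significant digits can be compared side by side. The following is a small sketch using the standard %f and %g conversions; the helper names `with_decimals` and `with_sig_digits` are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: n digits after the decimal point (%f).
std::string with_decimals(double v, int n) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.*f", n, v);
    return buf;
}

// Hypothetical helper: n significant digits (%g), trailing zeros dropped.
std::string with_sig_digits(double v, int n) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.*g", n, v);
    return buf;
}
```

With the values from the question, with_decimals(123456.0001, 6) yields "123456.000100", whereas with_sig_digits(123456.0001, 6) yields "123456" and with_sig_digits(1.23456999, 6) yields "1.23457".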
Print to 16 significant digits:
double x = -1932970.8299999994;
char buff[100];
snprintf(buff, sizeof(buff), "%.16g", x);
std::string buffAsStdStr = buff;
std::cout << std::endl << buffAsStdStr ;