boost int 128 division to floating point number - c++

I am new in boost. I have 128 bit integer (int128_t boost/multiprecision/cpp_int.hpp) in my project, which I need to divide to floating point number. In my current platform I have limitation and can't use boost/multiprecision/float128.hpp. It's still not supported in clang now https://github.com/boostorg/math/issues/181
Is there any way to this with boost math lib?

Although you can't use float128, Boost has several other implementations of long floating-point types:
cpp_bin_float
cpp_dec_float
gmp_float
mpfr_float
In particular, if you need binary high-precision floating-point type without dependencies on external libraries like GMP, you can use cpp_bin_float. Example:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/multiprecision/cpp_bin_float.hpp>
int main()
{
using LongFloat=boost::multiprecision::cpp_bin_float_quad;
const auto x=boost::multiprecision::int128_t(1234123521);
const auto y=LongFloat(34532.52346246234);
const auto z=LongFloat(x)/y;
std::cout << "Ratio: " << std::setprecision(10) << z << "\n";
}
Here we've used a built-in typedef for 113-bit floating-point number, which has the same precision and range as IEEE 754 binary128. You can choose other parameters for the precision and range, see the docs I've linked to above for details.
Note though, that int128_t has more precision than any kind of float128, because some bits of the latter are used to store its exponent. If that's an issue, be sure to use higher precision.

Perhaps split the int128 into 64-bit numbers?
i128 = h64 * (1<<64) + l64
Then you could easily load those values shift and sum them on the 64bit floating point to get the equivalent number.
Or, as the floating point hardware is actually only 64 bit precision anyway, you could just shift down your int128 until it fits in 64 bit, cast that to float and then shift it back up, but the former may actually be faster because it is simpler.

Related

Largest value representable by a floating-point type smaller than 1

Is there a way to obtain the greatest value representable by the floating-point type float which is smaller than 1.
I've seen the following definition:
static const double DoubleOneMinusEpsilon = 0x1.fffffffffffffp-1;
static const float FloatOneMinusEpsilon = 0x1.fffffep-1;
But is this really how we should define these values?
According to the Standard, std::numeric_limits<T>::epsilon is the machine epsilon, that is, the difference between 1.0 and the next value representable by the floating-point type T. But that doesn't necessarily mean that defining T(1) - std::numeric_limits<T>::epsilon would be better.
You can use the std::nextafter function, which, despite its name, can retrieve the next representable value that is arithmetically before a given starting point, by using an appropriate to argument. (Often -Infinity, 0, or +Infinity).
This works portably by definition of nextafter, regardless of what floating-point format your C++ implementation uses. (Binary vs. decimal, or width of mantissa aka significand, or anything else.)
Example: Retrieving the closest value less than 1 for the double type (on Windows, using the clang-cl compiler in Visual Studio 2019), the answer is different from the result of the 1 - ε calculation (which as discussed in comments, is incorrect for IEEE754 numbers; below any power of 2, representable numbers are twice as close together as above it):
#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>
int main()
{
double naft = std::nextafter(1.0, 0.0);
std::cout << std::fixed << std::setprecision(20);
std::cout << naft << '\n';
double neps = 1.0 - std::numeric_limits<double>::epsilon();
std::cout << neps << '\n';
return 0;
}
Output:
0.99999999999999988898
0.99999999999999977796
With different output formatting, this could print as 0x1.fffffffffffffp-1 and 0x1.ffffffffffffep-1 (1 - ε)
Note that, when using analogous techniques to determine the closest value that is greater than 1, then the nextafter(1.0, 10000.) call gives the same value as the 1 + ε calculation (1.00000000000000022204), as would be expected from the definition of ε.
Performance
C++23 requires std::nextafter to be constexpr, but currently only some compilers support that. GCC does do constant-propagation through it, but clang can't (Godbolt). If you want this to be as fast (with optimization enabled) as a literal constant like 0x1.fffffffffffffp-1; for systems where double is IEEE754 binary64, on some compilers you'll have to wait for that part of C++23 support. (It's likely that once compilers are able to do this, like GCC they'll optimize even without actually using -std=c++23.)
const double DoubleBelowOne = std::nextafter(1.0, 0.); at global scope will at worst run the function once at startup, defeating constant propagation where it's used, but otherwise performing about the same as FP literal constants when used with other runtime variables.
This can be calculated without calling a function by using the characteristics of floating-point representation specified in the C standard. Since the epsilon provides the distance between representable numbers just above 1, and radix provides the base used to represent numbers, the distance between representable numbers just below one is epsilon divided by that base:
#include <iostream>
#include <limits>
int main(void)
{
typedef float Float;
std::cout << std::hexfloat <<
1 - std::numeric_limits<Float>::epsilon() / std::numeric_limits<Float>::radix
<< '\n';
}
0.999999940395355224609375 is the largest 32 bit float that is less than 1. The code below demos this:
Mac_3.2.57$cat float2uintTest4.c
#include <stdio.h>
int main(void){
union{
float f;
unsigned int i;
} u;
//u.f=0.9999;
//printf("as hex: %x\n", u.i); // 0x3f7fffff
u.i=0x3f800000; // 1.0
printf("as float: %200.200f\n", u.f);
u.i=0x3f7fffff; // 1.0-e
//00111111 01111111 11111111 11111111
//seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
printf("as float: %200.200f\n", u.f);
return(0);
}
Mac_3.2.57$cc float2uintTest4.c
Mac_3.2.57$./a.out
as float: 1.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
as float: 0.99999994039535522460937500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Libc hypot function seems to return incorrect results for double type... why?

#include <tgmath.h>
#include <iostream>
int main(int argc, char** argv) {
#define NUM1 -0.031679909079365576
#define NUM2 -0.11491794452567111
std::cout << "double precision :"<< std::endl;
typedef std::numeric_limits< double > dbl;
std::cout.precision(dbl::max_digits10);
std::cout << std::hypot((double)NUM1, (double)NUM2);
std::cout << " VS sqrt :" << sqrt((double )NUM1*(double )NUM1
+ (double )NUM2*(double )NUM2) << std::endl;
std::cout << "long double precision :"<< std::endl;
typedef std::numeric_limits<long double > ldbl;
std::cout.precision(ldbl::max_digits10);
std::cout << std::hypot((long double)NUM1, (long double)NUM2);
std::cout << " VS sqrt :" << sqrt((long double )NUM1*(long double )NUM1 + (long double )NUM2*(long double )NUM2);
}
Returns under Linux (Ubuntu 18.04 clang or gcc, whatever optimisation, glic 2.25):
double precision :
0.1192046585217293 VS sqrt :0.11920465852172932
long double precision :
0.119204658521729311251 VS sqrt :0.119204658521729311251
According to the cppreference :
Implementations usually guarantee precision of less than 1 ulp (units in the last place): GNU, BSD, Open64
std::hypot(x, y) is equivalent to std::abs(std::complex(x,y))
POSIX specifies that underflow may only occur when both arguments are subnormal and the correct result is also subnormal (this forbids naive implementations)
So, hypot((double)NUM1, (double)NUM2) should return 0.11920465852172932, i suppose (as naive sqrt implementation).
On windows, using MSVC 64 bit, this is the case.
Why do we see this difference using glibc ? How is it possible to solve this inconsistency ?
0.11920465852172932 is represented by 0x1.e84324de1b576p-4 (as a double)
0.11920465852172930 is represented by 0x1.e84324de1b575p-4 (as a double)
0.119204658521729311251 is the long-double result, which we can assume is correct to a couple more decimal places. i.e. the exact result is closer to rounded up result.
Those FP bit-patterns differ only in the low bit of the mantissa (aka significand), and the exact result is between them. So they each have less than 1 ulp of rounding error, achieving what typical implementations (including glibc) aim for.
Unlike IEEE-754 "basic" operations (add/sub/mul/div/sqrt), hypot is not required to be "correctly rounded". That means <= 0.5 ulp of error. Achieving that would be much slower for operations the HW doesn't provide directly. (e.g. do extended-precision calculation with at least a couple extra definitely-correct bits, so you can round to the nearest double, like the hardware does for basic operations)
It happens that in this case, the naive calculation method produced the correctly-rounded result while glibc's "safe" implementation of std::hypot (that has to avoid underflow when squaring small numbers before adding) produced a result with >0.5 but <1 ulp of error.
You didn't specify whether you were using MSVC in 32-bit mode.
Presumably 32-bit mode would be using x87 for FP math, giving extra temporary precision. Although some MSVC versions' CRT code sets the x87 FPU's internal precision to round to 53-bit mantissa after every operation, so it behaves like SSE2 using actual double, except with a wider exponent range. See Bruce Dawson's blog post.
So I don't know if there's any reason beyond luck that MSVC's std::hypot got the correctly-rounded result for this.
Note that long double in MSVC is the same type as 64-bit double; that C++ implementation doesn't expose x86 / x86-64's 80-bit hardware extended-precision type. (64-bit mantissa).

Is a 32 bit normalized floating point number that hasn't been operated on, same on any platform/compiler?

For example:
float a = 3.14159f;
If I was to inspect the bits in this number (or any other normalized floating point number), what are the chances that the bits are different in some other platform/compiler combination, or is that possible?
Not necessarily: The c++ standard doesn't define floating point representation (it doesn't even define the representation of signed integers), although most platforms probably orient themselves on the same IEEE standard (IEEE 754-2008?).
Your question can be rephrased as: Will the final assertion in the following code always be upheld, no matter what platform you run it on?
#include <cassert>
#include <cstring>
#include <cstdint>
#include <limits>
#if __cplusplus < 201103L // no static_assert prior to C++11
#define static_assert(a,b) assert(a)
#endif
int main() {
float f = 3.14159f;
std::uint32_t i = 0x40490fd0;// IEC 659/IEEE 754 representation
static_assert(std::numeric_limits<float>::is_iec559, "floating point must be IEEE 754");
static_assert(sizeof(f) == sizeof(i), "float must be 32 bits wide");
assert(std::memcmp(&f, &i, sizeof(f)) == 0);
}
Answer: There's nothing in the C++ standard that guarantees that the assertion will be upheld. Yet, on most sane platforms the assertion will hold and the code won't abort, no matter if the platform is big- or little-endian. As long as you only care that your code works on some known set of platforms, it'll be OK: you can verify that the tests pass there :)
Realistically speaking, some compilers might use a sub-par decimal-to-IEEE-754 conversion routine that doesn't properly round the result, so if you specify f to enough digits of precision, it might be a couple of LSBs of mantissa off from the value that would be nearest to the decimal representation. And then the assertion won't hold anymore. For such platforms, you might wish to test a couple mantissa LSBs around the desired one.

c++ precision issue in storing floating point numbers

I'm handling some mathematical calculation.
I'm losing precision. But i need extreme precision.
I then used to check the precision issue with the code given below.
Any solution for getting the precision?
#include <iostream>
#include <stdlib.h>
#include <cstdio>
#include <sstream>
#include <iomanip>
using namespace std;
int main(int argc,char** arvx)
{
float f = 1.00000001;
cout << "f: " <<std::setprecision(20)<< f <<endl;
return 0;
}
Output is
f: 1
If you truly want precise representation of these sorts of numbers (ie, with very small fractional components many places beyond the decimal point), then floating point types like float or even the much more precise double may still not give you the exact results you are looking for in all circumstances. Floating point types can only approximate some values with small fractional components.
You may need to use some sort of high precision fixed point C++ type in order to get exact representation of very small fractions in your values, and resulting accurate calculated results when you perform mathematical operations on such numbers. The following question/answers may provide you with some useful pointers: C++ fixed point library?
in c++
float f = 1.00000001;
support only 6 digits after decimal point
float f = 1.000001;
if you want more real calculation use double

Rounding to use for int -> float -> int round trip conversion

I'm writing a set of numeric type conversion functions for a database engine, and I'm concerned about the behavior of converting large integral floating-point values to integer types with greater precision.
Take for example converting a 32-bit int to a 32-bit single-precision float. The 23-bit significand of the float yields about 7 decimal digits of precision, so converting any int with more than about 7 digits will result in a loss of precision (which is fine and expected). However, when you convert such a float back to an int, you end up with artifacts of its binary representation in the low-order digits:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
int a = 2147483000;
cout << a << endl;
float f = (float)a;
cout << setprecision(10) << f << endl;
int b = (int)f;
cout << b << endl;
return 0;
}
This prints:
2147483000
2147483008
2147483008
The trailing 008 is beyond the precision of the float, and therefore seems undesirable to retain in the int, since in a database application, users are primarily concerned with decimal representation, and trailing 0's are used to indicate insignificant digits.
So my questions are: Are there any well-known existing systems that perform decimal significant digit rounding in float -> int (or double -> long long) conversions, and are there any well-known, efficient algorithms for doing so?
(Note: I'm aware that some systems have decimal floating-point types, such as those defined by IEEE 754-2008. However, they don't have mainstream hardware support and aren't built into C/C++. I might want to support them down the road, but I still need to handle binary floats intuitively.)
std::numeric_limits<float>::digits10 says you only get 6 precise digits for float.
Pick an efficient algorithm for your language, processor, and data distribution to calculate-the-decimal-length-of-an-integer (or here). Then subtract the number of digits that digits10 says are precise to get the number of digits to cull. Use that as an index to lookup a power of 10 to use as a modulus. Etc.
One concern: Let's say you convert a float to a decimal and perform this sort of rounding or truncation. Then convert that "adjusted" decimal to a float and back to a decimal with the same rounding/truncation scheme. Do you get the same decimal value? Hopefully yes.
This isn't really what you're looking for but may be interesting reading: A Proposal to add a max significant decimal digits value to the C++ Standard Library Numeric limits
Naturally, 2147483008 has trailing zeros if you write it in binary (1111111111111111111110110000000) or hexadecimal (0b0x7FFFFD80). The most "correct" thing to do would be to track insignificant digits in any of those forms instead.
Alternatively, you could just zero all digits after the first seven significant ones in the int (ideally by rounding) after converting to it from a float, since the float contains approximately seven significant digits.