Rounding large float to int - c++

Problem: I'm looking for a way of rounding some float f to the closest int in general, especially if f is large. Mathematically speaking, I'd like to compute the following function:
$$r(f) = \operatorname*{arg\,min}_{i \in \mathcal{T}} \, |f - i|$$
where \mathcal{T} (script T) denotes the set of ints representable by my machine. In case of ties (e.g. .5), r(f) can be defined arbitrarily.
Current Code: Below is my current solution, including two unsatisfying float examples (in main):
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
template <class T>
T projection(T const min, T t, T const max) {
return std::max(std::min(t, max), min);
}
template <class Out, class In>
Out repr(In in) {
using Limits = std::numeric_limits<Out>;
auto next = [](Out val) {
auto const zero = static_cast<In>(0);
return std::nexttoward(static_cast<In>(val), zero);
};
return projection(next(Limits::lowest()), std::round(in), next(Limits::max()));
};
int main() {
std::cout
<< repr<int>(std::numeric_limits<float>::max()) << " "
<< repr<int>(static_cast<float>(std::numeric_limits<int>::max())) << "\n";
}
On my machine with 32bit ints this prints:
2147483520 2147483520
Short elaboration: For the upper bound, next computes the next smaller float that can be safely static_casted to int (analogously for lower bound). This is necessary as my float examples in main demonstrate: Without next, repr involves undefined behavior of casting (at least) std::numeric_limits<int>::max() + 1 as float to int in which this number is not representable.
The obvious downside of my repr is that it is incorrect in the mathematical sense: For large floats (eg. std::numeric_limits<float>::max()) it doesn't return std::numeric_limits<int>::max().
Questions:
Is there an easier way to solve the problem (easier in the sense of less manual number crunching and more delegating to std-functions)?
How can repr be made correct (in the mathematical sense) with fully defined behavior only (no undefined and no implementation defined behavior)?
So far I've been talking about int and float but (as templates already suggested) this should only be a start. What about combinations
double and long or
double and long long?
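One possible direction, offered as a sketch rather than a definitive answer: compare the rounded value against 2^digits, which equals max() + 1 and is a power of two, so a binary floating-point type represents it exactly. The name round_clamped and the choice to map NaN to 0 are assumptions made for illustration; the sketch also assumes a signed two's-complement Out (guaranteed from C++20 on) and a binary In whose range covers 2^digits.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the question): rounds `in` to the nearest Out,
// saturating at Out's limits instead of casting an out-of-range value.
template <class Out, class In>
Out round_clamped(In in) {
    using Limits = std::numeric_limits<Out>;
    static_assert(Limits::is_integer && Limits::is_signed,
                  "sketch covers signed integer outputs only");
    static_assert(std::numeric_limits<In>::radix == 2,
                  "assumes binary floating point");
    static_assert(std::numeric_limits<In>::max_exponent > Limits::digits,
                  "In must be able to represent 2^digits");

    if (std::isnan(in)) return 0;  // arbitrary choice: the question leaves NaN open

    // 2^digits == max() + 1; being a power of two, it is exact in In.
    const In upper = std::ldexp(In(1), Limits::digits);
    const In rounded = std::round(in);

    if (rounded >= upper) return Limits::max();
    if (rounded <= -upper) return Limits::lowest();  // -2^digits == lowest()
    return static_cast<Out>(rounded);  // strictly between lowest() and max() + 1
}
```

With 32-bit int, both examples from main in the question, repr<int>(std::numeric_limits<float>::max()) and repr<int>(static_cast<float>(std::numeric_limits<int>::max())), would come out as std::numeric_limits<int>::max() when routed through round_clamped<int> instead. The same template covers double with long or long long, since 2^63 is exactly representable as a double.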

Related

Largest value representable by a floating-point type smaller than 1

Is there a way to obtain the greatest value representable by the floating-point type float which is smaller than 1?
I've seen the following definition:
static const double DoubleOneMinusEpsilon = 0x1.fffffffffffffp-1;
static const float FloatOneMinusEpsilon = 0x1.fffffep-1;
But is this really how we should define these values?
According to the Standard, std::numeric_limits<T>::epsilon is the machine epsilon, that is, the difference between 1.0 and the next value representable by the floating-point type T. But that doesn't necessarily mean that defining T(1) - std::numeric_limits<T>::epsilon would be better.
You can use the std::nextafter function, which, despite its name, can retrieve the next representable value that is arithmetically before a given starting point, by passing an appropriate value for its second argument, to (often -Infinity, 0, or +Infinity).
This works portably by definition of nextafter, regardless of what floating-point format your C++ implementation uses. (Binary vs. decimal, or width of mantissa aka significand, or anything else.)
Example: the code below retrieves the closest value less than 1 for the double type (on Windows, using the clang-cl compiler in Visual Studio 2019). The answer is different from the result of the 1 - ε calculation, which, as discussed in the comments, is incorrect for IEEE754 numbers: just below any power of 2, representable numbers are twice as close together as just above it.
#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>
int main()
{
double naft = std::nextafter(1.0, 0.0);
std::cout << std::fixed << std::setprecision(20);
std::cout << naft << '\n';
double neps = 1.0 - std::numeric_limits<double>::epsilon();
std::cout << neps << '\n';
return 0;
}
Output:
0.99999999999999988898
0.99999999999999977796
With different output formatting, this could print as 0x1.fffffffffffffp-1 and 0x1.ffffffffffffep-1 (1 - ε).
Note that, when using analogous techniques to determine the closest value that is greater than 1, then the nextafter(1.0, 10000.) call gives the same value as the 1 + ε calculation (1.00000000000000022204), as would be expected from the definition of ε.
Performance
C++23 requires std::nextafter to be constexpr, but currently only some compilers support that. GCC does do constant-propagation through it, but clang can't (Godbolt). If you want this to be as fast (with optimization enabled) as a literal constant like 0x1.fffffffffffffp-1 (for systems where double is IEEE754 binary64), on some compilers you'll have to wait for that part of C++23 support. (It's likely that once compilers are able to do this, like GCC they'll optimize even without actually using -std=c++23.)
const double DoubleBelowOne = std::nextafter(1.0, 0.); at global scope will at worst run the function once at startup, defeating constant propagation where it's used, but otherwise performing about the same as FP literal constants when used with other runtime variables.
This can be calculated without calling a function by using the characteristics of floating-point representation specified in the C standard. Since the epsilon provides the distance between representable numbers just above 1, and radix provides the base used to represent numbers, the distance between representable numbers just below one is epsilon divided by that base:
#include <iostream>
#include <limits>
int main(void)
{
typedef float Float;
std::cout << std::hexfloat <<
1 - std::numeric_limits<Float>::epsilon() / std::numeric_limits<Float>::radix
<< '\n';
}
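As a cross-check of the epsilon / radix formula, here is a minimal sketch that computes the value as a constant expression and compares it with what std::nextafter returns. The names below_one and matches_nextafter are made up for illustration.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: largest value of T below 1, from epsilon and radix.
template <typename T>
constexpr T below_one() {
    return T(1) - std::numeric_limits<T>::epsilon() / std::numeric_limits<T>::radix;
}

// Hypothetical cross-check: does the formula agree with the library call?
template <typename T>
bool matches_nextafter() {
    return below_one<T>() == std::nextafter(T(1), T(0));
}
```

Because below_one is constexpr, it can initialise a compile-time constant even on compilers where std::nextafter is not yet constexpr.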
0.999999940395355224609375 is the largest 32 bit float that is less than 1. The code below demos this:
Mac_3.2.57$cat float2uintTest4.c
#include <stdio.h>
int main(void){
union{
float f;
unsigned int i;
} u;
//u.f=0.9999;
//printf("as hex: %x\n", u.i); // 0x3f7fffff
u.i=0x3f800000; // 1.0
printf("as float: %200.200f\n", u.f);
u.i=0x3f7fffff; // 1.0-e
//00111111 01111111 11111111 11111111
//seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
printf("as float: %200.200f\n", u.f);
return(0);
}
Mac_3.2.57$cc float2uintTest4.c
Mac_3.2.57$./a.out
as float: 1.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
as float: 0.99999994039535522460937500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Incorrect output when converting from int to float with static_cast<float> c++

I do not understand why my output is coming out as wrong.
I have attached my code and my result (with the issue highlighted).
I am adding a coin through the insertCoin function in the VendingMachine class.
As soon as I add 10 cents through this function, it prints an error: Does not accept 10 cents.
I am converting the input to float using static_cast. I have spent a good amount of time on this and, at this point, I feel I just cannot see the issue, probably because I don't understand some concept.
Also as a quick background, new to c++ and trying to get my object oriented programming up to date.
Trying to make a Vending machine product hehe.
Thank you again !
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <utility>
#include <stdlib.h>
using namespace std;
class Item
{
private:
string m_name;
float m_cost;
public:
Item(string t_name,float t_cost):m_name(t_name),m_cost(t_cost)
{
}
string getName()
{
return m_name;
}
float getCost()
{
return m_cost;
}
};
class VendingMachine
{
private:
vector<Item> m_totalItems;
unordered_map<string,Item>m_products;
float m_remainingCharges{0};
float m_moneyInserted{0};
size_t itemCount{0};
public:
VendingMachine()
{
}
void addItemToVendingMachine(string t_name,size_t t_cost)
{
float temp=static_cast<float>(t_cost)/static_cast<float>(100);
Item item(t_name,temp);
m_totalItems.push_back(item);
m_products.insert(make_pair(t_name,item));
}
bool chooseProduct(string t_name)
{
for(auto item:m_totalItems)
{
if(item.getName()==t_name)
{
m_remainingCharges=item.getCost();
return true;
}
itemCount++;
}
cout<<"Item not currently available: "+t_name<<endl;
return false;
}
void insertCoin(size_t t_coin)
{
float temp=static_cast<float>(t_coin);
if(t_coin<=50)
{
temp/=100;
cout<<temp<<endl;
}
if(temp==0.01 or temp==0.05 or temp==0.1 or temp==1.00 or temp==2.00 or temp==0.25 or temp==0.50)
{
m_moneyInserted+=temp;
m_remainingCharges-=m_moneyInserted;
}
else
{
cout<<"Does not accept: "<< t_coin<<" ,please insert correct coin."<<endl;
return;
}
}
pair<Item,float> getProduct()
{
auto item=m_totalItems[itemCount];
auto itemBack=m_totalItems.back();
m_totalItems[itemCount]=itemBack;
m_totalItems.pop_back();
return make_pair(item,abs(m_remainingCharges));
}
float refund()
{
if(m_remainingCharges<0)
return abs(m_remainingCharges);
else
return 0;
}
void resetOperator()
{
m_remainingCharges=0;
m_moneyInserted=0;
itemCount=0;
}
};
int main()
{
Item item("Candy",0.50);
cout<<item.getName()<<" ";
cout<<item.getCost()<<endl;
VendingMachine machine;
machine.addItemToVendingMachine("CANDY",10);
machine.addItemToVendingMachine("SNACK",50);
machine.addItemToVendingMachine("Coke",25);
machine.insertCoin(10);
machine.insertCoin(25);
machine.insertCoin(50);
machine.chooseProduct("CANDY");
auto temp=machine.getProduct();
cout<<temp.first.getName()<<endl;
cout<<temp.second<<endl;
machine.resetOperator();
return 0;
};
Floating-point numbers are very difficult to compare for exact equality, so instead of == you have to use a "very nearly equals" test based on an epsilon (a small tolerance). See this post for more detail:
What is the most effective way for float and double comparison?
Personally I would recommend changing your code to work in integer units of cents in order to avoid having to implement the "very nearly equals" pattern throughout your code.
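For reference, a "very nearly equals" test could look like the sketch below. The name nearly_equal and the default tolerance of 1e-6 are assumptions chosen for illustration, not values from the linked post; a suitable tolerance depends on the magnitudes involved.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most rel_tol relative to
// the larger magnitude. The floor of 1.0 turns this into an absolute tolerance
// for values near zero.
bool nearly_equal(double a, double b, double rel_tol = 1e-6) {
    const double scale = std::max({1.0, std::fabs(a), std::fabs(b)});
    return std::fabs(a - b) <= rel_tol * scale;
}
```

In insertCoin, a check such as nearly_equal(temp, 0.1) would then accept the float value that 10 / 100 produces, where temp == 0.1 does not.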
Why doesn't it work as expected?
This is not specific to C++ but to floating point numbers.
The C++ standard doesn't tell how floating points are encoded. But in general, floating point numbers are encoded using power of two fractions. And with these schemes, some decimal numbers have no exact match and instead, the closest match is used, with some approximation.
For example, the most popular encoding is probably IEEE-754. This nice online converter shows that there is no exact match for 0.1. The closest single-precision approximation is 0.100000001490116119384765625.
If you print this value out, with all the rounding, everything will seem fine.
But if you compare for equality with ==, the value must be exactly the same. Unfortunately different calculations may make different roundings. So you will get two different numbers, very close to 0.1 but different from each other.
In your case, the literal value 0.1 is not a float: it's a double, which has a higher precision than a float and hence makes different approximations.
Practical evidence
If you're not convinced, try the following changes:
if(t_coin<=50)
{
temp/=100;
cout.precision(10); // print more digits
cout<<scientific<<"temp:"<<temp<<" vs.double "<<0.1
<<" or float "<<0.1f<<endl;
}
and then try to compare floats with floats:
if(temp==0.01f or temp==0.05f or temp==0.1f
or temp==1.00f or temp==2.00f
or temp==0.25f or temp==0.50f)
Online demo
How to solve it?
The best option here is to work with integers and count the cents, as someone suggested in the comments. There is no floating-point-induced approximation in such calculations, so for money it's ideal.
A workaround is to align all the floating-point numbers to either double or float (by adding a trailing f to the numeric literals). This would work when comparing constant values, provided there can be no rounding issue in any of the calculations.
Another solution is to replace the strict equality comparison with an inequality checking that the difference between both numbers is very small. This is the epsilon approach proposed by Jason in the other answer, comparing the values with the help of a function almost_equal() defined as explained here.
Not related: don't use size_t for representing money. This is misleading for readers and maintainers ;-). If you want an unsigned int, or an unsigned long, or an unsigned long long, say so.
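The integer option could be sketched as follows. The list of denominations (1, 5, 10, 25 and 50 cents plus 1 and 2 dollar coins) is an assumption read off the values tested in insertCoin, and the names kAcceptedCoins, is_accepted_coin and CoinBox are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Assumed denominations, in cents.
constexpr std::array<unsigned, 7> kAcceptedCoins = {{1, 5, 10, 25, 50, 100, 200}};

// Exact integer comparison: no rounding is involved.
bool is_accepted_coin(unsigned cents) {
    return std::find(kAcceptedCoins.begin(), kAcceptedCoins.end(), cents)
           != kAcceptedCoins.end();
}

// Hypothetical stand-in for the money-tracking part of VendingMachine.
struct CoinBox {
    unsigned inserted_cents = 0;

    // Returns false and leaves the balance untouched for unknown coins.
    bool insert(unsigned cents) {
        if (!is_accepted_coin(cents)) return false;
        inserted_cents += cents;
        return true;
    }
};
```

Prices would be stored in cents as well, with a conversion to a decimal string only when printing.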

Efficient division of an int by intmax

I have an integer of type uint32_t and would like to divide it by a maximum value of uint32_t and obtain the result as a float (in range 0..1).
Naturally, I can do the following:
float result = static_cast<float>(static_cast<double>(value) / static_cast<double>(std::numeric_limits<uint32_t>::max()));
This is, however, quite a lot of conversions on the way, and the division itself may be expensive.
Is there a way to achieve the above operation faster, without division and excess type conversions? Or maybe I shouldn't worry because modern compilers are able to generate an efficient code already?
Edit: division by MAX+1, effectively giving me a float in range [0..1) would be fine too.
A bit more context:
I use the above transformation in a time-critical loop, with uint32_t being produced from a relatively fast random-number generator (such as pcg). I expect that the conversions/divisions from the above transformation may have some noticeable, albeit not overwhelming, negative impact on the performance of my code.
This sounds like a job for:
std::uniform_real_distribution<float> dist(0.f, 1.f);
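I would trust that to give you an unbiased conversion to float in the range [0, 1) as efficiently as possible. (In practice, rounding inside some standard library implementations can make the float version return exactly 1, as the output of the example below shows.) If you want the range to be [0, 1] you could use this:
std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f));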
Here's an example with two instances of a not-so-random number generator that generates min and max for uint32_t:
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
struct ui32gen {
using result_type = uint32_t; // expected of a uniform random bit generator
constexpr ui32gen(uint32_t x) : value(x) {}
uint32_t operator()() { return value; }
static constexpr uint32_t min() { return 0; }
static constexpr uint32_t max() { return std::numeric_limits<uint32_t>::max(); }
uint32_t value;
};
int main() {
ui32gen min(ui32gen::min());
ui32gen max(ui32gen::max());
std::uniform_real_distribution<float> dist(0.f, 1.f);
std::cout << dist(min) << "\n";
std::cout << dist(max) << "\n";
}
Output:
0
1
Is there a way to achieve the operation faster, without division
and excess type conversions?
If you want to manually do something similar to what uniform_real_distribution does (but much faster, and slightly biased towards lower values), you can define a function like this:
// [0, 1) the common range
inline float zero_to_one_exclusive(uint32_t value) {
static const float f_mul =
std::nextafter(1.f / float(std::numeric_limits<uint32_t>::max()), 0.f);
return float(value) * f_mul;
}
It uses multiplication instead of division since that often is a bit faster (than your original suggestion) and only has one type conversion. Here's a comparison of division vs. multiplication.
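Another way to get [0, 1), shown here as a sketch that is not part of the answer above: keep only the top 24 bits of the input, which a float with a 24-bit significand holds exactly, and scale by 2^-24. Every result is then an exact multiple of 2^-24 and can never round up to 1. The name top24_to_unit is made up, and whether this is faster than the versions above would need measuring.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical alternative: map a 32-bit value to [0, 1) using its top 24 bits.
inline float top24_to_unit(std::uint32_t value) {
    static_assert(std::numeric_limits<float>::digits == 24,
                  "assumes a 24-bit significand (IEEE-754 binary32)");
    constexpr float scale = 1.0f / 16777216.0f;    // 2^-24, exactly representable
    return static_cast<float>(value >> 8) * scale; // conversion and product are exact
}
```

The low 8 bits of the generator output are discarded, so this trades resolution near zero for a result that needs no rounding.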
If you really want the range to be [0, 1], you can do like below, which will also be slightly biased towards lower values compared to what std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f)) would produce:
// [0, 1] the not so common range
inline float zero_to_one_inclusive(uint32_t value) {
static const float f_mul = 1.f/float(std::numeric_limits<uint32_t>::max());
return float(value) * f_mul;
}
Here's a benchmark comparing uniform_real_distribution to zero_to_one_exclusive and zero_to_one_inclusive.
Two of the casts are superfluous. You don't need to cast to float when you assign to a float anyway. Also, it is sufficient to cast one of the operands to avoid integer arithmetic. So we are left with
float result = static_cast<double>(value) / std::numeric_limits<uint32_t>::max();
This last cast you cannot avoid (otherwise you would get integer arithmetic).
Or maybe I shouldn't worry because modern compilers are able to
generate an efficient code already?
Definitely a yes and no! Yes, trust that the compiler knows best how to optimize code, and write for readability first. And no, don't trust blindly. Look at the output of the compiler. Compare different versions and measure them.
Is there a way to achieve the above operation faster, without division
[...] ?
Probably yes. Dividing by std::numeric_limits<uint32_t>::max() is so special that I wouldn't be too surprised if the compiler comes up with some tricks. My first approach would again be to look at the output of the compiler and maybe compare different compilers. Only if the compiler's output turns out to be suboptimal would I bother with manual bit-fiddling.
For further reading this might be of interest: How expensive is it to convert between int and double? . TL;DR: it actually depends on the hardware.
If performance were a real concern I think I'd be inclined to represent this 'integer that is really a fraction' in its own class and perform any conversion only where necessary.
For example:
#include <iostream>
#include <cstdint>
#include <limits>
struct fraction
{
using value_type = std::uint32_t;
constexpr explicit fraction(value_type num = 0) : numerator_(num) {}
static constexpr auto denominator() -> value_type { return std::numeric_limits<value_type>::max(); }
constexpr auto numerator() const -> value_type { return numerator_; }
constexpr auto as_double() const -> double {
return double(numerator()) / denominator();
}
constexpr auto as_float() const -> float {
return float(as_double());
}
private:
value_type numerator_;
};
auto generate() -> std::uint32_t;
int main()
{
auto frac = fraction(generate());
// use/manipulate/display frac here ...
// ... and finally convert to double/float if necessary
std::cout << frac.as_double() << std::endl;
}
However if you look at code gen on godbolt you'll see that the CPU's floating point instructions take care of the conversion. I'd be inclined to measure performance before you run the risk of wasting time on early optimisation.

How does this float square root approximation work?

I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works?
float sqrt(float f)
{
const int result = 0x1fbb4000 + (*(int*)&f >> 1);
return *(float*)&result;
}
I've tested it a bit and its output is off from std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root and I guess it's something similar here (without the Newton iteration), but I'd really appreciate an explanation of how it works.
(Note: I've tagged it both c and c++ since it's valid-ish (see comments) C and C++ code.)
(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root [1].
Why almost? In IEEE-754, the actual exponent is e - 127, where e is the stored exponent field [2]. To divide this by two, we'd need e/2 - 63.5 (roughly e/2 - 64), but the shift alone gives us an actual exponent of e/2 - 127. So we need to add about 63 to the resulting exponent. This is contributed by bits 30-23 of that magic constant (0x1fbb4000).
I'd imagine the remaining bits of the magic constant have been chosen to minimise the maximum error across the mantissa range, or something like that. However, it's unclear whether it was determined analytically, iteratively, or heuristically.
It's worth pointing out that this approach is somewhat non-portable. It makes (at least) the following assumptions:
The platform uses single-precision IEEE-754 for float.
The endianness of float representation.
That you will be unaffected by undefined behaviour due to the fact this approach violates C/C++'s strict-aliasing rules.
Thus it should be avoided unless you're certain that it gives predictable behaviour on your platform (and indeed, that it provides a useful speedup vs. sqrtf!).
[1] sqrt(a^b) = (a^b)^0.5 = a^(b/2)
[2] See e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding
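To make the exponent arithmetic concrete, here is a sketch that traces f = 4.0 through the same shift-and-add. It uses memcpy and unsigned bits instead of the pointer casts from the question, so it only matches the original for non-negative inputs; the names approx_sqrt and float_bits are made up, and float is assumed to be IEEE-754 binary32.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: the raw bit pattern of a float.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Same arithmetic as the question's snippet, for non-negative inputs.
inline float approx_sqrt(float f) {
    // 4.0f            -> 0x40800000 (stored exponent 129, mantissa 0)
    // >> 1            -> 0x20400000 (stored exponent 64; the exponent's low
    //                                bit has moved into the mantissa)
    // + 0x1fbb4000    -> 0x3ffb4000 (stored exponent 64 + 63 = 127)
    // as a float      -> 1.962890625, about 1.9% below sqrt(4) = 2
    const std::uint32_t bits = (float_bits(f) >> 1) + 0x1fbb4000u;
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

Note how the odd exponent bit that falls into the mantissa is what lets the integer 63 in the constant stand in for the 63.5 that exact halving would require.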
See Oliver Charlesworth’s explanation of why this almost works. I’m addressing an issue raised in the comments.
Since several people have pointed out the non-portability of this, here are some ways you can make it more portable, or at least make the compiler tell you if it won’t work.
First, C++ allows you to check std::numeric_limits<float>::is_iec559 at compile time, such as in a static_assert. You can also check that sizeof(int) == sizeof(float), which will not be true if int is 64-bits, but what you really want to do is use uint32_t, which if it exists will always be exactly 32 bits wide, will have well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions have no run-time cost and you should always check your preconditions this way if possible.
Unfortunately, the test of whether converting the bits in a float to a uint32_t and shifting is big-endian, little-endian or neither cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you might want to put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.
You do not want to use the unsafe pointer cast, and there are some systems I’ve worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is with memcpy(). In my example below, I type-pun with a union, which works on any actually-existing implementation. (Language lawyers object to it, but no successful compiler will ever break that much legacy code silently.) If you must do a pointer conversion (see below) there is alignas(). But however you do it, the result will be implementation-defined, which is why we check the result of converting and shifting a test value.
Anyway, not that you’re likely to use it on a modern CPU, here’s a gussied-up C++14 version that checks those non-portable assumptions:
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>
using std::cout;
using std::endl;
using std::size_t;
using std::sqrt;
using std::uint32_t;
template <typename T, typename U>
inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it reads an inactive union member.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
union tu_pun {
U u = U();
T t;
};
const tu_pun pun{x};
return pun.t;
}
constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;
const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
const bool is_little_endian = after_rshift == target;
float est_sqrt(const float x)
/* A fast approximation of sqrt(x) that works less well for subnormal numbers.
*/
{
static_assert( std::numeric_limits<float>::is_iec559, "" );
assert(is_little_endian); // Could provide alternative big-endian code.
/* The algorithm relies on the bit representation of normal IEEE floats, so
* a subnormal number as input might be considered a domain error as well?
*/
if ( std::isless(x, 0.0F) || !std::isfinite(x) )
return std::numeric_limits<float>::signaling_NaN();
constexpr uint32_t magic_number = 0x1fbb4000UL;
const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
return reinterpret<float,uint32_t>(rejiggered_bits);
}
int main(void)
{
static const std::vector<float> test_values{
4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };
for ( const float& x : test_values ) {
const double gold_standard = sqrt((double)x);
const double estimate = est_sqrt(x);
const double error = estimate - gold_standard;
cout << "The error for (" << estimate << " - " << gold_standard << ") is "
<< error;
if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
const double error_pct = error/gold_standard * 100.0;
cout << " (" << error_pct << "%).";
} else
cout << '.';
cout << endl;
}
return EXIT_SUCCESS;
}
Update
Here is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it’s allowed by standard, and call the function as extern "C". I think type-punning is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy(). I also don’t think you gain much, because you still could have undefined behavior from a hypothetical trap representation. Also, clang++ 3.9.1 -O -S is able to statically analyze the type-punning version, optimize the variable is_little_endian to the constant 0x1, and eliminate the run-time test, but it can only optimize this version down to a single-instruction stub.
But more importantly, this code isn’t guaranteed to work portably on every compiler. For example, some old computers can’t even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is just suddenly going to break a huge amount of legacy code for no reason. Although the standard technically gives permission to do that and still say it conforms to C++14, it will only happen on an architecture very different from what we expect. And if our assumptions are so invalid that some compiler is going to turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I really doubt the logic behind this code will hold up if we just use memcpy() instead. We want that code to fail at compile time, and to tell us why.
#include <cassert>
#include <cstdint>
#include <cstring>
using std::memcpy;
using std::uint32_t;
template <typename T, typename U> inline T reinterpret(const U &x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it modifies a variable.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
T temp;
memcpy( &temp, &x, sizeof(T) );
return temp;
}
constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;
const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
extern const bool is_little_endian = after_rshift == target;
However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:
#include <cassert>
template <typename T, typename U> inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it uses reinterpret_cast.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
const U temp alignas(T) alignas(U) = x;
return *reinterpret_cast<const T*>(&temp);
}
The compilers I tested can also optimize this away to a folded constant. Stroustrup’s reasoning is [sic]:
Accessing the result of an reinterpret_cast to a different type from the objects declared type is still undefined behavior, but at least we can see that something tricky is going on.
Update
From the comments: C++20 introduces std::bit_cast, which converts an object representation to a different type with unspecified, not undefined, behavior. This doesn’t guarantee that your implementation will use the same format of float and int that this code expects, but it doesn’t give the compiler carte blanche to break your program arbitrarily because there’s technically undefined behavior in one line of it. It can also give you a constexpr conversion.
Let y = sqrt(x),
it follows from the properties of logarithms that log2(y) = 0.5 * log2(x) (1)
Interpreting a normal float as an integer gives, approximately, INT(x) = Ix = L * (log2(x) + B - σ) (2)
where L = 2^N, N is the number of stored bits of the significand (23 for float), B is the exponent bias (127 for float), and σ is a free factor to tune the approximation.
Combining (1) and (2) gives: Iy = 0.5 * (Ix + (L * (B - σ)))
Which is written in the code as (*(int*)&x >> 1) + 0x1fbb4000;
Find the σ so that the constant equals 0x1fbb4000 and determine whether it's optimal.
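The first half of that exercise is plain arithmetic: the constant equals 0.5 * L * (B - σ), so σ = B - constant / (L / 2). The sketch below carries it out for float (L = 2^23, B = 127); the name sigma_from_magic is made up, and the sketch says nothing about whether the resulting σ is optimal.

```cpp
#include <cstdint>

// Hypothetical helper: solves magic = 0.5 * L * (B - sigma) for sigma,
// with L = 2^23 and B = 127 (IEEE-754 binary32).
constexpr double sigma_from_magic(std::uint32_t magic) {
    return 127.0 - static_cast<double>(magic) / 4194304.0;  // 4194304 = 2^22 = L / 2
}
```

For 0x1fbb4000 this gives 127 - 126.92578125 = 0.07421875, that is, σ = 19/256.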
Adding a wiki test harness to test all floats.
The approximation is within 4% for many floats, but very poor for subnormal numbers. YMMV
Worst:1.401298e-45 211749.20%
Average:0.63%
Worst:1.262738e-38 3.52%
Average:0.02%
Note that with argument of +/-0.0, the result is not zero.
printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0)); // 0.000000e+00 7.930346e-20
printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0)); // -0.000000e+00 -2.698557e+19
Test code
#include <float.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
float sqrt_apx(float f) {
const int result = 0x1fbb4000 + (*(int*) &f >> 1);
return *(float*) &result;
}
double error_value = 0.0;
double error_worst = 0.0;
double error_sum = 0.0;
unsigned long error_count = 0;
void sqrt_test(float f) {
if (f == 0) return;
volatile float y0 = sqrtf(f);
volatile float y1 = sqrt_apx(f);
double error = (1.0 * y1 - y0) / y0;
error = fabs(error);
if (error > error_worst) {
error_worst = error;
error_value = f;
}
error_sum += error;
error_count++;
}
void sqrt_tests(float f0, float f1) {
error_value = error_worst = error_sum = 0.0;
error_count = 0;
for (;;) {
sqrt_test(f0);
if (f0 == f1) break;
f0 = nextafterf(f0, f1);
}
printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
printf("Average:%.2f%%\n", error_sum / error_count); /* unlike "Worst", not scaled by 100: this prints a fraction */
fflush(stdout);
}
int main() {
sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
sqrt_tests(FLT_MIN, FLT_MAX);
return 0;
}

Rounding error reduction?

Consider the following functions:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>
template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
return (std::sqrt(std::abs(2*b1-z0))*dx)+a0;
}
template <typename Type>
inline Type b(const Type dx, const Type a0, const Type z0, const Type a1)
{
return (std::pow((a1-a0)/dx, 2)+ z0)/2;
}
int main(int argc, char* argv[])
{
double dx = 1.E-6;
double a0 = 1;
double a1 = 2;
double z0 = -1.E7;
double b1 = -10;
std::cout<<std::scientific;
std::cout<<std::setprecision(std::numeric_limits<double>::digits10);
std::cout<<a1-a(dx, a0, z0, b(dx, a0, z0, a1))<<std::endl;
std::cout<<b1-b(dx, a0, z0, a(dx, a0, z0, b1))<<std::endl;
return 0;
}
On my machine, it returns:
0.000000000000000e+00
-1.806765794754028e-07
Instead of (0, 0). There is a large rounding error with the second expression.
My question is: how can I reduce the rounding error of each function without changing the type? I need to keep these two function declarations, but the formulas can be rearranged; they come from a larger program.
Sadly, all of the floating point types are notorious for rounding error. They can't even store 0.1 without it (you can prove this using long division by hand: the binary equivalent is 0b0.0001100110011001100...). You might try some workarounds like expanding that pow to a hard-coded multiplication, but you'll ultimately need to code your program to anticipate and minimize the effects of rounding error. Here are a couple ideas:
Never compare floating point values for equality. Some alternative comparisons I have seen include: abs(a-b) < delta, or percent_difference (a,b) < delta or even abs(a/b-1) < delta, where delta is a "suitably small" value you have determined works for this specific test.
Avoid adding long arrays of numbers into an accumulator; the end of the array may be completely lost to rounding error as the accumulator grows large. In "Cuda by Example" by Jason Sanders and Edward Kandrot, the authors recommend recursively adding each pair of elements individually so that each step produces an array half the size of the previous step, until you get a one-element array.
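The pairwise idea from the second point can be sketched on the CPU as a recursive split. This is a minimal illustration, not the book's CUDA code, and the name pairwise_sum is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums n values by splitting the range in half and
// adding the two partial sums, so operands of similar size meet each other.
template <typename T>
T pairwise_sum(const T* data, std::size_t n) {
    if (n == 0) return T(0);
    if (n == 1) return data[0];
    const std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}

// Convenience overload for vectors.
template <typename T>
T pairwise_sum(const std::vector<T>& v) {
    return pairwise_sum(v.data(), v.size());
}
```

For the float values {16777216, 1, 1, 1}, a left-to-right accumulator stays at 16777216 because each single 1 is rounded away, whereas the pairwise version first forms 1 + 1 = 2 and keeps it. A production version would usually switch to a plain loop below some block size to limit recursion overhead.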
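In a(), you lose precision when you add a0 (which is exactly 1) to the small result of sqrt()*dx (about 0.00316 with the supplied values): the sum is rounded to the spacing of doubles near 1, about 2.2e-16, so the small term loses its lowest digits.
The function b() doesn't lose any precision using the supplied values.
When you call a() before b() as in the second output, you're doing mathematical operations on a number that's already imprecise, compounding the error: b() subtracts a0 again, divides by dx = 1e-6 and squares, which scales an absolute error of about 1e-16 up to the order of 1e-7.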
Try to structure the mathematical operations so you do operations that are less likely to create floating point errors first and those more likely to create floating point errors last.
Or, inside your functions, make sure they are operating on "long double" values. For example, the following casts one operand to long double, so that the usual arithmetic conversions carry out the rest of the expression in long double (pay attention to operator precedence). This only helps on platforms where long double is wider than double:
template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
return (std::sqrt(std::abs(2*static_cast<long double>(b1)-z0))*dx)+a0;
}