Numerical accuracy of pow(a/b,x) vs pow(b/a,-x) - c++

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power, or a number greater than 1 to a negative power, produce a more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.

Here's one way to answer such questions: examine how floating-point actually behaves. This is not a 100% rigorous way to analyze such a question, but it gives a general idea.
Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision, and calculate ref=pow(a/b, n) in double precision, rounded to float. We use ref as the reference value (we suppose that double has much more precision than float, so we can trust that ref is the best value we can get; this is true for IEEE-754 most of the time). Then sum the differences v0-ref and v1-ref, where each difference is measured as the number of representable floating-point numbers between the two values.
Note that the results may depend on the ranges of a, b and n (and on the quality of the random generator; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that float and double division/pow use the same kind of algorithm and have the same characteristics.
For my computer, the summed differences are 2604828 and 2603684, which means there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>   // rand, RAND_MAX
#include <string.h>

// Distance between two floats, measured in representable values (ULPs).
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f*rand()/RAND_MAX;
        float b = 1.0f*rand()/RAND_MAX;
        float n = 4.0f*rand()/RAND_MAX - 2.0f;
        if (a==0||b==0) continue;
        float v0 = std::pow(a/b, n);
        float v1 = std::pow(b/a, -n);
        float ref = std::pow((double)a/b, n);  // double-precision reference, rounded to float
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}

... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y · log2(x)).
Roughly: the error in y · log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger |log2(x)| is, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p), since the division by 4 (a power of two) is exact while 4/7777777 must be rounded.
Lacking assurance about the error in the division, the general rule applies: no major difference.
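For example, a quick experiment comparing both forms against a long double reference (illustrative only; the exact digits depend on the libm in use, and the hedge above still applies):
#include <cmath>
#include <cstdio>

int main() {
    double p = 1.51231;
    // 7777777/4 is exact in double (4 is a power of two); 4/7777777 is rounded.
    double v0 = std::pow(7777777.0 / 4, -p);
    double v1 = std::pow(4.0 / 7777777.0, p);
    // Reference computed in long double (64-bit significand on x86_64/gcc).
    long double ref = std::pow(7777777.0L / 4, -(long double)p);
    std::printf("v0  = %.20g\n", v0);
    std::printf("v1  = %.20g\n", v1);
    std::printf("ref = %.20Lg\n", ref);
}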

In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))^x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))^−x•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x•(1+e0)^x•(1+e1) and (a/b)^x•(1+e2)^−x•(1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^−x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^−x. The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^−x, (1−u/2)^−x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/4)^x produces ((1−u^2/4)/(1−u/2))^x − ((1−u^2/4)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/4 < 1, so (1−u^2/4)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
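For a rough sense of scale, take IEEE-754 double (u = 2^−52) and x = 2 as a worked example:
Both interval lengths are approximately x•u = 2^−51 ≈ 4.4•10^−16 in relative terms.
The second exceeds the first by a factor of about 1/(1−u^2/4)^x − 1 ≈ x•u^2/4 ≈ 2^−105.
So the gap between the two intervals, while real, is many orders of magnitude below the final rounding error.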
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.

For evaluation of rounding errors like in your case, it might be useful to use a multi-precision library, such as Boost.Multiprecision. Then you can compare results for various precisions, as with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
FLOAT a = 8.72138221;
FLOAT b = 1.761329479;
FLOAT c = 1.51231;
FLOAT e = mp::pow(a / b, -c);
FLOAT f = mp::pow(b / a, c);
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}
int main() {
std::cout << "Double: " << std::endl;
comp<mp::cpp_bin_float_double>();
std::cout << std::endl;
std::cout << "Double extended: " << std::endl;
comp<mp::cpp_bin_float_double_extended>();
std::cout << std::endl;
std::cout << "Quad: " << std::endl;
comp<mp::cpp_bin_float_quad>();
std::cout << std::endl;
std::cout << "Dec-100: " << std::endl;
comp<mp::cpp_dec_float_100>();
std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot draw any general conclusions from this.
Also, note that your input numbers are not exactly representable in the IEEE 754 double-precision floating-point type (none of them are). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.
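To see that last point concretely, a small check (assuming IEEE 754 doubles) is to print the values actually stored for those literals with extra digits:
#include <cstdio>

int main() {
    // The doubles nearest to the decimal literals from the question.
    // Printing extra digits shows they differ from the literals themselves.
    double a = 8.72138221;
    double b = 1.761329479;
    double c = 1.51231;
    std::printf("a = %.25g\n", a);
    std::printf("b = %.25g\n", b);
    std::printf("c = %.25g\n", c);
}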

Related

Do there exist two numbers that, when multiplied (or divided) by each other, introduce error?

Here's the set of tests I'm doing to learn how the basic FP operations (+, -, *, /) introduce errors:
#include <iostream>
#include <math.h>
int main() {
    std::cout.precision(100);
    double a = 0.499999999999999944488848768742172978818416595458984375;
    double original = 47.9;
    double target = original * a;
    double back = target / a;
    std::cout << original << std::endl;
    std::cout << back << std::endl;
    std::cout << fabs(original - back) << std::endl; // it's always 0.0 for the tests I did
}
Can you show me two values (original and a) that, once multiplied (or divided), introduce an error due to FP math?
And if they exist, is it possible to establish whether that error is introduced by * or by /? And how? (Since you need both operations to come back to the original value; 80 bit?)
With + it's easy (just add 0.499999999999999944488848768742172978818416595458984375 to 0.5, and you get 1.0, the same as for 0.5 + 0.5).
But I'm not able to do the same with * or /.
The output of:
#include <cstdio>

int main(void)
{
    double a = 1000000000000.;
    double b = 1000000000000.;
    std::printf("a = %.99g.\n", a);
    std::printf("b = %.99g.\n", b);
    std::printf("a*b = %.99g.\n", a*b);
}
is:
a = 1000000000000.
b = 1000000000000.
a*b = 999999999999999983222784.
assuming IEEE-754 basic 64-bit binary floating-point with correct rounding to nearest, ties to even.
Obviously, 999999999999999983222784 differs from the exact mathematical result of 1000000000000•1000000000000, 1000000000000000000000000.
Multiply any two large† numbers, and there is likely going to be error because representable values have great distances in the high range of values.
While this error can be great in absolute terms, it is still small in relation to the size of the number itself, so if you perform the reverse division, the error of the first operation is scaled down in the same ratio, and disappears completely. As such, this sequence of operations is stable.
If the result of the multiplication would be greater than the maximum representable value, then it would overflow to infinity (this may depend on configuration), in which case the reverse division won't recover the original value but will remain infinity.
Similarly, if you divide by a very large number, you may underflow below the smallest representable value, resulting in either zero or a subnormal value.
† Numbers do not necessarily have to be huge. It's just easier to perceive the issue when considering huge values. The problem applies to quite small values as well. For example:
2.100000000000000088817841970012523233890533447265625 ×
2.100000000000000088817841970012523233890533447265625
Correct result:
4.410000000000000373034936274052605470949292688633679117285...
Example floating point result:
4.410000000000000142108547152020037174224853515625
Error:
2.30926389122032568296724439173008679117285652827862296732064351090230047702789306640625
× 10^-16
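A small sketch of the overflow case described above (assuming IEEE-754 doubles with the default round-to-nearest):
#include <cstdio>

int main() {
    double a = 1e200;
    double b = 1e200;
    double product = a * b;    // 1e400 exceeds the double range -> +infinity
    double back = product / b; // infinity / 1e200 is still infinity, not 1e200
    std::printf("product = %g\n", product);
    std::printf("back    = %g\n", back);
}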
Do there exist two numbers that, when multiplied (or divided) by each other, introduce error?
This is much easier to see with "%a".
When the precision of the result is insufficient, rounding occurs. Typically double has 53 bits of binary precision. Multiplying the two 27-bit numbers below results in an exact 53-bit product, but the two 28-bit ones cannot form an exact 55-bit product.
Division is easy to demo, just try 1.0/n*n.
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1 + 1.0/pow(2,26);
    printf("%.15a, %.17e\n", a, a);
    printf("%.15a, %.17e\n", a*a, a*a);
    double b = 1 + 1.0/pow(2,27);
    printf("%.15a, %.17e\n", b, b);
    printf("%.15a, %.17e\n", b*b, b*b);
    for (int n = 47; n < 52; n += 2) {
        volatile double frac = 1.0/n;
        printf("%.15a, %.17e %d\n", frac, frac, n);
        printf("%.15a, %.17e\n", frac*n, frac*n);
    }
    return 0;
}
Output
//v-------v 27 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
//v-------------v 53 significant bits.
0x1.000000800000100p+0, 1.00000002980232261e+00
//v-------v 28 significant bits.
0x1.000000200000000p+0, 1.00000000745058060e+00
//v--------------v not 55 significant bits.
0x1.000000400000000p+0, 1.00000001490116119e+00
// ^^^ all zeros here, not the expected mathematical answer.
0x1.5c9882b93105700p-6, 2.12765957446808505e-02 47
0x1.000000000000000p+0, 1.00000000000000000e+00
0x1.4e5e0a72f053900p-6, 2.04081632653061208e-02 49
0x1.fffffffffffff00p-1, 9.99999999999999889e-01 <==== Not 1.0
0x1.414141414141400p-6, 1.96078431372549017e-02 51
0x1.000000000000000p+0, 1.00000000000000000e+00

How many floats can be added together before floating point precision becomes an issue

I am currently recording some frame times in MS instead of ticks. I know this can be an issue as we are adding all the frame times (in MS) together and then dividing by the number of frames. This could cause bad results due to floating point precision.
It would make more sense to add all the tick counts together then convert to MS once at the end.
However, I am wondering what the actual difference would be for a small number of samples. I expect to have between 900 and 1800 samples. Would this be an issue at all?
I have made this small example and run it on GCC 4.9.2:
// Example program
#include <iostream>
#include <string>
#include <cstdlib>  // rand, RAND_MAX

int main()
{
    float total = 0.0f;
    double total2 = 0.0;
    for (int i = 0; i < 1000000; ++i)
    {
        float r = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        total += r;
        total2 += r;
    }
    std::cout << "Total: " << total << std::endl;
    std::cout << "Total2: " << total2 << std::endl;
}
Result:
Total: 500004 Total2: 500007
So as far as I can tell with 1 million values we do not lose a lot of precision. Though I am not sure if what I have written is a reasonable test or actually testing what I want to test.
So my question is, how many floats can I add together before precision becomes an issue? I expect my values to be between 1 and 60 MS. I would like the end precision to be within 1 millisecond. I have 900-1800 values.
Example Value: 15.1345f for 15 milliseconds.
Counterexample
Using the assumptions below about the statement of the problem (times are effectively given as values such as .06 for 60 milliseconds), if we convert .06 to float and add it 1800 times, the computed result is 107.99884796142578125. This differs from the mathematical result, 108.000, by more than .001. Therefore, the computed result will sometimes differ from the mathematical result by more than 1 millisecond, so the goal desired in the question is not achievable in these conditions. (Further refinement of the problem statement and alternate means of computation may be able to achieve the goal.)
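That value can be reproduced with a few lines (assuming IEEE-754 binary32 float with round-to-nearest):
#include <cstdio>

int main() {
    float sum = 0.0f;
    for (int i = 0; i < 1800; ++i)
        sum += 0.06f;  // .06 is not exactly representable in float
    // Expected to print roughly 107.99884796142578, i.e. more than .001 below 108.
    std::printf("%.17g\n", (double)sum);
}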
Original Analysis
Suppose we have 1800 integer values in [1, 60] that are converted to float using float y = x / 1000.f;, where all operations are implemented using IEEE-754 basic 32-bit binary floating-point with correct rounding.
The conversions of 1 to 60 to float are exact. The division by 1000 has an error of at most ½ ULP(.06), which is ½ • 2^−5 • 2^−23 = 2^−29. 1800 such errors amount to at most 1800 • 2^−29.
As the resulting float values are added, there may be an error of at most ½ ULP in each addition, where the ULP is that of the current result. For a loose analysis, we can bound this with the ULP of the final result, which is at most around 1800 • .06 = 108, which has an ULP of 2^6 • 2^−23 = 2^−17. So each of the 1799 additions has an error of at most 2^−18, so the total error in the additions is at most 1799 • 2^−18.
Thus, the total error during divisions and additions is at most 1800 • 2^−29 + 1799 • 2^−18, which is about .006866.
That is a problem. I expect a better analysis of the errors in the additions would halve the error bound, as it is an arithmetic progression from 0 to the total, but that still leaves a potential error above .003, which means there is a possibility the sum could be off by several milliseconds.
Note that if the times are added as integers, the largest potential sum is 1800•60 = 108,000, which is well below the first integer not representable in float (16,777,217). Addition of these integers in float would be error-free.
This bound of .003 is small enough that some additional constraints on the problem and some additional analysis might, just might, push it below .0005, in which case the computed result will always be close enough to the correct mathematical result that rounding the computed result to the nearest millisecond would produce the correct answer.
For example, if it were known that, while the times range from 1 to 60 milliseconds, the total is always less than 7.8 seconds, that could suffice.
As much as possible, reduce the errors caused by floating point calculations
Since you've already described measuring your individual timings in milliseconds, it's far better if you accumulate those timings using integer values before you finally divide them:
std::chrono::milliseconds duration{};
for (Timing const& timing : timings) {
    // Lossless integer accumulation, in a scenario where overflow is extremely unlikely
    // or possibly even impossible for your problem domain
    duration += std::chrono::milliseconds(timing.getTicks());
}
// Only one floating-point calculation performed, so its error is minimal
float averageTiming = duration.count() / float(timings.size());
The errors that accumulate are highly particular to the scenario
Consider these two ways of accumulating values:
#include<iostream>
int main() {
//Make them volatile to prevent compilers from optimizing away the additions
volatile float sum1 = 0, sum2 = 0;
for(float i = 0.0001; i < 1000; i += 0.0001) {
sum1 += i;
}
for(float i = 1000; i > 0; i -= 0.0001) {
sum2 += i;
}
std::cout << "Sum1: " << sum1 << std::endl;
std::cout << "Sum2: " << sum2 << std::endl;
std::cout << "% Difference: " << (sum2 - sum1) / (sum1 > sum2 ? sum1 : sum2) * 100 << "%" << std::endl;
return 0;
}
Results may vary on some machines (particularly machines that don't have IEEE754 floats), but in my tests, the second value was 3% different than the first value, a difference of 13 million. That can be pretty significant.
Like before, the best option is to minimize the number of calculations performed using floating point values until the last possible step before you need them as floating point values. That will minimize accuracy losses.
Just for what it's worth, here's some code to demonstrate that yes, after 1800 items, a simple accumulation can be incorrect by more than 1 millisecond, but Kahan summation maintains the required level of accuracy.
#include <iostream>
#include <iterator>
#include <iomanip>
#include <vector>
#include <numeric>
template <class InIt>
typename std::iterator_traits<InIt>::value_type accumulate(InIt begin, InIt end)
{
    // Kahan (compensated) summation: track the low-order bits lost in each addition.
    typedef typename std::iterator_traits<InIt>::value_type real;
    real sum = real();
    real running_error = real();
    for (; begin != end; ++begin)
    {
        real difference = *begin - running_error;  // apply the previous compensation
        real temp = sum + difference;              // add; low-order bits may be lost here
        running_error = (temp - sum) - difference; // recover exactly what was lost
        sum = temp;
    }
    return sum;
}
int main()
{
const float addend = 0.06f;
const float count = 1800.0f;
std::vector<float> d;
std::fill_n(std::back_inserter(d), count, addend);
float result = std::accumulate(d.begin(), d.end(), 0.0f);
float result2 = accumulate(d.begin(), d.end());
float reference = count * addend;
std::cout << " simple: " << std::setprecision(20) << result << "\n";
std::cout << " Kahan: " << std::setprecision(20) << result2 << "\n";
std::cout << "Reference: " << std::setprecision(20) << reference << "\n";
}
For this particular test, it appears that double precision is sufficient, at least for the input values I tried--but to be honest, I'm still a bit leery of it, especially when exhaustive testing isn't reasonable, and better techniques are easily available.

Rounding error detection

I have two integers n and d. These can be exactly represented by double dn(n) and double dd(d). Is there a reliable way in C++ to check if
double result = dn/dd
contains a rounding error? If it were just an integer division, checking if (n/d) * d == n would work, but doing that with double-precision arithmetic could hide rounding errors.
Edit: Shortly after posting this it struck me that changing the rounding mode to round_down would make the (n/d)*d==n test work for double. But if there is a simpler solution, I'd still like to hear it.
If a hardware FMA is available, then, in most cases (cases where n is expected not to be small, per below), the fastest test may be:
#include <cmath>
…
double q = dn/dd;
if (std::fma(-q, dd, dn))
std::cout << "Quotient was not exact.\n";
This can fail if dn−q•dd is so small it is rounded to zero, which occurs in round-to-nearest-ties-to-even mode if its magnitude is smaller than half the smallest representable positive value (commonly 2^−1074). That can happen only if dn itself is small. I expect I could calculate some bound on dn for that if desired, and, given that dn = n and n is an integer, that should not occur.
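A self-contained sketch of this test (the wrapper name and sample inputs are only illustrative):
#include <cmath>
#include <cstdio>

static bool division_is_exact(int n, int d) {
    double dn = n, dd = d;
    double q = dn / dd;
    // dn - q*dd is computed with a single rounding; it is zero exactly when q
    // was the exact quotient (barring the underflow case discussed above).
    return std::fma(-q, dd, dn) == 0.0;
}

int main() {
    std::printf("3/6 exact: %d\n", division_is_exact(3, 6)); // expected: 1
    std::printf("3/7 exact: %d\n", division_is_exact(3, 7)); // expected: 0
}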
Ignoring the exponent bounds, a way to test the significands for divisibility is:
#include <cfloat>
#include <cmath>
…
int sink; // Needed for frexp argument but will be ignored.
double fn = std::ldexp(std::frexp(n, &sink), DBL_MANT_DIG);
double fd = std::frexp(d, &sink);
if (std::fmod(fn, fd))
std::cout << "Quotient will not be exact.\n";
Given that n and d are integers that are exactly representable in the floating-point type, I think we could show their exponents cannot be such that the above test would fail. There are cases where n is a small integer and d is large (a value from 2^1023 to 2^1024−2^972, inclusive) that I need to think about.
If you ignore overflow and underflow (which you should be able to do unless the integer types representing d and n are very wide), then the (binary) floating-point division dn/dd is exact iff d is a divisor of n times a power of two.
An algorithm to check for this may look like:
assert(d != 0);
while ((d & 1) == 0) d >>= 1; // extract largest odd divisor of d (parentheses needed: == binds tighter than &)
int exact = n % d == 0;
This is cheaper than changing the FPU rounding mode if you want the rounding mode to be “to nearest” the rest of the time, and there probably exist bit-twiddling tricks that can speed up the extraction of the largest odd divisor of d.
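For example, with C++20's &lt;bit&gt; header, the odd-divisor extraction collapses to a single shift (a sketch assuming unsigned operands):
#include <bit>      // std::countr_zero (C++20)
#include <cassert>

bool quotient_is_exact(unsigned n, unsigned d) {
    assert(d != 0);
    d >>= std::countr_zero(d);  // strip trailing zero bits: largest odd divisor of d
    return n % d == 0;          // exact iff that odd divisor divides n
}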
Is there a reliable way in C++ to check if double result = dn/dd contains a rounding error?
Should your system allow access to the various FP flags, test for FE_INEXACT after the division.
If FP code is expensive, then at least this code can be used to check integer-only solutions.
A C solution follows (I do not have access to a compliant C++ compiler to test right now):
#include <fenv.h>
#include <stdio.h>

// Return 0: no rounding error
// Return 1: rounding error
// Return -1: uncertain
#pragma STDC FENV_ACCESS ON
int Rounding_error_detection(int n, int d) {
    double dn = n;
    double dd = d;
    if (feclearexcept(FE_INEXACT)) return -1;
    volatile double result = dn/dd;
    (void) result;
    int set_excepts = fetestexcept(FE_INEXACT);
    return set_excepts != 0;
}
Test code
void Rounding_error_detection_Test(int n, int d) {
    printf("Rounding_error_detection(%d, %d) --> %d\n",
           n, d, Rounding_error_detection(n, d));
}

int main(void) {
    Rounding_error_detection_Test(3, 6);
    Rounding_error_detection_Test(3, 7);
}
Output
Rounding_error_detection(3, 6) --> 0
Rounding_error_detection(3, 7) --> 1
If the quotient q=dn/dd is exact, it will divide dn exactly dd times.
Since you have dd being integer, you could test exactness with integer division.
Instead of testing the quotient multiplied by dd with (dn/dd)*dd==dn where round off errors can compensate, you should rather test the remainder.
Indeed, std::remainder is always exact:
if (std::remainder(dn, dn/dd) != 0)
std::cout << "Quotient was not exact." << std::endl;

How to remove last significant digits/mantissa bits for floating numbers in C++

I would like to remove the last 2 or 3 significant digits of a floating-point number in C++ in an efficient way. To formulate the question more accurately: I would like to discard the last few mantissa bits of the floating-point representation.
Some background: I can arrive at the same floating-point number in different ways. For example, if I use bilinear interpolation over a rectangle with equal values in its corners, the results will vary in the last couple of digits at different points of the rectangle due to machine accuracy limits. The absolute magnitude of these deviations depends on the magnitude of the interpolated values. For example, if p[i]~1e10 (i == 1..4, the values at the corners of the rectangle), then the interpolation error caused by machine accuracy is ~1e-4 (for 8-byte floats). If p[i]~1e-10 then the error would be ~1e-24. As the interpolated values are used to calculate 1st- or 2nd-order derivatives, I need to 'smooth out' these differences. One idea is to remove the last couple of digits from the final result. Below is my take on it:
#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <math.h> // fabs
template<typename real>
real remove_digits(const real& value, const int num_digits)
{
//return value;
if (value == 0.) return 0;
const real corrector_power =
(std::numeric_limits<real>::digits10 - num_digits)
- round(log10(fabs(value)));
//the value is too close to the limits of the double minimum value
//to be corrected -- return 0 instead.
if (corrector_power > std::numeric_limits<real>::max_exponent10 -
num_digits - 1)
{
return 0.;
}
const real corrector = pow(10, corrector_power);
const real result =
(value > 0) ? floor(value * corrector + 0.5) / corrector :
ceil(value * corrector - 0.5) / corrector;
return result;
}//remove_digits
int main() {
// g++ (Debian 4.7.2-5) 4.7.2 --- Debian GNU/Linux 7.8 (wheezy) 64-bit
std::cout << remove_digits<float>(12345.1234, 1) << std::endl; // 12345.1
std::cout << remove_digits<float>(12345.1234, 2) << std::endl; // 12345
std::cout << remove_digits<float>(12345.1234, 3) << std::endl; // 12350
std::cout << std::numeric_limits<float>::digits10 << std::endl; // 6
}
It works, but it uses two expensive operations, log10 and pow. Is there some smarter way of doing this? As follows from the above, to achieve my goal I do not need to remove actual decimal digits; I just need to set the last 3-4 bits of the mantissa representation of the floating-point number to 0.
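A minimal sketch of that mantissa-masking idea, assuming an IEEE-754 64-bit double (note that it truncates rather than rounds):
#include <cstdint>
#include <cstring>

double clear_low_mantissa_bits(double value, int num_bits)  // num_bits small, e.g. 3 or 4
{
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);            // safe type punning
    bits &= ~((std::uint64_t(1) << num_bits) - 1);      // zero the trailing significand bits
    double result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}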
"Some background: I can arrive to the same floating point number using different ways."
That's not a problem. The "local" resolution near x is approximately x − std::nextafter(x, 0.0), which over a wide range is linear in x. Your problem with p[i] varying by 20 orders of magnitude doesn't matter, as the error varies linearly with p[i], which implies it also varies linearly with that local resolution.
More generally, if you want to compare whether a and b are similar enough, just test whether a/b is approximately 1.00000000. (You have bigger problems when b is exactly zero; there's no meaningful way to say whether 1E-10 is "almost equal" to 0.)

When a float variable goes out of the float limits, what happens?

I noticed two things:
std::numeric_limits<float>::max()+(a small number) gives: std::numeric_limits<float>::max().
std::numeric_limits<float>::max()+(a large number like: std::numeric_limits<float>::max()/3) gives inf.
Why this difference? Do 1 and 2 result in an overflow, and thus in undefined behavior?
Edit: Code for testing this:
1.
float d = std::numeric_limits<float>::max();
float q = d + 100;
cout << "q: " << q << endl;
2.
float d = std::numeric_limits<float>::max();
float q = d + (d/3);
cout << "q: " << q << endl;
Formally, the behavior is undefined. On a machine with IEEE floating point, however, overflow after rounding will result in Inf. The precision is limited, however, and the result of FLT_MAX + 1 after rounding is FLT_MAX.
You can see the same effect with values well under FLT_MAX. Try something like:
float f1 = 1e20; // less than FLT_MAX
float f2 = f1 + 1.0;
if ( f1 == f2 ) ...
The if will evaluate to true, at least with IEEE arithmetic. (There do exist, or at least have existed, machines where float has enough precision for the if to evaluate to false, but they aren't very common today.)
It depends on what you are doing. If the float "overflow" comes in an expression which is directly returned, i.e.
return std::numeric_limits<float>::max() + std::numeric_limits<float>::max();
the operation might not result in an overflow. I cite from the C standard [ISO/IEC 9899:2011]:
The return statement is not an assignment. The overlap restriction of subclause 6.5.16.1 does not apply to the case of function return. The representation of floating-point values may have wider range or precision than implied by the type; a cast may be used to remove this extra range and precision.
See here for more details.