Could someone please help me understand what this function is doing if the input is a complex number a+bi and real = a, imag = b?
I have no idea what it could be computing but maybe I am missing something obvious?
double function(double real, double imag)
{
    double y;
    double a;
    double b;
    a = fabs(real);
    b = fabs(imag);
    if (a < b)
    {
        a /= b;
        y = b * sqrt(a * a + 1.0);
    }
    else if (a > b)
    {
        b /= a;
        y = a * sqrt(b * b + 1.0);
    }
    else if (b == NAN)
    {
        y = b;
    }
    else
    {
        y = a * sqrt(2);
    }
    return y;
}

The code is a defective attempt to compute the magnitude (absolute value) of the complex number passed to it without incurring needless overflow.
Consider the complex number a + b i, where a and b are the values assigned to a and b in the first few lines of the function. Its magnitude is √(a² + b²). However, if a or b is large, the floating-point calculation a*a might overflow the finite range of the floating-point format and produce infinity (∞) even though the magnitude is within the range. As a simple example, let a be 2^1000 and b be 0. Then the magnitude is √(2^2000 + 0) = 2^1000, but computing sqrt(a*a + b*b) yields infinity, since a*a overflows and produces ∞, and the rest of the calculation then produces ∞ too.
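To see the overflow concretely, here is a minimal C++ sketch. The helper name naive_magnitude is made up for illustration, and the behaviour described assumes IEEE-754 doubles whose intermediate results are not kept in a wider format.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the textbook formula, with no protection against
// overflow in the intermediate squares.
double naive_magnitude(double a, double b) {
    return std::sqrt(a * a + b * b);
}
```

With a = std::ldexp(1.0, 1000), which is exactly 2^1000, and b = 0, naive_magnitude(a, b) comes out as infinity, whereas the standard library's std::hypot(a, b) returns a.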
The code attempts to solve that by dividing the smaller of a and b by the larger and using a calculation that is mathematically equivalent but that does not overflow. For example, if a < b is true, it executes:
a /= b;
y = b * sqrt(a * a + 1.0);
Here a /= b produces a value less than 1, so all calculations prior to the last are safely within the floating-point finite range: a * a is less than 1, a * a + 1.0 is less than 2, and sqrt(a * a + 1.0) is less than 1.42. When we multiply by b, the final result might overflow to ∞. There are two reasons this might happen. The magnitude of a + b i might exceed the floating-point finite range, in which case the overflow is correct. Or rounding in the prior calculations might have made sqrt(a * a + 1.0) slightly larger than the mathematical result, enough to make b * sqrt(a * a + 1.0) overflow when the actual value of the magnitude is within the range. As this is not our focus, I will not analyze this case further in this answer.
Aside from that rounding issue, the first two cases are fine. However, this code is incorrect:
else if (b == NAN)
Per IEEE-754 2008 5.11 and IEEE-754 1985 5.7, a NaN is not less than, equal to, or greater than any operand, including itself. It is unordered. This means b == NAN must return false if IEEE-754 is used. C 2018 does not require IEEE-754 be used, but footnote 22 (at 4) says that, if IEC 60559:1989 (effectively IEEE-754) is not supported, the terms “quiet NaN” and “signaling NaN” in the C standard are intended to apply to encodings with similar behavior. And 7.12 5 tells us that NAN expands to a float representing a quiet NaN. Thus, in b == NAN, NAN should behave as an IEEE-754 NaN, and so b == NAN should yield 0, for false.
Therefore, this code governed by else if (b == NAN) is never executed:
y = b;
Instead, execution falls through to the else, which executes:
y = a * sqrt(2);
If a is a NaN, the result is a NaN, as desired. However, if a is a number and b is a NaN, this produces a number as a result when the desired result would be a NaN. Thus, the code is broken.
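One way the NaN handling could be repaired is sketched below. This is not the original author's code: the name magnitude is illustrative, and returning NaN whenever either part is NaN is simply the behaviour this answer calls "desired". The key change is testing with std::isnan instead of ==.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>

// Sketch of a corrected version; the name magnitude is illustrative.
double magnitude(double real, double imag) {
    double a = std::fabs(real);
    double b = std::fabs(imag);

    // NaN never compares equal to anything, so test with std::isnan
    // instead of ==. This sketch returns NaN whenever either part is NaN.
    if (std::isnan(a) || std::isnan(b)) {
        return std::numeric_limits<double>::quiet_NaN();
    }

    if (a < b) {
        std::swap(a, b);  // from here on, a >= b
    }
    if (a == 0.0 || std::isinf(a)) {
        return a;  // avoids 0/0 and inf/inf in the division below
    }

    double ratio = b / a;  // in [0, 1], so ratio * ratio cannot overflow
    return a * std::sqrt(ratio * ratio + 1.0);
}
```

In real code, std::hypot from <cmath> is the ready-made alternative.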


I have two vectors of double. The value of the double is between -1000 and 1000.
Both vectors contain the same numbers, but the order is different.
For example
Vector1 = {0.1, 0.2, 0.3, 0.4};
Vector2 = {0.4, 0.2, 0.1, 0.3};
Is there a guarantee that the sum of Vector1 will be exactly equal to the sum of Vector2, assuming the sum is done via:
double Sum = 0;
for (double Val : Vector) Sum += Val;
I am worried about double imprecisions.
No, there is no such guarantee in the C++ language.
In fact, with a typical floating point implementation the results can be unequal in practice, because floating point addition is not associative: each addition rounds, so a different order can produce a different rounded sum. (Compilers also offer unsafe floating point optimisations that may reorder the additions and so change whether the sums come out equal.)
The difference is likely to be very small with the given input, but it can be very large with other inputs.
No, they are not guaranteed to be the same. Here's a simple concrete example:
#include <stdio.h>
int main(void) {
    double x = 504.4883585687764;
    double y = 29.585946026264367;
    double z = 2.91427392498775;
    double lhs = x + (y + z);
    double rhs = z + (y + x);
    printf("LHS : %5.30g\n", lhs);
    printf("RHS : %5.30g\n", rhs);
    printf("Equal: %s\n", lhs == rhs ? "yes" : "no");
    return 0;
}
When run, this produces:
LHS : 536.988578520028568163979798555
RHS : 536.988578520028454477142076939
Equal: no
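If two containers holding the same values must produce bit-identical sums, one option (not taken from the answers above, just a sketch) is to put the values into a canonical order first, so that both sums perform the same additions in the same sequence. The name sorted_sum is hypothetical, and the sketch assumes the data contains no NaNs, since std::sort needs a consistent ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sums a sorted copy, so the order of the input
// no longer affects the rounding.
double sorted_sum(std::vector<double> values) {  // by value: sort a copy
    std::sort(values.begin(), values.end());
    double sum = 0.0;
    for (double v : values) {
        sum += v;
    }
    return sum;
}
```

This makes the result reproducible, not more accurate; the sum can still differ from the exact mathematical value.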
I am a circuit designer, not a software engineer, so I have no idea how to track down this problem.
I am working with some IIR filter code and I am having problems with extremely slow execution times when I process extremely small values through the filter. To find the problem, I wrote this test code.
Normally, the loop will run in about 200 ms or so. (I didn't measure it.) But when TestCheckBox->Checked, it requires about 7 seconds to run. The problem lies with the reduction in size of A, B, C and D within the loop, which is exactly what happens to the values in an IIR filter after its input goes to zero.
I believe the problem lies with the fact that the variable's exponent value becomes less than -308. A simple fix is to declare the variables as long doubles, but that isn't an easy fix in the actual code, and it doesn't seem like I should have to do this.
Any ideas why this happens and what a simple fix might be?
In case it matters, I am using C++ Builder XE3.
int j;
double A, B, C, D, E, F, G, H;
//long double A, B, C, D, E, F, G, H; // a fix
A = (double)random(100000000)/10000000.0 - 5.0;
B = (double)random(100000000)/10000000.0 - 5.0;
C = (double)random(100000000)/10000000.0 - 5.0;
D = (double)random(100000000)/10000000.0 - 5.0;
if (TestCheckBox->Checked)
{
    A *= 1.0E-300;
    B *= 1.0E-300;
    C *= 1.0E-300;
    D *= 1.0E-300;
}
for(j=0; j<=1000000; j++)
{
    A *= 0.9999;
    B *= 0.9999;
    C *= 0.9999;
    D *= 0.9999;
    E = A * B + C - D; // some exercise code
    F = A - C * B + D;
    G = A + B + C + D;
    H = A * C - B + G;
    E = A * B + C - D;
    F = A - C * B + D;
    G = A + B + C + D;
    H = A * C - B + G;
    E = A * B + C - D;
    F = A - C * B + D;
    G = A + B + C + D;
    H = A * C - B + G;
}
As the answers said, the cause of this problem is denormal math, something I had never heard of. Wikipedia has a pretty nice description of it as does the MSDN article given by Sneftel.
Having said this, I still can't get my code to flush denormals. The MSDN article says to do this:
_controlfp(_DN_FLUSH, _MCW_DN)
These definitions are not in the XE3 math libraries however, so I used
controlfp(0x01000000, 0x03000000)
per the article, but this is having no effect in XE3. Nor is the code suggested in the Wikipedia article.
Any suggestions?
You're running into denormal numbers (nonzero values smaller in magnitude than DBL_MIN, in which the most significant digit is treated as a zero). Denormals extend the range of the representable floating-point numbers, and are important to maintain certain useful error bounds in FP arithmetic, but on many processors operating on them is far slower than operating on normal FP numbers. They also have lower precision. So you should try to keep the magnitude of all your numbers (both intermediate and final quantities) greater than DBL_MIN.
In order to increase performance, you can force denormals to be flushed to zero by calling _controlfp(_DN_FLUSH, _MCW_DN) (or, depending on OS and compiler, a similar function). http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx
You've entered the realm of floating-point underflow, resulting in denormalized numbers - depending on the hardware you're likely trapping into software, which will be much much slower than hardware operations.
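If the compiler offers no working flush-to-zero switch, a portable fallback is to detect subnormals with std::fpclassify and to zero out tiny state values yourself. This is only a sketch: the names is_subnormal and flush_tiny are hypothetical, and the default threshold of 1e-30 is an assumption that must sit far below any signal level the filter cares about.

```cpp
#include <cassert>
#include <cmath>

// True when x is a subnormal (denormal) double.
bool is_subnormal(double x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Hypothetical helper: replaces values too small to matter with exactly 0,
// long before they can decay into the subnormal range.
double flush_tiny(double x, double threshold = 1e-30) {
    return std::fabs(x) < threshold ? 0.0 : x;
}
```

In the test loop from the question, something like A = flush_tiny(A * 0.9999); would stop A from ever reaching the slow range, at the cost of one comparison per update.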

Numerical stability of division expression

I stumbled across code like
double x,y = ...;
double n = sqrt(x*x+y*y);
if (n > 0)
{
    double d1 = (x*x)/n;
    double d2 = (x*y)/n;
}
and I am wondering about the numerical stability of such an expression for small values of x and y.
For both expressions, lim (x->0, y->0) (...) = 0, so from a mathematical point of view, it looks safe (the numerator is O(x²) whereas the denominator is O(x)).
Nevertheless my question is: Are there any possible numerical problems with this code?
EDIT: If possible I'd like to avoid re-writing the expressions because n is actually used more than twice and to keep readability (it's relatively clear in the context what happens).
If x and y are very close to DBL_MIN, the calculations are susceptible to underflow or extreme loss of precision: if x is very close to DBL_MIN, for example, x * x may be 0.0, or (for somewhat larger values) it may result in what is called gradual underflow, with extreme loss of precision. E.g. with IEEE double (most, if not all, desktop and laptop PCs), 1E-300 * 1E-300 will be 0.0. Obviously, if this happens for both x and y, you'll end up with n == 0.0, even if x and y are both positive.
In C++11, there is a function hypot, which will solve the problem for n; if x * x is 0.0, however, d1 will still be 0.0; you'll probably get better results with (x / n) * x (but I think that there still may be limit cases where you'll end up with 0.0 or gradual underflow; I've not analyzed it sufficiently to be sure). A better solution would be to scale the data differently, to avoid such limit cases.
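The two suggestions in this answer (hypot for n, and dividing before multiplying) can be combined roughly as follows. The struct and function names are hypothetical, and as the answer says, this is not a full analysis of every limit case.

```cpp
#include <cassert>
#include <cmath>

struct Products {
    double n;   // length of (x, y)
    double d1;  // x*x / n
    double d2;  // x*y / n
};

// Hypothetical helper: computes n with std::hypot and divides first, so no
// intermediate x*x or x*y product is formed on its own.
Products scaled_products(double x, double y) {
    Products p{std::hypot(x, y), 0.0, 0.0};
    if (p.n > 0) {
        double ux = x / p.n;  // magnitude at most about 1
        p.d1 = ux * x;
        p.d2 = ux * y;
    }
    return p;
}
```

For x = y = 1e-200 the original sqrt(x*x+y*y) evaluates to 0.0 with IEEE doubles, while this version yields a positive n and positive d1 and d2.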

Division by zero prevention: Checking the divisor's expression doesn't result in zero vs. checking the divisor isn't zero?

Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
if (y != 1.0)
z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflow is handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
This will prevent you from dividing by exactly zero; however, that does not mean you won't still end up with +/-inf as a result. The denominator could still be small enough that the answer is not representable as a double, and you will end up with an inf. For example:
#include <iostream>
#include <limits>
int main(int argc, char const *argv[])
{
    double small = std::numeric_limits<double>::epsilon();
    double large = std::numeric_limits<double>::max() / small;
    std::cout << "small: " << small << std::endl;
    std::cout << "large: " << large << std::endl;
    return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
There is no difference between the two code snippets. In fact, the optimizer could even optimize both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that division by a floating point zero 0.0 does not result in a run-time error under IEEE-754; it produces an inf or -inf instead (or a NaN for 0.0 / 0.0).
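Since a non-zero divisor still does not rule out an infinite quotient, a caller who needs a finite result may prefer to check the quotient itself. A minimal sketch, assuming IEEE-754 doubles; the function name checked_divide and its out-parameter interface are made up for illustration:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: stores x / (y - 1.0) in z and returns true only when
// the quotient is a finite number.
bool checked_divide(double x, double y, double& z) {
    double divisor = y - 1.0;
    if (divisor == 0.0) {
        return false;  // would divide by exactly zero
    }
    double quotient = x / divisor;
    if (!std::isfinite(quotient)) {
        return false;  // overflowed to +/-inf, or x was already inf or NaN
    }
    z = quotient;
    return true;
}
```

For example, with y = std::nextafter(1.0, 2.0) the divisor is tiny but non-zero, and a large x such as 1e308 makes the call report failure instead of silently producing inf.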

finding cube root in C++?

Strange things happen when I try to find the cube root of a number.
The following code returns an undefined result. In cmd: -1.#IND
cout<<pow(( double )(20.0*(-3.2) + 30.0),( double )1/3);
While this one works perfectly fine. In cmd: 4.93242414866094
cout<<pow(( double )(20.0*4.5 + 30.0),( double )1/3);
Mathematically it should work, since a negative number has a real cube root.
pow is from the Visual C++ 2010 math.h library. Any ideas?
pow(x, y) from <cmath> does NOT work if x is negative and y is non-integral.
This is a limitation of std::pow, as documented in the C standard and on cppreference:
Error handling
Errors are reported as specified in math_errhandling
If base is finite and negative and exp is finite and non-integer, a domain error occurs and a range error may occur.
If base is zero and exp is zero, a domain error may occur.
If base is zero and exp is negative, a domain error or a pole error may occur.
There are a couple ways around this limitation:
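Cube-rooting is the same as taking something to the 1/3 power, so for non-negative x you could do std::pow(x, 1/3.); for negative x, negate both the argument and the result, as in -std::pow(-x, 1/3.).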
In C++11, you can use std::cbrt. C++11 introduced both square-root and cube-root functions, but no generic n-th root function that overcomes the limitations of std::pow.
The power 1/3 is a special case. In general, non-integral powers of negative numbers are complex. It wouldn't be practical for pow to check for special cases like integer roots, and besides, 1/3 as a double is not exactly 1/3!
I don't know about the visual C++ pow, but my man page says under errors:
EDOM The argument x is negative and y is not an integral value. This would result in a complex number.
You'll have to use a more specialized cube root function if you want cube roots of negative numbers - or cut corners and take absolute value, then take cube root, then multiply the sign back on.
Note that depending on context, a negative number x to the 1/3 power is not necessarily the negative cube root you're expecting. It could just as easily be the first complex root, |x|^(1/3) * e^(pi*i/3). This is the convention Mathematica uses; it's also reasonable to just say it's undefined.
While (-1)^3 = -1, you can't simply take a rational power of a negative number and expect a real response. This is because there are other solutions to this rational exponent that are complex in nature.
Similarly, plot x^x. For x = -1/3, this should have a solution. However, this function is deemed undefined in R for x < 0.
Therefore, don't expect math.h to do magic that would make it inefficient, just change the signs yourself.
Guess you gotta take the negative out and put it in afterwards. You can have a wrapper do this for you if you really want to.
double yourPow(double x, double y)
{
    // Only meaningful for odd roots such as y = 1.0/3.0.
    if (x < 0)
        return -1.0 * pow(-1.0*x, y);
    return pow(x, y);
}
Don't cast to double by using (double), use a double numeric constant instead:
double thingToCubeRoot = -20.*3.2+30;
cout<< thingToCubeRoot/fabs(thingToCubeRoot) * pow( fabs(thingToCubeRoot), 1./3. );
Should do the trick!
Also: don't include <math.h> in C++ projects, but use <cmath> instead.
Alternatively, use pow from the <complex> header for the reasons stated by buddhabrot
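A short sketch of the <complex> route, to show what it actually returns. The function name principal_cube_root is hypothetical; std::pow on a std::complex<double> base yields the principal value, which for a negative real input is a complex number and not the negative real root.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical helper: principal complex cube root of a real number.
std::complex<double> principal_cube_root(double x) {
    return std::pow(std::complex<double>(x, 0.0), 1.0 / 3.0);
}
```

For x = -8 this gives approximately 1 + 1.732i, that is 2 * e^(pi*i/3), so it only helps if a complex result is what you want; for the real root -2, negate before and after or use std::cbrt.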
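pow( x, y ) is mathematically the same as (i.e. equivalent to) exp( y * log( x ) ) for positive x.
If log(x) is invalid, as it is for negative x, then pow(x,y) is also invalid (unless y is an integer).
By that formula 0 to any power would also be invalid, although mathematically it should be 0 for positive exponents; pow handles that case specially and does return 0 there.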
C++11 has the cbrt function (see for example http://en.cppreference.com/w/cpp/numeric/math/cbrt) so you can write something like
#include <iostream>
#include <cmath>
int main(int argc, char* argv[])
{
    const double arg = 20.0*(-3.2) + 30.0;
    std::cout << std::cbrt(arg) << "\n";
    std::cout << std::cbrt(-arg) << "\n";
    return 0;
}
I do not have access to the C++ standard so I do not know how the negative argument is handled... a test on ideone http://ideone.com/bFlXYs seems to confirm that C++ (gcc-4.8.1) extends the cube root with this rule cbrt(x)=-cbrt(-x) when x<0; for this extension you can see http://mathworld.wolfram.com/CubeRoot.html
I was looking for cube root and found this thread, and it occurs to me that the following code might work:
#include <cmath>
#include <stdexcept>
using namespace std;
double nth_root(double x, int n) {
    if (n % 2 == 0 && x < 0) {
        throw domain_error("even root of a negative number"); // even root from negative is fail
    }
    bool sign = (x >= 0);
    x = exp(log(fabs(x)) / n);
    return sign ? x : -x;
}
I think you should not confuse exponentiation with the nth-root of a number. See the good old Wikipedia
Maybe because 1/3 evaluates to 0 when both operands are treated as integers.
Try 1.0/3.0 instead.
That is what I think, but try it out.
And do not forget to declare any variables holding 1.0 and 3.0 as double.
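The integer-division point can be checked with a tiny sketch (function names are made up for illustration). Note that the cast in the question's ( double )1/3 applies to the 1 before the division takes place, so that expression is already a floating-point division.

```cpp
#include <cassert>

double third_from_ints()     { return 1 / 3; }          // integer division: 0
double third_from_cast()     { return (double)1 / 3; }  // cast binds to the 1 first
double third_from_literals() { return 1.0 / 3.0; }      // plain double division
```

Only the first variant loses the fraction; the other two produce the same double.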
Here's a little function I knocked up.
#include <cfloat>
#include <cmath>
#include <cstdlib>
#define uniform() (rand()/(1.0 + RAND_MAX))
double CBRT(double Z)
{
    double guess = Z;
    double x, dx;
    int loopbreaker;
retry:  // restart point after a random perturbation of the guess
    x = guess * guess * guess;
    loopbreaker = 0;
    while (fabs(x - Z) > FLT_EPSILON)
    {
        dx = 3 * guess*guess;  // derivative of guess^3
        loopbreaker++;
        if (fabs(dx) < DBL_EPSILON || loopbreaker > 53)
        {
            guess += uniform() * 2 - 1.0;
            goto retry;
        }
        guess -= (x - Z) / dx;
        x = guess*guess*guess;
    }
    return guess;
}
It uses Newton-Raphson to find a cube root.
Sometimes Newton-Raphson gets stuck: if the guess is very close to 0, the derivative becomes very small, so the step can get large and the iteration can oscillate. So I've clamped it and forced it to restart if that happens.
If you need more accuracy you can change the FLT_EPSILONs.
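For comparison, here is a deterministic Newton-Raphson sketch that avoids the random restart by working on the absolute value and starting from a guess that is already close to the root. It is not the code above: the name newton_cbrt, the starting guess built with std::frexp and std::ldexp, the relative tolerance, and the iteration cap are all assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

double newton_cbrt(double z) {
    if (z == 0.0 || !std::isfinite(z)) {
        return z;  // 0, infinities and NaN pass through unchanged
    }
    const double a = std::fabs(z);

    // Starting guess: a power of two within about a factor of two of the root.
    int exponent = 0;
    std::frexp(a, &exponent);
    double guess = std::ldexp(1.0, exponent / 3);

    const double tolerance = 4.0 * std::numeric_limits<double>::epsilon();
    for (int i = 0; i < 100; ++i) {
        // Newton step for g^3 - a = 0: g - (g^3 - a) / (3 g^2)
        double next = (2.0 * guess + a / (guess * guess)) / 3.0;
        bool converged = std::fabs(next - guess) <= tolerance * next;
        guess = next;
        if (converged) {
            break;
        }
    }
    return std::copysign(guess, z);  // put the sign back on
}
```

Because the guess stays positive and never gets near 0, the derivative 3 * guess * guess cannot become tiny, which is the situation the random restart was guarding against.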
If you ever have no math library you can use this way to compute the cubic root:
cubic root
double curt(double x) {
    if (x == 0) {
        // would otherwise return something like 4.257959840008151e-109
        return 0;
    }
    double b = 1; // use any value except 0
    double last_b_1 = 0;
    double last_b_2 = 0;
    while (last_b_1 != b && last_b_2 != b) {
        last_b_1 = b;
        // use (2 * b + x / b / b) / 3 for small numbers, as suggested by willywonka_dailyblah
        b = (b + x / b / b) / 2;
        last_b_2 = b;
        // use (2 * b + x / b / b) / 3 for small numbers, as suggested by willywonka_dailyblah
        b = (b + x / b / b) / 2;
    }
    return b;
}
It is derived from the sqrt algorithm below. The idea is that, of b and x / b / b, one is larger and the other smaller than the cubic root of x, so the average of both lies closer to the cubic root of x.
Square Root And Cubic Root (in Python)
def sqrt_2(a):
    if a == 0:
        return 0
    b = 1
    last_b = 0
    while last_b != b:
        last_b = b
        b = (b + a / b) / 2
    return b
def curt_2(a):
    if a == 0:
        return 0
    b = a
    last_b_1 = 0
    last_b_2 = 0
    while last_b_1 != b and last_b_2 != b:
        last_b_1 = b
        b = (b + a / b / b) / 2
        last_b_2 = b
        b = (b + a / b / b) / 2
    return b
In contrast to the square root, last_b_1 and last_b_2 are required in the cubic root because b flickers. You can modify these algorithms to compute the fourth root, fifth root and so on.
Thanks to my math teacher Herr Brenner in 11th grade who told me this algorithm for sqrt.
I tested it on an Arduino with a 16 MHz clock frequency:
0.3525ms for yourPow
0.3853ms for nth-root
2.3426ms for curt