Algorithm for closed-form polynomial root finding - c++

I'm looking for a robust algorithm (or a paper describing an algorithm) that can find roots of polynomials (ideally up to the 4th degree, but anything will do) using a closed-form solution. I'm only interested in the real roots.
My first take on solving quadratic equations involved this (I also have code in similar style for cubics / quartics, but let's focus on quadratics right now):
* #brief a simple quadratic equation solver
* With double-precision floating-point, this reaches 1e-12 worst-case and 1e-15 average
* precision of the roots (the value of the function in the roots). The roots can be however
* quite far from the true roots, up to 1e-10 worst-case and 1e-18 average absolute difference
* for cases when two roots exist. If only a single root exists, the worst-case precision is
* 1e-13 and average-case precision is 1e-18.
* With single-precision floating-point, this reaches 1e-3 worst-case and 1e-7 average
* precision of the roots (the value of the function in the roots). The roots can be however
* quite far from the true roots, up to 1e-1 worst-case and 1e-10 average absolute difference
* for cases when two roots exist. If only a single root exists, the worst-case precision is
* 1e+2 (!) and average-case precision is 1e-2. Do not use single-precision floating point,
* except if pressed by time.
* All the precision measurements are scaled by the maximum absolute coefficient value.
* #tparam T is data type of the arguments (default double)
* #tparam b_sort_roots is root sorting flag (if set, the roots are
* given in ascending (not absolute) value; default true)
* #tparam n_2nd_order_coeff_log10_thresh is base 10 logarithm of threshold
* on the first coefficient (if below threshold, the equation is a linear one; default -6)
* #tparam n_zero_discriminant_log10_thresh is base 10 logarithm of threshold
* on the discriminant (if below negative threshold, the equation does not
* have a real root, if below threshold, the equation has just a single solution; default -6)
template <class T = double, const bool b_sort_roots = true,
const int n_2nd_order_coeff_log10_thresh = -6,
const int n_zero_discriminant_log10_thresh = -6>
class CQuadraticEq {
T a; /**< #brief the 2nd order coefficient */
T b; /**< #brief the 1st order coefficient */
T c; /**< #brief 0th order coefficient */
T p_real_root[2]; /**< #brief list of the roots (real parts) */
//T p_im_root[2]; // imaginary part of the roots
size_t n_real_root_num; /**< #brief number of real roots */
* #brief default constructor; solves for roots of \f$ax^2 + bx + c = 0\f$
* This finds roots of the given equation. It tends to find two identical roots instead of one, rather
* than missing one of two different roots - the number of roots found is therefore orientational,
* as the roots might have the same value.
* #param[in] _a is the 2nd order coefficient
* #param[in] _b is the 1st order coefficient
* #param[in] _c is 0th order coefficient
CQuadraticEq(T _a, T _b, T _c) // ax2 + bx + c = 0
:a(_a), b(_b), c(_c)
T _aa = fabs(_a);
if(_aa < f_Power_Static(10, n_2nd_order_coeff_log10_thresh)) { // otherwise division by a yields large numbers, this is then more precise
p_real_root[0] = -_c / _b;
//p_im_root[0] = 0;
n_real_root_num = 1;
// a simple linear equation
if(_aa < 1) { // do not divide always, that makes it worse
_b /= _a;
_c /= _a;
_a = 1;
// could copy the code here and optimize away division by _a (optimizing compiler might do it for us)
// improve numerical stability if the coeffs are very small
const double f_thresh = f_Power_Static(10, n_zero_discriminant_log10_thresh);
double f_disc = _b * _b - 4 * _a * _c;
if(f_disc < -f_thresh) // only really negative
n_real_root_num = 0; // only two complex roots
else if(/*fabs(f_disc) < f_thresh*/f_disc <= f_thresh) { // otherwise gives problems for double root situations
p_real_root[0] = T(-_b / (2 * _a));
n_real_root_num = 1;
} else {
f_disc = sqrt(f_disc);
int i = (b_sort_roots)? ((_a > 0)? 0 : 1) : 0; // produce sorted roots, if required
p_real_root[i] = T((-_b - f_disc) / (2 * _a));
p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
//p_im_root[0] = 0;
//p_im_root[1] = 0;
n_real_root_num = 2;
* #brief gets number of real roots
* #return Returns number of real roots (0 to 2).
size_t n_RealRoot_Num() const
_ASSERTE(n_real_root_num >= 0);
return n_real_root_num;
* #brief gets value of a real root
* #param[in] n_index is zero-based index of the root
* #return Returns value of the specified root.
T f_RealRoot(size_t n_index) const
_ASSERTE(n_index < 2 && n_index < n_real_root_num);
return p_real_root[n_index];
* #brief evaluates the equation for a given argument
* #param[in] f_x is value of the argument \f$x\f$
* #return Returns value of \f$ax^2 + bx + c\f$.
T operator ()(T f_x) const
T f_x2 = f_x * f_x;
return f_x2 * a + f_x * b + c;
The code is horrible, and I hate all the thresholds. But for random equations with roots in the [-100, 100] interval, this is not so bad:
root response precision 1e-100: 6315 cases
root response precision 1e-19: 2 cases
root response precision 1e-17: 2 cases
root response precision 1e-16: 6 cases
root response precision 1e-15: 6333 cases
root response precision 1e-14: 3765 cases
root response precision 1e-13: 241 cases
root response precision 1e-12: 3 cases
2-root solution precision 1e-100: 5353 cases
2-root solution precision 1e-19: 656 cases
2-root solution precision 1e-18: 4481 cases
2-root solution precision 1e-17: 2312 cases
2-root solution precision 1e-16: 455 cases
2-root solution precision 1e-15: 68 cases
2-root solution precision 1e-14: 7 cases
2-root solution precision 1e-13: 2 cases
1-root solution precision 1e-100: 3022 cases
1-root solution precision 1e-19: 38 cases
1-root solution precision 1e-18: 197 cases
1-root solution precision 1e-17: 68 cases
1-root solution precision 1e-16: 7 cases
1-root solution precision 1e-15: 1 cases
Note that this precision is relative to the magnitude of the coefficients, which is typically in the 10^6 range (so finally the precision is far from perfect, but probably mostly usable). Without the thresholds, however, it is near to useless.
I have tried using multiple-precision arithmetic, which generally works well but tends to reject many of the roots, simply because the coefficients of the polynomial are not multiple-precision and some polynomials cannot be represented exactly. If there is a double root in a 2nd degree polynomial, it mostly either splits it into two roots (which I wouldn't mind) or says that there is no root whatsoever. If I want to recover perhaps even slightly imprecise roots, my code gets complicated and full of thresholds.
So far, I've tried using CCmath, but either I can't use it correctly, or the precision is really bad. Also, it uses iterative (not closed-form) solver in plrt().
I have tried the GNU Scientific Library's gsl_poly_solve_quadratic(), but that seems to be a naive approach and not very numerically stable.
Using std::complex numbers naively also turned out to be a really bad idea, as both the precision and speed can be bad (especially with cubic / quartic equations where the code is heavy with transcendental functions).
Is recovering the roots as complex numbers the only way to go? Then no roots are missed and the user can select how precise the roots need to be (and thus ignore small imaginary components in less precise roots).

This isn't really answering your question but I think you can improve on what you've got since you currently have a 'loss of significance' problem when b^2 >> ac. In such cases, you end up with a formula along the lines of (-b + (b + eps))/(2 * a) where the cancellation of the b's can lose many significant figures from eps.
The correct way of handling this is to use the 'normal' equation for roots of a quadratic for one root and the lesser known 'alternative' or 'upside down' equation for the other root. Which way round you take them depends on the sign of _b.
A change to your code along the lines of the following should reduce the errors resulting from this.
if( _b > 0 ) {
    p_real_root[i] = T((-_b - f_disc) / (2 * _a));
    p_real_root[1 - i] = T((2 * _c) / (-_b - f_disc));
} else {
    p_real_root[i] = T((2 * _c) / (-_b + f_disc));
    p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
}
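To make the idea above concrete, here is a minimal standalone sketch of the same trick, written so that b and the square root of the discriminant are always added with the same sign. The function name solve_quadratic_stable is made up for illustration and is not part of the class in the question; it only covers the case of two distinct real roots and does not touch the thresholds discussed there.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper (not from the original post): real roots of
// a*x^2 + b*x + c = 0, assuming a != 0 and a positive discriminant.
// Returns the roots in ascending order.
std::pair<double, double> solve_quadratic_stable(double a, double b, double c)
{
    assert(a != 0.0);
    const double disc = b * b - 4.0 * a * c;
    assert(disc > 0.0); // this sketch covers the two-distinct-real-roots case only

    // q has the opposite sign of b, so b and the square root are never
    // subtracted from each other; that subtraction is the cancellation
    // described in the answer.
    const double q = -0.5 * (b + std::copysign(std::sqrt(disc), b));
    double x1 = q / a; // 'normal' formula
    double x2 = c / q; // 'upside down' formula
    if (x1 > x2) std::swap(x1, x2);
    return {x1, x2};
}
```

For x^2 - 1e8*x + 1 = 0 the small root is about 1e-8; the naive formula computes it as the difference of two numbers near 1e8 and loses most of its digits, while c / q does not.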


How to increase accuracy of floating point second derivative calculation?

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
    return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
    float h = 0.001;
    float dfdx;
    dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
    return dfdx;
}

// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
    float h = 0.001;
    float d2fdx2;
    d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
    return d2fdx2;
}
Main function:
int main() {
    float x = 2.0;
    pc.printf("**** Function Pointers ****\r\n");
    pc.printf("Value of f(%f): %f\r\n", x, f1(x));
    pc.printf("First derivative: %f\r\n", first_dx(f1, x));
    pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal places.
What you want is the 64-bit counterpart: double, with ~15.955 decimal places of accuracy.
That should be sufficient for your calculation; also worth mentioning is Boost's implementation, which offers a quadruple-precision floating point number (128-bit).
Finally, the GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of decimal places of precision.
Go analytical. ;-) probably not an option given "with the current
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.
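To illustrate the last two suggestions, here is a rough sketch that picks h from the precision of the type instead of hard-coding 0.001. The rounding error of the second difference grows like eps / h^2, which for float and h = 0.001 is already of order one, consistent with the 1.43 seen in the question. The helper name second_derivative and the eps^(1/4) rule of thumb are my own assumptions, not something from the posts above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: central second difference with a step size chosen
// from the precision of T instead of a fixed 0.001.
template <class T, class F>
T second_derivative(F f, T x)
{
    // Rounding error grows like eps / h^2 while truncation error shrinks like
    // h^2, so h ~ eps^(1/4) roughly balances the two (a rule of thumb that
    // assumes f and its derivatives are of order one).
    const T eps = std::numeric_limits<T>::epsilon();
    const T scale = std::max(T(1), std::fabs(x));
    const T h = std::pow(eps, T(0.25)) * scale;
    assert(h > T(0));
    return (f(x - h) - T(2) * f(x) + f(x + h)) / (h * h);
}
```

With float this gives an h of a few hundredths at x = 2, which should bring the result for x * x close to 2; with double the error is smaller still.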

Is there a way to optimize this function?

For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
    float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
    float a_fl = a, b_fl = b;
    float numerator = a_fl + b_fl;
    float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
    float final_value = numerator / denominator;
    return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.
You can avoid floating point and do the whole computation in an integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
    std::int64_t limit = std::numeric_limits<int16_t>::max();
    std::int64_t a_fl = a;
    std::int64_t b_fl = b;
    return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
                                 + ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
                                / (limit * limit + a_fl * b_fl));
}
but according to Benchmark, it is not faster for those values.
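One thing to watch with the integer version: the rounding term is always added, which appears to assume a non-negative numerator, so sums that come out negative would round toward zero differently from std::round. Below is a sketch of a sign-aware variant. The name add_special_int, the restriction of inputs to [-32767, 32767], and returning 0 for the 0/0 case are all my own assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical variant of the integer-only version. Assumes both inputs lie
// in [-32767, 32767]; for a == -b with |a| == 32767 the formula is 0/0, and
// this sketch returns 0 there because a + b is 0.
inline std::int16_t add_special_int(std::int16_t a, std::int16_t b)
{
    constexpr std::int64_t limit = std::numeric_limits<std::int16_t>::max();
    constexpr std::int64_t limit2 = limit * limit;
    assert(a >= -limit && b >= -limit);

    const std::int64_t num = limit2 * (std::int64_t{a} + b);
    const std::int64_t den = limit2 + std::int64_t{a} * b; // >= 0 under the assumption
    if (den == 0) return 0;

    // Round half away from zero, like std::round: the rounding term must have
    // the same sign as the numerator.
    const std::int64_t half = den / 2;
    return static_cast<std::int16_t>((num >= 0 ? num + half : num - half) / den);
}
```

For example, add_special_int(1000, -3000) gives -2006, matching what rounding -2005.6 with std::round would give.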
As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;

int16_t my_add_special(int16_t a, int16_t b)
{
    // integer multiplication instead of floating point division
    double numerator = (a + b) * i_max_2;
    double denominator = i_max_2 + a * b;
    // Approximated rounding instead of std::round
    return 0.5 + numerator / denominator;
}
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed point. Just the two divisions are a problem. Use floats just for them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
    float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
    float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
    return (numerator / denominator) + 0.5; // Relying on implementation defined rounding. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.

Calculating the area of overlap of two functions

I have two functions. I am giving here the basic structure only, as they have quite a few parameters each to adjust their exact shape.
For example, y = sin(.1*pi*x)^2 and y = e^-(x-5)^2.
The question is how much area of the sine is captured by the e function:
I tried to be clever and recursively find the points of intersection, but that turned out to be a lot more work than was necessary.
As n.m. pointed out, you want the integral from a to b of min(f, g). Since you're integrating by approximation, you're already stepping through the interval, meaning you can check at each step which function is greater and compute the area of the current trapezoid.
Simple implementation in C:
#include <math.h> /* fmin */

#define SLICES 10000

double trapezoid (double a, double b, double height);

/**
 * Computes the integral of min(f, g) on [a, b].
 *
 * Intended use is for when f and g are both non-negative, real-valued
 * functions of one variable.
 * That is, f: R -> R and g: R -> R.
 *
 * Assumes b ≥ a.
 *
 * @param a left boundary of interval to integrate over
 * @param b right boundary of interval to integrate over
 * @param f function accepting one double argument which returns a double
 * @param g function accepting one double argument which returns a double
 * @return integral of min(f, g) on [a, b]
 */
double minIntegrate (double a, double b, double (*f)(double), double (*g)(double)) {
    double area = 0.0;
    // the height of each trapezoid
    double deltaX = (b - a) / SLICES;
    /*
     * We are integrating by partitioning the interval into SLICES pieces, then
     * adding the areas of the trapezoids formed to our running total.
     * To save a computation, we can cache the last side encountered.
     * That is, let lastSide be the minimum of f(x) and g(x), where x was the
     * previous "fence post" (side of the trapezoid) encountered.
     *
     * Initialize lastSide with the minimum of f and g at the left boundary.
     */
    double lastSide = fmin(f(a), g(a));
    // The loop starts at 1 since we already have the last (trapezoid) side
    // for the 0th fencepost.
    for (int i = 1; i <= SLICES; i++) {
        double fencePost = a + (i * deltaX);
        double currentSide = fmin(f(fencePost), g(fencePost));
        area += trapezoid(lastSide, currentSide, deltaX);
        lastSide = currentSide;
    }
    return area;
}

/**
 * Computes the area of a trapezoid with bases `a` and `b` and height `height`.
 */
double trapezoid (double a, double b, double height) {
    return height * (a + b) / 2.0;
}
If you're looking for something really, really simple, why don't you do Monte Carlo Integration?
Use the fact that the functions are easy to calculate to sample a large number of points. For each point, check whether it's below 0, 1, or 2 of the curves.
You might have some fiddling to find the boundaries for the sampling, but this method will work for a variety of curves.
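A rough sketch of that hit-or-miss idea, for what it's worth. The function name overlap_area_mc, the fixed seed and the requirement that you supply an upper bound y_max for the curves are my own assumptions; a point is counted only when it lies below both curves, so the estimate is for the area under min(f, g).

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical helper: hit-or-miss Monte Carlo estimate of the area under
// min(f, g) on [a, b], assuming 0 <= min(f, g) <= y_max on that interval.
template <class F, class G>
double overlap_area_mc(F f, G g, double a, double b, double y_max,
                       int samples, unsigned seed = 12345)
{
    assert(b > a && y_max > 0.0 && samples > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist_x(a, b);
    std::uniform_real_distribution<double> dist_y(0.0, y_max);

    int hits = 0;
    for (int i = 0; i < samples; ++i) {
        const double x = dist_x(rng);
        const double y = dist_y(rng);
        // A point counts only if it lies below both curves.
        if (y < f(x) && y < g(x)) ++hits;
    }
    // Fraction of hits times the area of the sampling box.
    return (b - a) * y_max * hits / samples;
}
```

The statistical error shrinks only like 1 / sqrt(samples), so expect two or three correct digits from a million points; the trapezoid approach converges much faster for smooth curves like these.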
I'm guessing your exponential is actually of the form e^-(x-5)^2 so that the exponential decays to zero at plus/minus infinity.
Given that, your integral would be most quickly and accurately calculated by something called Gaussian quadrature. There are a few types of common integrals which have very simple solutions using different polynomials (Hermite, Legendre, etc.). Yours specifically looks like it could be solved using Gauss-Hermite quadrature.
Hope this helps.

Strange multiplication result

In my C++ code I have these multiplications, with all variables of type double[]:
f1[0] = (f1_rot[0] * xu[0]) + (f1_rot[1] * yu[0]);
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
f1[2] = (f1_rot[0] * xu[2]) + (f1_rot[1] * yu[2]);
f2[0] = (f2_rot[0] * xu[0]) + (f2_rot[1] * yu[0]);
f2[1] = (f2_rot[0] * xu[1]) + (f2_rot[1] * yu[1]);
f2[2] = (f2_rot[0] * xu[2]) + (f2_rot[1] * yu[2]);
corresponding to these values
Force Rot1 : -5.39155e-07, -3.66312e-07
Force Rot2 : 4.04383e-07, -1.51852e-08
xu: 0.786857, 0.561981, 0.255018
yu: 0.534605, -0.82715, 0.173264
F1: -6.2007e-07, -4.61782e-16, -2.00963e-07
F2: 3.10073e-07, 2.39816e-07, 1.00494e-07
this multiplication in particular produces a wrong value -4.61782e-16 instead of 1.04745e-13
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
I hand verified the other multiplications on a calculator and they all seem to produce the correct values.
this is an open mpi compiled code and the above result is for running a single processor, there are different values when running multiple processors for example 40 processors produces 1.66967e-13 as result of F1[1] multiplication.
Is this some kind of mpi bug ? or a type precision problem ? and why does it work okay for the other multiplications ?
Your problem is an obvious result of what is called catastrophic cancellation:
As we know, a double precision float can handle numbers of around 16 significant decimals.
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1])
= -3.0299486605499998e-07 + 3.0299497080000003e-07
= 1.0474500005332475e-13
This is what we obtain with the numbers you have given in your example.
Notice that (-7) - (-13) = 6, which corresponds to the number of decimals in the float you give in your example: (ex: -5.39155e-07 -3.66312e-07, each mantissa is of a precision of 6 decimals). It means that you used here single precision floats.
I am sure that in your calculations, the precision of your numbers is bigger, that's why you find a more precise result.
Anyway, if you use single precision floats, you can't expect better precision. With double precision, you get up to about 16 significant decimal digits. You shouldn't trust a difference between two numbers unless its relative size exceeds the precision of the mantissa:
Single precision floats: (a - b) / b >= ~1e-7
Double precision floats: (a - b) / b >= ~4e-16
For further information, see these examples ... or the table in this article ...
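To put a number on this, here is a small sketch that computes the condition number of the final addition, using only the values printed in the question. The helper names sum_condition and f1_1_from_printed_values are made up for illustration. The condition number tells you by how much the relative error of the two products is amplified in their sum.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: condition number of the sum p + q, i.e. how much the
// relative error of the inputs is amplified in the result.
inline double sum_condition(double p, double q)
{
    assert(p + q != 0.0);
    return (std::fabs(p) + std::fabs(q)) / std::fabs(p + q);
}

// The two products from the question, built from the printed 6-digit values.
inline double f1_1_from_printed_values()
{
    const double p = -5.39155e-07 * 0.561981;
    const double q = -3.66312e-07 * -0.82715;
    return p + q;
}
```

For these inputs the condition number comes out at several million, so values known to only 6 significant digits cannot determine the sum at all; tiny differences in the unprinted digits (or in summation order across processes) are enough to move the result between -4.6e-16 and 1.7e-13.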

Create sine lookup table in C++

How can I rewrite the following pseudocode in C++?
real array sine_table[-1000..1000]
for x from -1000 to 1000
sine_table[x] := sine(pi * x / 1000)
I need to create a sine_table lookup table.
You can reduce the size of your table to 25% of the original by only storing values for the first quadrant, i.e. for x in [0,pi/2].
To do that your lookup routine just needs to map all values of x to the first quadrant using simple trig identities:
sin(x) = - sin(-x), to map from quadrant IV to I
sin(x) = sin(pi - x), to map from quadrant II to I
To map from quadrant III to I, apply both identities, i.e. sin(x) = - sin (pi + x)
Whether this strategy helps depends on how much memory usage matters in your case. But it seems wasteful to store four times as many values as you need just to avoid a comparison and subtraction or two during lookup.
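A minimal sketch of such a quarter-wave lookup, keeping the question's indexing (k in [-1000, 1000] meaning pi * k / 1000). The class name QuarterSineTable is hypothetical; it stores 501 values and uses the two identities above to cover the remaining indices.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical quarter-wave table: stores sin(pi * k / 1000) for k = 0..500
// only and maps every other index in [-1000, 1000] onto that range.
class QuarterSineTable {
public:
    QuarterSineTable()
    {
        const double pi = std::acos(-1.0);
        for (int k = 0; k <= 500; ++k)
            table_[k] = std::sin(pi * k / 1000.0);
    }

    // Returns an approximation of sin(pi * k / 1000) for k in [-1000, 1000].
    double operator()(int k) const
    {
        assert(k >= -1000 && k <= 1000);
        if (k < 0) return -(*this)(-k); // sin(x) = -sin(-x)
        if (k > 500) k = 1000 - k;      // sin(x) = sin(pi - x)
        return table_[k];
    }

private:
    std::array<double, 501> table_;
};
```

The lookup costs at most two comparisons and a subtraction on top of the array access, which is the trade-off described above.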
I second Jeremy's recommendation to measure whether building a table is better than just using std::sin(). Even with the original large table, you'll have to spend cycles during each table lookup to convert the argument to the closest increment of pi/1000, and you'll lose some accuracy in the process.
If you're really trying to trade accuracy for speed, you might try approximating the sin() function using just the first few terms of the Taylor series expansion.
sin(x) = x - x^3/3! + x^5/5! ..., where ^ represents raising to a power and ! represents the factorial.
Of course, for efficiency, you should precompute the factorials and make use of the lower powers of x to compute higher ones, e.g. use x^3 when computing x^5.
One final point: the truncated Taylor series above is more accurate for values closer to zero, so it's still worthwhile to map to the first or fourth quadrant before computing the approximate sine.
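As a sketch of the truncated series with the factorials precomputed and the powers of x reused, here is a version that keeps the terms up to x^7 in Horner form. The name sin_taylor7 and the choice of four terms are assumptions for illustration; the first omitted term, x^9/9!, suggests an error of roughly 1.6e-4 at pi/2 and roughly 3e-7 at pi/4.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: sin(x) from the Taylor terms up to x^7, evaluated in
// Horner form. Intended for x in [-pi/2, pi/2]; accuracy degrades outside.
inline double sin_taylor7(double x)
{
    assert(std::fabs(x) <= 1.5708);
    const double x2 = x * x; // reused for every higher power
    // x * (1 - x^2/3! + x^4/5! - x^6/7!) with the 1/n! factors as constants
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}
```

Whether this beats std::sin() on a given machine is something to measure rather than assume.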
Yet one more potential improvement based on two observations:
1. You can compute any trig function if you can compute both the sine and cosine in the first octant [0,pi/4]
2. The Taylor series expansion centered at zero is more accurate near zero
So if you decide to use a truncated Taylor series, then you can improve accuracy (or use fewer terms for similar accuracy) by mapping to either the sine or cosine to get the angle in the range [0,pi/4] using identities like sin(x) = cos(pi/2-x) and cos(x) = sin(pi/2-x) in addition to the ones above (for example, if x > pi/4 once you've mapped to the first quadrant.)
Or if you decide to use a table lookup for both the sine and cosine, you could get by with two smaller tables that only covered the range [0,pi/4] at the expense of another possible comparison and subtraction on lookup to map to the smaller range. Then you could either use less memory for the tables, or use the same memory but provide finer granularity and accuracy.
long double sine_table[2001];
for (int index = 0; index < 2001; index++)
sine_table[index] = std::sin(PI * (index - 1000) / 1000.0);
One more point: calling trigonometric functions is pricey. If you want to prepare the lookup table for sine with a constant step, you may save calculation time at the expense of some potential precision loss.
Consider your minimal step is "a". That is, you need sin(a), sin(2a), sin(3a), ...
Then you may do the following trick: First calculate sin(a) and cos(a). Then for every consecutive step use the following trigonometric equalities:
sin([n+1] * a) = sin(n*a) * cos(a) + cos(n*a) * sin(a)
cos([n+1] * a) = cos(n*a) * cos(a) - sin(n*a) * sin(a)
The drawback of this method is that during this procedure the round-off error is accumulated.
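A minimal sketch of that recurrence, assuming a table that starts at sin(0) and uses step a. The function name sine_table_by_recurrence is made up for illustration. Note that both values have to be advanced from the old s and c, otherwise the second identity would use an already updated sine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: fills a table with sin(n * a) for n = 0..count-1 using
// the angle-addition identities, calling std::sin and std::cos only once.
inline std::vector<double> sine_table_by_recurrence(double a, int count)
{
    assert(count > 0);
    const double sin_a = std::sin(a);
    const double cos_a = std::cos(a);
    std::vector<double> table(count);
    double s = 0.0; // sin(0 * a)
    double c = 1.0; // cos(0 * a)
    for (int n = 0; n < count; ++n) {
        table[n] = s;
        // Advance both values together so each update uses the old s and c.
        const double s_next = s * cos_a + c * sin_a;
        const double c_next = c * cos_a - s * sin_a;
        s = s_next;
        c = c_next;
    }
    return table;
}
```

For a table of about a thousand doubles the accumulated round-off stays far below the error you already accept by quantising the angle, but for long tables or single precision it is worth re-seeding with an exact std::sin() every so often.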
double sine_table[1000] = {0};
for (int i = 1; i <= 1000; i++)
    sine_table[i-1] = std::sin(PI * i / 1000.0);

double getSineValue(int multipleOfPi){
    if(multipleOfPi == 0) return 0.0;
    int sign = 1;
    if(multipleOfPi < 0){
        sign = -1;
    }
    return sign * sine_table[sign * multipleOfPi - 1];
}
You can reduce the array length to 500, by a trick sin(pi/2 +/- angle) = +/- cos(angle).
So store sin and cos from 0 to pi/4.
I don't remember the details off the top of my head, but it increased the speed of my program.
You'll want the std::sin() function from <cmath>.
another approximation from a book or something
streamin ramp;
streamout sine;
float x,rect,k,i,j;
x = ramp -0.5;
rect = x * (1 - x < 0 & 2);
k = (rect + 0.42493299) *(rect -0.5) * (rect - 0.92493302) ;
i = 0.436501 + (rect * (rect + 1.05802));
j = 1.21551 + (rect * (rect - 2.0580201));
sine = i*j*k*60.252201*x;
full discussion here:
I presume you know that division is a lot slower than multiplying by a decimal number; /5 is always slower than *0.2.
it's just an approximation.
streamin ramp;
streamin x; // 1.5 = Saw 3.142 = Sin 4.5 = SawSin
streamout sine;
float saw,saw2;
saw = (ramp * 2 - 1) * x;
saw2 = saw * saw;
sine = -0.166667 + saw2 * (0.00833333 + saw2 * (-0.000198409 + saw2 * (2.7526e-006+saw2 * -2.39e-008)));
sine = saw * (1+ saw2 * sine);