Random number generation Function explanation - c++

Can anyone explain the two lines inside this function?
int getRandomNumber(int min, int max)
{
    static const double fraction = 1.0 / (RAND_MAX + 1.0);
    return min + static_cast<int>((max - min + 1) * (rand() * fraction));
}

Looks like it's constraining rand() function's output to fall inside a min and max.
A double value, fraction, is calculated as 1.0 / (RAND_MAX + 1.0).
RAND_MAX is a macro defined in <cstdlib>; it is the largest value rand() can return (guaranteed to be at least 32767), not the largest int your program can use. A higher RAND_MAX makes fraction smaller, since fraction is a reciprocal: the reciprocal of 4 is 1/4 or 0.25, and the reciprocal of 5 is 1/5 or 0.20.
The 1.0's make the arithmetic floating point: RAND_MAX is implicitly converted to double, so the / operator doesn't do integer division (5 / 2 == 2 vs. 5.0 / 2.0 == 2.5). Adding 1.0 rather than 1 also avoids integer overflow on platforms where RAND_MAX equals INT_MAX.
return min + static_cast<int>((max - min + 1) * (rand() * fraction));
Return the integer representation of the min/max spread reduced by a random factor, added to the original minimum.
This line uses the min parameter value as a 'floor'. static_cast<int>() truncates the floating point value of ((max - min + 1) * (rand() * fraction)) toward zero, which for this non-negative value means rounding down to a whole number with no decimal part. The cast is needed to return an int, and because it rounds down it also ensures that max is never exceeded.
(max - min + 1) is the spread between the max and min parameters + 1. So if max == min you would be multiplying (rand() * fraction) by 1 instead of zero.
rand() generates a pseudo-random integer (no decimal parts) between 0 and RAND_MAX, inclusive.
Since fraction is the reciprocal of RAND_MAX + 1, rand() * fraction is always in [0, 1), so the result is a random portion of the min/max spread. The key part of understanding this function, beyond the confusion of mixed C and C++ code, is knowing that RAND_MAX is used by both the fraction variable and the rand() function.
Think of the (rand() * fraction) part as a portion of distance from min
I would try feeding this function multiple values, tweaking the min, max, and fraction values each time, and see how the output changes; you could probably find a pattern.
By making fraction smaller than 1.0 / (RAND_MAX + 1.0) you can cluster the return values closer to the minimum.
(look up math ceiling and floor, and walnut's comment about uniform distribution). This can be done to smooth output, or make something more predictable, or cluster return values around an input value. If the math is the confusing part for you then messing around with code and seeing what happens will likely help your understanding and intuition of math functions.
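To experiment along these lines, here is a minimal sketch that separates the mapping from the call to rand(), so the two extreme raw values can be fed in directly. The helper name mapRaw is hypothetical and not part of the original function; the arithmetic is the same as in the question.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: maps a raw value in [0, RAND_MAX] onto [min, max]
// using the same arithmetic as the function in the question.
int mapRaw(int raw, int min, int max)
{
    assert(raw >= 0 && raw <= RAND_MAX);
    assert(min <= max);
    const double fraction = 1.0 / (RAND_MAX + 1.0); // always < 1
    // raw * fraction lies in [0, 1); scaling by the spread and truncating
    // yields an offset in [0, max - min].
    return min + static_cast<int>((max - min + 1) * (raw * fraction));
}

// Same behaviour as the original function, expressed through the helper.
int getRandomNumber(int min, int max)
{
    return mapRaw(std::rand(), min, max);
}
```

Calling mapRaw(0, 1, 6) gives the minimum and mapRaw(RAND_MAX, 1, 6) gives the maximum, which shows why the + 1 in the spread is needed and why the result never exceeds max.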
Simpson's Composite Rule giving too large values for when n is very large

Using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x), however when using a large n (usually around 500,000), I start to get results that vary from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. Wondering why this is, since as n gets larger, the accuracy is supposed to increase and not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
    return (float) 1 / std::log(x);
}

float my_simpson(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1; // since n has to be even
    float area, h = (b-a)/n;
    float x, y, z;
    for (int i = 1; i <= n/2; i++)
    {
        x = a + (2*i - 2)*h;
        y = a + (2*i - 1)*h;
        z = a + 2*i*h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area*h/3;
}

int main()
{
    int upperBound = 1'000;
    int subsplits = 1'000'000;
    float approx = my_simpson(2, upperBound, subsplits);
    std::cout << "Output: " << approx << std::endl;
    return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real number (in the mathematical sense), a float has limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number dedicates only 24 bits (one of which is implicit) to the significand, which translates to fewer than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweet spot, but beyond it the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, because area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the true value.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).
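Putting both points together, here is a minimal sketch of the same routine with double precision and a zero-initialized accumulator. The name simpson is an illustrative choice; the structure otherwise follows the code in the question.

```cpp
#include <cassert>
#include <cmath>

// The integrand 1/ln(x), evaluated in double precision.
double f(double x)
{
    return 1.0 / std::log(x);
}

// Composite Simpson's rule on [a, b] with n subintervals (n is made even).
double simpson(double a, double b, long n)
{
    assert(n > 0);
    if (n % 2 == 1) n += 1;      // Simpson's rule needs an even n
    const double h = (b - a) / n;
    double area = 0.0;           // explicitly initialized, unlike the original
    for (long i = 1; i <= n / 2; ++i) {
        const double x = a + (2 * i - 2) * h;
        const double y = a + (2 * i - 1) * h;
        const double z = a + 2 * i * h;
        area += f(x) + 4.0 * f(y) + f(z);
    }
    return area * h / 3.0;
}
```

With these changes, simpson(2.0, 1000.0, 1000000) should agree with the reference value of about 176.5644 quoted in the question. For much larger n, rounding error in the running sum still grows, only far more slowly than with float.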

Compute integer bounds to include scaled floating point values

I am trying to compute integer array bounds that will include floating point limits divided by a scale. For example, if my origin is 0, my floating point maximum is 10 and my scale is 10, then my integer array bounds need to be 2. The obvious formula is to divide my bounds by the scale, giving the incorrect result of 1.
I need to divide the inclusive maximum values by the scale and add one if the division is an exact multiple.
I am running into a mismatch between the normal way to define and use integer array indexes and my desired way to use real value coordinates. I am trying to map inclusive real value coordinates into integer array indexes, using a scaling term.
(I am actually working with two dimensional maps, but the problem can be expressed more simply in one dimension.)
This is wrong:
int get_array_size(double scale, double maximum)
{
    return std::ceil(maximum / scale); // Fails on exact multiples
}
This is wasteful:
int get_array_size(double scale, double maximum)
{
    return 1 + std::ceil(maximum / scale); // Allocates extra array memory
}
This is ugly and I am not sure if it is correct:
int get_array_size(double scale, double maximum)
{
    if (std::fmod(maximum, scale) == 0) // I am not sure if this is correct (% is not defined for double, so std::fmod is used)
        return 1 + std::ceil(maximum / scale);
    return std::ceil(maximum / scale); // Maybe I can eliminate the call to std::ceil?
}
I am trying to get the value maximum / scale on every open ended interval ending at multiples of scale and 1 + maximum / scale on every interval from >= multiple of scale ending at < multiple of scale + 1. I am not sure how to correctly express this in mathematical terms or how to implement it in C++. I would be grateful if someone can clarify my understanding and point me in the right direction.
Mathematically I think I am trying to define f(x, s) = y s.t. if s * n <= x and x < s * (n + 1) then y = n + 1. I want to implement this efficiently and respect the difference between <= and < comparison.
The way I interpret this question, I think maximum and scale don't actually matter - what you are really asking about is how to correctly map from floats to ints with specific boundary conditions. For example [0.0, 1.0) to 0, [1.0, 2.0) to 1, etc. So the question becomes a bit simpler if we just consider maximum / scale to be a single quantity; I'll call it t.
I believe you actually want to use std::floor instead of std::ceil:
int scaled_coord_to_index(float t) {
    return static_cast<int>(std::floor(t));
}
And the size of your array should always be the maximum scaled coordinate + 1 (with negative values normalized to start at 0).
int array_size(float min_t, float max_t) {
    // NOTE: This will "anchor" your coords based on the most negative value.
    // e.g. if that value is 1.6, then your bins will be [1.6, 2.6), [2.6, 3.6), etc.
    // To change that behavior you could use std::floor(min_t) instead.
    return scaled_coord_to_index(max_t - min_t) + 1;
}
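Applied to the signature from the question, the floor-plus-one rule could be sketched as follows. This assumes an origin of 0, a positive scale and non-negative coordinates; coord_to_index is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: index of the bin [n * scale, (n + 1) * scale) containing coord.
int coord_to_index(double coord, double scale)
{
    assert(scale > 0.0 && coord >= 0.0); // assumption: origin at 0, positive scale
    return static_cast<int>(std::floor(coord / scale));
}

// The array must hold every index up to and including that of the inclusive maximum.
int get_array_size(double scale, double maximum)
{
    return coord_to_index(maximum, scale) + 1;
}
```

For a scale of 10 this gives a size of 1 for a maximum of 9.9 and a size of 2 for a maximum of exactly 10, so no special case for exact multiples is needed.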

constrain a value -pi to pi for precision buff

What is the best way to constrain any value from -pi to pi ?
I currently have:
if (fAngle > XM_PI) {
    fAngle = fAngle - XM_2PI;
}
else if (fAngle < -XM_PI) {
    fAngle = fAngle + XM_2PI;
}
However, I fear those if's should instead be while's
For reference, under the Exploit Symmetrical Functions section:
Extra bit of precision!
Adding or subtracting XM_2PI cannot restore any accuracy that has been lost. In fact, it adds noise, generally losing more accuracy, because XM_2PI is necessarily only an approximation of 2π. It has some error itself, so adding or subtracting it adds or subtracts the error in the approximation.
What it can do is keep you from losing more accuracy by ensuring that future results remain low in magnitude, thus remaining in a region where the floating-point format has more precision than if the number grew beyond 4, 8, 16, or other points where the exponent changes and the absolute precision becomes worse.
If you already have some value x outside [−π, π] and want its sine or cosine, you should get the best result by using sin(x) or cos(x) directly. Good implementations of sin and cos will reduce the argument using a high-precision value for 2π, so you will get a better result than using sin(x-XM_2PI) or cos(x-XM_2PI) (unless, by chance, the various errors in these happen to cancel).
So your task with trigonometric functions is not to reduce values you already have but to design your algorithms to keep values from growing. Adding or subtracting 2π is a reasonable way to do this. However, when you do it, add or subtract an extended-precision version of 2π, not just XM_2PI. You can do this by representing 2π as XM_2PI (which should be the value representable in floating-point that is closest to 2π) plus some residue r. r should be the value representable in floating-point that is closest to 2π−XM_2PI. You can calculate that with extended-precision software such as GMP or Maple and can likely find it online. (I do not have it handy or I would paste it here; anybody else is welcome to edit it in.) Then you would update your angle with fAngle = fAngle - XM_2PI - r; or fAngle = fAngle + XM_2PI + r;.
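As a rough sketch of that two-part subtraction: the constants below are stand-ins, not DirectXMath definitions. kTwoPiHi is assumed to equal the float nearest to 2π (what XM_2PI should be), and kTwoPiLo is the residue r rounded to float, taken from the dResidue constant that appears in the code further down in this thread.

```cpp
#include <cassert>
#include <cmath>

// Assumed stand-in for XM_2PI: the float closest to 2*pi.
constexpr float kTwoPiHi = 6.283185307f;
// Residue r = 2*pi - kTwoPiHi, rounded to float (value assumed from dResidue below).
constexpr float kTwoPiLo = -1.7484555e-07f;

// Subtracts 2*pi in two steps: the large part first, then the small correction.
float subtract_two_pi(float angle)
{
    assert(std::isfinite(angle));
    return (angle - kTwoPiHi) - kTwoPiLo;
}
```

For an angle such as 6.5f the first subtraction is exact, so the second step recovers most of the error that subtracting kTwoPiHi alone would leave behind.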
An exception is if you have the angle measured in some unit that you can represent or reduce exactly, such as in degrees (which you can reduce by 360º with no error as long as the number of degrees itself is represented with no error) or in time (such as number of seconds for some function with a period of a day or other rational number of seconds, so you can again reduce with no error). In that case, you can let the angle grow as long as you can represent it exactly, and you would reduce it modulo the period prior to converting it to radians.
The simplest coding way is to use the math library function remainder, as in
fAngle = remainder(fAngle, XM_2PI);
STATIC_INLINE_PURE float const __vectorcall constrain(float const fAngle)
{
    static constexpr double const
        dPI(std::numbers::pi), // assumed definition; not shown in the snippet as posted
        d2PI(2.0 * std::numbers::pi),
        dResidue(-1.74845553146951715461909770965576171875e-07); // abs difference between d2PI(double precision) and XM_2PI(float precision)

    double dAngle(fAngle);
    dAngle = std::remainder(dAngle, d2PI);

    if (dAngle > dPI) {
        dAngle = dAngle - d2PI - dResidue;
    }
    else if (dAngle < -dPI) {
        dAngle = dAngle + d2PI + dResidue;
    }
    return static_cast<float>(dAngle); // the return statement is missing from the snippet as posted
}
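For comparison, a minimal sketch of the plain std::remainder approach in double precision. The name wrap_angle and the constant kPi are illustrative choices, not taken from DirectXMath.

```cpp
#include <cassert>
#include <cmath>

// pi to double precision (assumed constant for this sketch).
constexpr double kPi = 3.14159265358979323846;

// Wraps an angle in radians into [-pi, pi] using the IEEE remainder operation.
double wrap_angle(double angle)
{
    assert(std::isfinite(angle));
    // std::remainder returns angle - n * (2*pi), with n the integer nearest
    // to angle / (2*pi), so the result's magnitude is at most pi.
    return std::remainder(angle, 2.0 * kPi);
}
```

Because std::remainder already lands in [-pi, pi], no follow-up if or while is needed. The result is still only as accurate as the double approximation of 2π allows.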

Is there a way to optimize this function?

For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
    float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
    float a_fl = a, b_fl = b;
    float numerator = a_fl + b_fl;
    float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
    float final_value = numerator / denominator;
    return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.
You could avoid floating point and do all the computation in an integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
    std::int64_t limit = std::numeric_limits<int16_t>::max();
    std::int64_t a_fl = a;
    std::int64_t b_fl = b;
    return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
                                 + ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
                                / (limit * limit + a_fl * b_fl));
}
but according to Benchmark, it is not faster for those values.
As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;
int16_t my_add_special(int16_t a, int16_t b)
{
    // integer multiplication instead of floating point division
    double numerator = (a + b) * i_max_2;
    double denominator = i_max_2 + a * b;
    // Approximated rounding instead of std::round
    return 0.5 + numerator / denominator;
}
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed point. Only the two divisions are a problem. Use floats just for them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
    float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
    float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
    return (numerator / denominator) + 0.5; // Relies on the truncating conversion to integer, which rounds correctly only for non-negative results. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.
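To check an integer-only variant against the sample values in the question, something like the following sketch could be used. The name add_special_int is hypothetical. The rounding trick (adding half the denominator before dividing) assumes non-negative inputs, and inputs such as a = 32767, b = -32767 would make the denominator zero, so they are excluded here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical name: integer-only variant of add_special for non-negative inputs.
std::int16_t add_special_int(std::int16_t a, std::int16_t b)
{
    assert(a >= 0 && b >= 0); // assumption: the rounding below is only valid here
    const std::int64_t limit = std::numeric_limits<std::int16_t>::max(); // 32767
    const std::int64_t limit2 = limit * limit;
    // (a + b) / (1 + a*b / limit^2), with both sides multiplied by limit^2.
    const std::int64_t numerator = limit2 * (std::int64_t{a} + b);
    const std::int64_t denominator = limit2 + std::int64_t{a} * b;
    // Adding half the denominator turns truncating division into round-to-nearest.
    return static_cast<std::int16_t>((numerator + denominator / 2) / denominator);
}
```

All intermediate values stay well below the int64_t limit (the numerator is at most about 7e13), so overflow is not a concern for 16-bit inputs. Whether this is faster than the floating point version depends on the target, as the benchmark mentioned above suggests.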

Algorithm for closed-form polynomial root finding

I'm looking for a robust algorithm (or a paper describing an algorithm) that can find roots of polynomials (ideally up to the 4th degree, but anything will do) using a closed-form solution. I'm only interested in the real roots.
My first take on solving quadratic equations involved this (I also have code in similar style for cubics / quartics, but let's focus on quadratics right now):
/**
 * @brief a simple quadratic equation solver
 *
 * With double-precision floating-point, this reaches 1e-12 worst-case and 1e-15 average
 * precision of the roots (the value of the function in the roots). The roots can be however
 * quite far from the true roots, up to 1e-10 worst-case and 1e-18 average absolute difference
 * for cases when two roots exist. If only a single root exists, the worst-case precision is
 * 1e-13 and average-case precision is 1e-18.
 *
 * With single-precision floating-point, this reaches 1e-3 worst-case and 1e-7 average
 * precision of the roots (the value of the function in the roots). The roots can be however
 * quite far from the true roots, up to 1e-1 worst-case and 1e-10 average absolute difference
 * for cases when two roots exist. If only a single root exists, the worst-case precision is
 * 1e+2 (!) and average-case precision is 1e-2. Do not use single-precision floating point,
 * except if pressed by time.
 *
 * All the precision measurements are scaled by the maximum absolute coefficient value.
 *
 * @tparam T is data type of the arguments (default double)
 * @tparam b_sort_roots is root sorting flag (if set, the roots are
 *     given in ascending (not absolute) value; default true)
 * @tparam n_2nd_order_coeff_log10_thresh is base 10 logarithm of threshold
 *     on the first coefficient (if below threshold, the equation is a linear one; default -6)
 * @tparam n_zero_discriminant_log10_thresh is base 10 logarithm of threshold
 *     on the discriminant (if below negative threshold, the equation does not
 *     have a real root, if below threshold, the equation has just a single solution; default -6)
 */
template <class T = double, const bool b_sort_roots = true,
    const int n_2nd_order_coeff_log10_thresh = -6,
    const int n_zero_discriminant_log10_thresh = -6>
class CQuadraticEq {
protected:
    T a; /**< @brief the 2nd order coefficient */
    T b; /**< @brief the 1st order coefficient */
    T c; /**< @brief 0th order coefficient */
    T p_real_root[2]; /**< @brief list of the roots (real parts) */
    //T p_im_root[2]; // imaginary part of the roots
    size_t n_real_root_num; /**< @brief number of real roots */

public:
    /**
     * @brief default constructor; solves for roots of \f$ax^2 + bx + c = 0\f$
     *
     * This finds roots of the given equation. It tends to find two identical roots instead of one, rather
     * than missing one of two different roots - the number of roots found is therefore orientational,
     * as the roots might have the same value.
     *
     * @param[in] _a is the 2nd order coefficient
     * @param[in] _b is the 1st order coefficient
     * @param[in] _c is 0th order coefficient
     */
    CQuadraticEq(T _a, T _b, T _c) // ax2 + bx + c = 0
        :a(_a), b(_b), c(_c)
    {
        T _aa = fabs(_a);
        if(_aa < f_Power_Static(10, n_2nd_order_coeff_log10_thresh)) { // otherwise division by a yields large numbers, this is then more precise
            p_real_root[0] = -_c / _b;
            //p_im_root[0] = 0;
            n_real_root_num = 1;
            return; // do not fall through to the quadratic case
        }
        // a simple linear equation

        if(_aa < 1) { // do not divide always, that makes it worse
            _b /= _a;
            _c /= _a;
            _a = 1;

            // could copy the code here and optimize away division by _a (optimizing compiler might do it for us)
        }
        // improve numerical stability if the coeffs are very small

        const double f_thresh = f_Power_Static(10, n_zero_discriminant_log10_thresh);
        double f_disc = _b * _b - 4 * _a * _c;
        if(f_disc < -f_thresh) // only really negative
            n_real_root_num = 0; // only two complex roots
        else if(/*fabs(f_disc) < f_thresh*/f_disc <= f_thresh) { // otherwise gives problems for double root situations
            p_real_root[0] = T(-_b / (2 * _a));
            n_real_root_num = 1;
        } else {
            f_disc = sqrt(f_disc);
            int i = (b_sort_roots)? ((_a > 0)? 0 : 1) : 0; // produce sorted roots, if required
            p_real_root[i] = T((-_b - f_disc) / (2 * _a));
            p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
            //p_im_root[0] = 0;
            //p_im_root[1] = 0;
            n_real_root_num = 2;
        }
    }

    /**
     * @brief gets number of real roots
     * @return Returns number of real roots (0 to 2).
     */
    size_t n_RealRoot_Num() const
    {
        _ASSERTE(n_real_root_num >= 0);
        return n_real_root_num;
    }

    /**
     * @brief gets value of a real root
     * @param[in] n_index is zero-based index of the root
     * @return Returns value of the specified root.
     */
    T f_RealRoot(size_t n_index) const
    {
        _ASSERTE(n_index < 2 && n_index < n_real_root_num);
        return p_real_root[n_index];
    }

    /**
     * @brief evaluates the equation for a given argument
     * @param[in] f_x is value of the argument \f$x\f$
     * @return Returns value of \f$ax^2 + bx + c\f$.
     */
    T operator ()(T f_x) const
    {
        T f_x2 = f_x * f_x;
        return f_x2 * a + f_x * b + c;
    }
};
The code is horrible, and I hate all the thresholds. But for random equations with roots in the [-100, 100] interval, this is not so bad:
root response precision 1e-100: 6315 cases
root response precision 1e-19: 2 cases
root response precision 1e-17: 2 cases
root response precision 1e-16: 6 cases
root response precision 1e-15: 6333 cases
root response precision 1e-14: 3765 cases
root response precision 1e-13: 241 cases
root response precision 1e-12: 3 cases
2-root solution precision 1e-100: 5353 cases
2-root solution precision 1e-19: 656 cases
2-root solution precision 1e-18: 4481 cases
2-root solution precision 1e-17: 2312 cases
2-root solution precision 1e-16: 455 cases
2-root solution precision 1e-15: 68 cases
2-root solution precision 1e-14: 7 cases
2-root solution precision 1e-13: 2 cases
1-root solution precision 1e-100: 3022 cases
1-root solution precision 1e-19: 38 cases
1-root solution precision 1e-18: 197 cases
1-root solution precision 1e-17: 68 cases
1-root solution precision 1e-16: 7 cases
1-root solution precision 1e-15: 1 cases
Note that this precision is relative to the magnitude of the coefficients, which is typically in the 10^6 range (so finally the precision is far from perfect, but probably mostly usable). Without the thresholds, however, it is near to useless.
I have tried using multiple-precision arithmetic, which generally works well, but tends to reject many of the roots simply because the coefficients of the polynomial are not multiple precision and some polynomials cannot be exactly represented (if there is a double root in a 2nd degree polynomial, it mostly either splits it into two roots (which I wouldn't mind) or says that there is no root whatsoever). If I want to recover perhaps even slightly imprecise roots, my code gets complicated and full of thresholds.
So far, I've tried using CCmath, but either I can't use it correctly, or the precision is really bad. Also, it uses iterative (not closed-form) solver in plrt().
I have tried using GNU scientific library gsl_poly_solve_quadratic() but that seems to be a naive approach, and not very numerically stable.
Using std::complex numbers naively also turned out to be a really bad idea, as both the precision and speed can be bad (especially with cubic / quartic equations where the code is heavy with transcendental functions).
Is recovering the roots as complex numbers the only way to go? Then no roots are missed and the user can select how precise the roots need to be (and thus ignore small imaginary components in less precise roots).
This isn't really answering your question but I think you can improve on what you've got since you currently have a 'loss of significance' problem when b^2 >> ac. In such cases, you end up with a formula along the lines of (-b + (b + eps))/(2 * a) where the cancellation of the b's can lose many significant figures from eps.
The correct way of handling this is to use the 'normal' equation for roots of a quadratic for one root and the lesser known 'alternative' or 'upside down' equation for the other root. Which way round you take them depends on the sign of _b.
A change to your code along the lines of the following should reduce the errors resulting from this.
if( _b > 0 ) {
    p_real_root[i] = T((-_b - f_disc) / (2 * _a));
    p_real_root[1 - i] = T((2 * _c) / (-_b - f_disc));
} else {
    p_real_root[i] = T((2 * _c) / (-_b + f_disc));
    p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
}
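The same idea as a self-contained sketch, outside the CQuadraticEq class. The function name and signature are hypothetical; it assumes a != 0 and uses an exact comparison on the discriminant rather than the thresholds from the question. One root comes from the 'normal' formula, with the sign chosen so that no cancellation occurs, and the other from the product of the roots, c / a.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper: solves a*x^2 + b*x + c = 0 for real roots, a != 0.
// Returns the number of real roots and stores them in ascending order.
int solve_quadratic_stable(double a, double b, double c, double roots[2])
{
    assert(a != 0.0);
    const double disc = b * b - 4.0 * a * c;
    if (disc < 0.0)
        return 0; // two complex roots
    if (disc == 0.0) {
        roots[0] = -b / (2.0 * a);
        return 1; // double root
    }
    // b and copysign(sqrt(disc), b) have the same sign, so their sum cannot cancel.
    const double q = -0.5 * (b + std::copysign(std::sqrt(disc), b));
    roots[0] = q / a; // 'normal' formula
    roots[1] = c / q; // 'upside down' formula
    if (roots[0] > roots[1])
        std::swap(roots[0], roots[1]);
    return 2;
}
```

For x^2 - 1e8 x + 1 = 0 the textbook formula loses most of the digits of the small root near 1e-8, while this version keeps it close to full double precision. It does not address the other concerns in the question, such as nearly double roots.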