std::remquo purpose and usage? - c++

What is the purpose of the std::remquo function? What is an example of when you would use it instead of the regular std::remainder function?

Suppose I am implementing a sine function. A typical way to implement sine is to design some polynomial s such that s(x) approximates sine of x, but the polynomial is only good for -π/4 <= x <= π/4. Outside of that interval, the polynomial deviates from sine(x) and is a bad approximation. (Making the polynomial good over a larger interval requires a polynomial with more terms, and, at some point, the polynomial becomes larger than is useful.) Commonly, we would also design a polynomial c such that c(x) approximates the cosine of x, in a similar interval.
The remquo function helps us use these polynomials to implement sine. We can use “r = remquo(x, pi/2, &q)” and use q to determine which portion of the circle x is in. (Note that sine is periodic with period 2π, so we only need to know the low few bits of the quotient. The higher bits just indicate x has wrapped around the circle and is repeating sine values.) Depending on which part of the circle x is in, the routine will return s(r), -s(r), c(r), or -c(r) for the sine of x.
There are embellishments, of course, but that is the basic idea. It only works for values of x that are small, not more than a few multiples of 2π. That is because each time x doubles, another bit of the divisor moves into the calculation of the exact result. However, we cannot pass π/2 exactly to remquo, because the precision of the double type is limited. So, as x grows, the error grows.
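For illustration, here is a minimal sketch of that reduction and quadrant selection; poly_sin, poly_cos and my_sin are ad hoc names, and the two polynomials are crude low-degree stand-ins for s and c, not production-quality approximations:

#include <cmath>
#include <cstdio>

// Crude stand-ins for the polynomials s and c; a real libm would use
// carefully designed minimax fits valid on [-pi/4, pi/4].
static double poly_sin(double r) { return r - r*r*r/6.0 + r*r*r*r*r/120.0; }
static double poly_cos(double r) { return 1.0 - r*r/2.0 + r*r*r*r/24.0; }

static double my_sin(double x)
{
    const double pi = 3.14159265358979323846;
    int q;
    double r = std::remquo(x, pi / 2, &q);  // r in [-pi/4, pi/4], q holds the low bits of the quotient
    switch (q & 3) {                        // which quarter of the circle x falls in
        case 0:  return  poly_sin(r);
        case 1:  return  poly_cos(r);
        case 2:  return -poly_sin(r);
        default: return -poly_cos(r);
    }
}

int main()
{
    std::printf("%f %f\n", my_sin(2.0), std::sin(2.0));  // close for small arguments
}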

remquo first appeared in C99 before being adopted into C++; here is what the C99 Rationale says about it:
The remquo functions are intended for implementing argument reductions which can exploit a few low-order bits of the quotient. Note that x may be so large in magnitude relative to y that an exact representation of the quotient is not practical.

Related

Arbitrary precision gamma function

I'm implementing an arbitrary precision arithmetic library in C++ and I'm pretty much stuck when implementing the gamma function.
By repeatedly applying the recurrences gamma(x) = gamma(x - 1) * (x - 1) and gamma(x) = gamma(x + 1) / x, I can reduce any real value x to a number r in the range (1, 2].
However, I don't know how to evaluate gamma(r). For the Lanczos approximation (https://en.wikipedia.org/wiki/Lanczos_approximation), I need precomputed coefficients p, which involve a factorial of a non-integer value (?!) and which I don't know how to compute dynamically. Precomputing fixed values for p wouldn't make much sense in an arbitrary-precision library.
Are there any algorithms that compute gamma(r) in a reasonable amount of time with arbitrary precision? Thanks for your help.
Spouge's approximation is similar to Lanczos's approximation, but probably easier to use for arbitrary precision, as you can set the desired error.
Lanczos approximation doesn't seem too bad. What exactly is the problem?
The parts of the code that calculate p, the Chebyshev coefficients C, and (a + 1/2)! can be implemented as stateful objects, so that, for example, p(i) is computed from p(i-1), and the Chebyshev coefficient matrix is built up once and reused.
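For what it's worth, here is a minimal double-precision sketch of Spouge's approximation mentioned above; spouge_gamma is an ad hoc name and a = 12 is an arbitrary choice. An arbitrary-precision implementation would run the same loop with its own number type and a larger a:

#include <cmath>
#include <cstdio>

// Spouge: Gamma(z + 1) ~ (z + a)^(z + 1/2) * e^(-(z + a)) * (c_0 + sum_k c_k / (z + k))
static double spouge_gamma(double x, int a = 12)
{
    const double pi = 3.14159265358979323846;
    double z = x - 1.0;                    // Spouge approximates Gamma(z + 1)
    double sum = std::sqrt(2.0 * pi);      // c_0
    double fact = 1.0;                     // holds (k-1)! for the current k
    for (int k = 1; k < a; ++k) {
        double ck = std::pow(double(a - k), k - 0.5) * std::exp(double(a - k)) / fact;
        if (k % 2 == 0) ck = -ck;          // sign factor (-1)^(k-1)
        sum += ck / (z + k);
        fact *= k;                         // (k-1)! -> k!
    }
    return std::pow(z + a, z + 0.5) * std::exp(-(z + a)) * sum;
}

int main()
{
    std::printf("%.12f vs %.12f\n", spouge_gamma(1.5), std::tgamma(1.5));  // ~sqrt(pi)/2
}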

What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).
I gather there are linear time algorithms for taking various notions of multiplicative inverse of my dividend. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my divisor, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R" which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and unclear how much overhead memory, etc, will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is
1. Estimate the quotient q of x/y
2. Compute the corresponding reduction r = x - qy
3. Optionally adjust the quotient so that the reduction r is in some desired interval
4. If r is too big, then repeat with r in place of x.
The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers the cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
The divide and conquer approach estimates the quotient of x/y by computing x'/y' where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results if x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
Estimate the inverse of y with m = floor(2^k / y)
Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall correctly, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.
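As a concrete toy example of that estimate-and-adjust pattern, scaled down to single machine words and with arbitrary values (a bignum routine would do the same three steps with multi-word multiplies; here i = j = 0 in the formulas above):

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t x = 3123456789u;                 // dividend
    uint32_t y = 1000003u;                    // divisor
    const int k = 51;                         // chosen so x * m still fits in 64 bits

    uint64_t m = (uint64_t(1) << k) / y;      // m = floor(2^k / y), approximate inverse of y
    uint64_t q = (uint64_t(x) * m) >> k;      // estimated quotient, too small by at most 1 here
    uint64_t r = x - q * y;                   // corresponding reduction

    while (r >= y) { r -= y; ++q; }           // adjust so that 0 <= r < y

    std::printf("q=%llu r=%llu (check: %u %u)\n",
                (unsigned long long)q, (unsigned long long)r,
                (unsigned)(x / y), (unsigned)(x % y));
}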
I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. Especially take a look at the approximation divider and the two links there. One is my fixed-point divider, and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen via NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.
I'm not sure the multiplicative inverse is the way to go.
It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire the bigint inverse can be bigger than the standard division itself, but as I am not familiar with it I could be wrong.
The most common divider I have seen in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually rely on multiplication, which defines their speed.
For small enough numbers, plain long binary division and 32/64-bit digit-base division are usually fast enough, if not the fastest: they have little overhead. In what follows, let n be the maximum value processed (not the number of digits!).
Binary division example:
It is O(log32(n) * log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need a compare, a subtract, an add, and a bit shift. Each of those operations can be done in O(log32(n)), and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
template <DWORD N> void uint<N>::div(uint &c, uint &d, uint a, uint b)
{
    int j, sh;
    c = DWORD(0); d = 1;                    // c accumulates the quotient, d is the current quotient bit
    sh = a.bits() - b.bits();               // align the top bit of b with the top bit of a
    if (sh < 0) sh = 0; else { b <<= sh; d <<= sh; }
    for (;;)
    {
        j = geq(a, b);                      // compare: 0 -> a<b, 1 -> a>b, 2 -> a==b
        if (j)
        {
            c += d;                         // set the current quotient bit
            sub(a, a, b);                   // a -= b
            if (j == 2) break;              // a was equal to b, so the remainder is now zero
        }
        if (!sh) break;                     // all bit positions processed
        b >>= 1; d >>= 1; sh--;             // move one bit position lower
    }
    d = a;                                  // whatever is left of a is the remainder
}
N is the number of 32 bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n) = N word operations)
It returns 0 for a < b, 1 for a > b, 2 for a == b
sub(c,a,b) is c = a - b
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shifts).
If you use digits with a big base like 2^32 (ALU words), then you can rewrite the whole thing in a polynomial-like style using the built-in 32-bit ALU operations.
This is usually even faster than binary long division; the idea is to process each DWORD as a single digit, or to recursively halve the arithmetic width until it matches the CPU's capabilities.
See division by half-bitwidth arithmetic.
On top of all that, while computing with bignums:
If you have optimized basic operations, then the complexity can drop even further, because sub-results get smaller with the iterations (which changes the complexity of the basic operations). A nice example of that is NTT-based multiplication.
The overhead can mess things up.
Because of this the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds, use the faster approach for the bit counts actually in use, and optimize what you can.

Is the floating point implementation of exp() function equivalent to a truncated Taylor series expansion?

Is the floating point implementation of exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order? One possible source of error we should keep in mind is the finite number of bits available to represent the answer.
Is the floating point implementation of exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order?
Equivalent to? Yes. That's because any decent implementation of exp() has an error of half an ULP (unit of least precision) or so. Ignoring problems with finite precision arithmetic, one can always construct a truncated Taylor series that does the same.
However, no decent implementation of exp() will use a Taylor expansion. That would be very very slow, and wouldn't achieve the desired accuracy. It would be a downright stupid implementation. Much better is to use the fact that there is a strong relation between 2^x and e^x and the fact that 2^x is fairly easy to compute given the almost universal power-of-2 representation of floating point numbers.
Just an example of how you could calculate exp(x):
If x is large enough, the result is +inf. If x is negative enough, the result is 0.
Let k = round (x / ln 2). Then exp (x) = 2^k * exp (x - k ln 2). 2^k is very easy to calculate. A small problem is to calculate x - k ln 2 without any rounding error. That's quite easy: Let L1 = ln 2 rounded to say 35 bits, and L2 = ln 2 - L1. k is a smallish integer, so k * L1 has no rounding error, nor has x - k * L1; then we subtract k * L2 which is small and therefore has little rounding error.
To do this quicker (without a division), we calculate k = round (x * (1 / ln 2)). And we check whether x is close to zero, in which case the whole reduction isn't needed. Either way, we are now left with calculating the exponential of a reduced argument whose result is between sqrt (1/2) and sqrt (2).
You could calculate it using a Taylor polynomial, but instead you would probably use a Chebyshev polynomial that minimises the truncation error with a much lower degree. With some care you can find a polynomial whose truncation error is substantially less than the lowest bit of the result.
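A minimal sketch of that scheme (my_exp is an ad hoc name); a plain Taylor polynomial stands in for the minimax fit a real libm would use, so the accuracy is illustrative only, roughly 1e-7 relative rather than 0.5 ulp:

#include <cmath>
#include <cstdio>

static double my_exp(double x)
{
    if (x != x)     return x;          // NaN passes through
    if (x > 709.8)  return HUGE_VAL;   // overflows double (approximate threshold)
    if (x < -745.2) return 0.0;        // underflows to zero (approximate threshold)

    const double ln2 = 0.69314718055994530942;
    double k = std::nearbyint(x * (1.0 / ln2));   // k = round(x / ln 2), via a multiply

    // Split ln 2 into a part with at most 32 significant bits (so k * L1 is
    // exact for the small integer k) plus the remaining low part.
    const double L1 = std::ldexp(std::floor(std::ldexp(ln2, 32)), -32);
    const double L2 = ln2 - L1;
    double r = (x - k * L1) - k * L2;  // |r| <= ln(2)/2, so exp(r) is in [sqrt(1/2), sqrt(2)]

    // Degree-6 Taylor polynomial for exp(r) on the reduced interval.
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24
             + r * (1.0/120 + r * (1.0/720))))));

    return std::ldexp(p, (int)k);      // multiply by 2^k by adjusting the exponent
}

int main()
{
    std::printf("%.15g %.15g\n", my_exp(2.5), std::exp(2.5));
}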
It depends on the implementation of the compiler, the C runtime and the processor. However, whoever computes the exponential is unlikely to use a Taylor expansion since better methods exist.
As per glibc, it may use its own implementation which says this in the comment (from sysdeps/ieee754/dbl-64/e_exp.c):
/* An ultimate exp routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of e^x */
Or it may use hardware supported processor instructions for floating point computations, as with x86 FPU. In both cases you are likely to get a correctly rounded value with full precision.
That depends on which C library implementation you're using. In the very popular glibc, it isn't.

Efficiently dividing a double by a power of 2

I'm implementing a coherent noise function, and was surprised to find that using gradient noise (i.e. Perlin noise) is actually slightly faster than value noise. Profiling shows that the reason for this is the division needed to convert the random int value into a double of range -1.0 to 1.0:
static double noiseValueDouble(int seed, int x, int y, int z) {
    return 1.0 - ((double)noiseValueInt(seed, x, y, z) / 1073741824.0);
}
Gradient noise requires a few multiplies more, but due to the precomputed gradient table uses the noiseValueInt directly to compute an index into the table, and doesn't require any division. So my question is, how could I make the above division more efficient, considering that the division is by a power of 2 (2^30).
Theoretically all that would need to be done is to subtract 30 from the double's exponent, but doing that by brute force (i.e. bit manipulation) would lead to all sorts of corner cases (INF, NAN, exponent overflow, etc.). An x86 assembly solution would be ok.
Declare a variable (or constant) with the inverse value and multiply by it, effectively changing the division to a multiplication:
static const double div_2_pow_30 = 1.0 / 1073741824.0;
Another way (utilizing the property that the number is a power of 2) is to modify the exponent with bit operations. Doing that makes the code dependent on doubles being stored in the IEEE format, which may be less portable.
You can modify the exponent directly using the functions frexp and ldexp. I'm not sure if this would be faster though.
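For concreteness, a small sketch combining both suggestions; noiseValueInt is only declared here, standing in for the existing hash from the question, and the constant name is ad hoc:

#include <cmath>

// Declaration only; the real noiseValueInt comes from the questioner's code.
int noiseValueInt(int seed, int x, int y, int z);

// 2^-30 is exactly representable, so multiplying by it gives the same result
// as dividing by 2^30, without the division latency.
static const double inv_2_pow_30 = 1.0 / 1073741824.0;

static double noiseValueDouble(int seed, int x, int y, int z) {
    return 1.0 - (double)noiseValueInt(seed, x, y, z) * inv_2_pow_30;
}

// Equivalent alternative: scale the exponent directly; ldexp(v, -30) is v * 2^-30.
static double noiseValueDouble2(int seed, int x, int y, int z) {
    return 1.0 - std::ldexp((double)noiseValueInt(seed, x, y, z), -30);
}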
I'm not sure you can trust the profiling in here. For smaller, faster functions, the effect of the profiling code itself starts to skew the results.
Run noiseValueDouble and the corresponding alternative in a loop to get better numbers.
An x86 assembler solution is a bit-fiddling solution; you may as well do the bit fiddling in C. Fast power-of-two division instructions (bit shifts) only exist for integers.
If you really want to use special instructions, MMX it up or something.
I tried compiling this with gcc:
double divide(int i) {
    return 1.0 - (double)i / 1073741824.0;
}
With -O3 it is compiled to an FMULS instruction; with -O3 -mfpmath=sse -march=core2 it uses the SSE instruction set and emits MULSD. I have no idea which is the fastest, but the function call itself is probably orders of magnitude slower than the actual division.

How to find the point x from which the integral of a function (from that point to infinity) becomes less than some eps?

So we have some function like pow(e, (-a*x)) / sqrt(x), where a and e are const floats. We have some float eps = pow(10, -4). We need to find out starting from which x the integral of that function from that x to infinity is less than eps. We cannot use special built-in integration functions, just standard math operators. The point is to achieve maximum evaluation speed.
If you perform the u-substitution u=sqrt(x), your integral will become 2 * integral e^(-au^2) du. With one more substitution you can reduce it to a standard normal. Once you have it in standard normal form, this reduces to calculating erf(x). The substitutions can be done abstractly for any a, and the results hardcoded for simplicity and speed.
To calculate this integral you need to calculate the error function. If you use gcc you can find the erf(...) function in math.h, but it doesn't take parameters to control the precision. You can also evaluate the error function's value yourself using a Taylor series; with the given eps it is possible to calculate the exact number of terms of the series needed.
Hmm, no one seems to understand the question. The question is: given some function f, find the smallest x such that the integral of f from x to +inf is less than eps. That's the question. So basically we try x = 0, then x = 0.1, then x = 0.2 ... until the integral, for all intents and purposes, vanishes.
For example, given the bell curve for IQ of programmers on SO, at what IQ is the cumulative intelligence of programmers with higher IQ vanishingly small? If we pick x = 100 we know at least half the programmers will have a higher IQ than 100, if we pick 120, how many are left? What about 200? If we have 10,000 programmers here and eps = 1/10000 we're basically asking what IQ the top 0.01% of SO contributors have.
The question is: what is the most efficient way to find this number, given that nothing is known about f other than that it decreases fast enough that its integral from x to infinity approaches zero as x approaches infinity?
The general answer is: you must start with a guess of some kind. If the result is too big, double your guess, and keep going until you satisfy the requirement. Then go back to the last value you had (which didn't satisfy it) and do a binary chop to find the smallest x satisfying the requirement.
Making a good initial guess is hard. One way is to use a Chebyshev approximation of the function, integrate it analytically, solve the problem with the resulting polynomial, and use the solution as your starting guess. The assumption is that all functions look like polynomials of sufficiently high order in any given range.
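Putting the pieces above together for the specific integrand in the question: the u-substitution gives the tail in closed form via erfc, and the doubling-plus-binary-chop search then finds the threshold. A minimal sketch (tail is an ad hoc name, and the value of a is arbitrary):

#include <cmath>
#include <cstdio>

// Closed form for the tail of exp(-a*t)/sqrt(t): with u = sqrt(t),
//   integral_x^inf exp(-a*t)/sqrt(t) dt = sqrt(pi/a) * erfc(sqrt(a*x)).
static double tail(double a, double x)
{
    const double pi = 3.14159265358979323846;
    return std::sqrt(pi / a) * std::erfc(std::sqrt(a * x));
}

int main()
{
    const double a = 2.0, eps = 1e-4;   // arbitrary example values

    // Double the guess until the tail drops below eps.
    double lo = 0.0, hi = 1.0;
    while (tail(a, hi) >= eps) { lo = hi; hi *= 2.0; }

    // Binary chop between the last failing and the first succeeding guess.
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        if (tail(a, mid) >= eps) lo = mid; else hi = mid;
    }
    std::printf("integral from x to infinity < %g for x >= %.10g\n", eps, hi);
}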