Difference between FMA and naive a*b+c? - ieee-754

In the BSD Library Functions Manual of FMA(3), it says "These functions compute x * y + z."
So what's the difference between FMA and naive code that computes x * y + z? And why does FMA perform better in most cases?

[ I don't have enough karma to make a comment; adding another answer seems to be the only possibility. ]
Eric's answer covers everything well, but a caveat: there are times when using fma(a, b, c) in place of a*b+c can cause difficult to diagnose problems.
Consider
x = sqrt(a*a - b*b);
If it is replaced by
x = sqrt(fma(a, a, -b*b));
there are values of a and b for which the argument to the sqrt function may be negative even if |a|>=|b|. In particular, this will occur if |a|=|b| and the infinitely precise product a*a is less than the rounded value of a*a. This follows from the fact that the rounding error in computing a*a is given by fma(a, a, -a*a).

a*b+c produces a result as if the computation were:
Calculate the infinitely precise product of a and b.
Round that product to the floating-point format being used.
Calculate the infinitely precise sum of that result and c.
Round that sum to the floating-point format being used.
fma(a, b, c) produces a result as if the computation were:
Calculate the infinitely precise product of a and b.
Calculate the infinitely precise sum of that product and c.
Round that sum to the floating-point format being used.
So it skips the step of rounding the intermediate product to the floating-point format.
On a processor with an FMA instruction, a fused multiply-add may be faster because it is one floating-point instruction instead of two, and hardware engineers can often design the processor to do it efficiently. On a processor without an FMA instruction, a fused multiply-add may be slower because the software has to use extra instructions to maintain the information necessary to get the required result.

What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).
I gather there are linear time algorithms for taking various notions of multiplicative inverse of my divisor. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my dividend, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R" which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and unclear how much overhead memory, etc, will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is
1. Estimate the quotient q of x/y
2. Compute the corresponding reduction r = x - qy
3. Optionally adjust the quotient so that the reduction r is in some desired interval
4. If r is too big, then repeat with r in place of x.
The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
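As a rough illustration of that loop, here is a sketch on 64-bit machine integers rather than bignums. The names and the choice of keeping 16 leading bits of y are my own assumptions. The quotient estimate divides the leading part of the current remainder by a slightly over-estimated leading part of y, so it never overshoots and no downward adjustment is needed.

```c
#include <stdint.h>

/* Number of significant bits in v (0 for v == 0). */
static unsigned bit_length(uint64_t v)
{
    unsigned n = 0;
    while (v) { ++n; v >>= 1; }
    return n;
}

/* Returns x / y and stores x % y in *rem; y must be nonzero. */
uint64_t divide_by_estimates(uint64_t x, uint64_t y, uint64_t *rem)
{
    unsigned len = bit_length(y);
    unsigned s = len > 16 ? len - 16 : 0; /* drop all but the top 16 bits of y */
    uint64_t y_top = (y >> s) + 1;        /* strictly greater than y / 2^s */
    uint64_t q = 0, r = x;

    while (r >= y) {                      /* step 4: r is still too big */
        uint64_t est = (r >> s) / y_top;  /* step 1: estimate, never above r / y */
        if (est == 0) est = 1;            /* r >= y, so one more y always fits */
        q += est;                         /* the quotient is the sum of the estimates */
        r -= est * y;                     /* step 2: reduction */
    }
    *rem = r;
    return q;
}
```

Calling divide_by_estimates(x, y, &r) should give the same quotient and remainder as x / y and x % y; the loop just takes more than one pass when the estimate falls short.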
The divide and conquer approach estimates the quotient of x/y by computing x'/y' where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get best results if x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
1. Estimate the inverse of y with m = floor(2^k / y)
2. Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.
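A toy instance of the multiply-by-inverse method on 32-bit operands, with i = 0 and j = k = 32, might look like the sketch below. The function name is hypothetical, and the reciprocal m is obtained with an ordinary hardware division purely for illustration; a bignum implementation would compute it some other way, for example with Newton's algorithm.

```c
#include <stdint.h>

/* Returns x / y and stores x % y in *rem; y must be nonzero. */
uint32_t divide_by_reciprocal(uint32_t x, uint32_t y, uint32_t *rem)
{
    uint64_t m = (UINT64_C(1) << 32) / y;  /* m = floor(2^32 / y) */
    uint64_t q = ((uint64_t)x * m) >> 32;  /* q = floor(x * m / 2^32), never above x / y */
    uint64_t r = x - q * y;                /* reduction for this estimate */

    while (r >= y) {                       /* the estimate can be slightly low */
        ++q;
        r -= y;
    }
    *rem = (uint32_t)r;
    return (uint32_t)q;
}
```

Because m is rounded down, the estimate is never too large, and with these sizes the correction loop should only need a step or so.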
I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. In particular, take a look at the approximation divider and the two links there. One is my fixed-point divider, and the others are fast multiplication algorithms (such as Karatsuba and Schönhage-Strassen on NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.
I'm not sure the inverse multiplicand is the way to go.
It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to obtain the bigint inverse can be larger than for the standard division itself, but as I am not familiar with it I could be wrong.
The most common divider I have seen in implementations is Newton–Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually rely on multiplication, which determines their speed.
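To show why multiplication dominates such dividers, here is a minimal Newton–Raphson reciprocal sketch on plain doubles (not bignums, and not the linked divider). The function name, the linear starting estimate 48/17 - 32/17 * f, and the fixed count of four iterations are assumptions chosen for this illustration.

```c
#include <math.h>

/* Approximates 1 / y for finite y > 0 using only multiplies and subtracts. */
double newton_reciprocal(double y)
{
    int e;
    double f = frexp(y, &e);                    /* y = f * 2^e with 0.5 <= f < 1 */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * f; /* starting guess for 1 / f */

    for (int i = 0; i < 4; ++i)
        x = x * (2.0 - f * x);                  /* each step roughly squares the error */

    return ldexp(x, -e);                        /* undo the scaling: 1 / y = (1 / f) * 2^-e */
}
```

A division a / b then becomes a * newton_reciprocal(b), so each iteration costs two multiplications, and for bignums the multiplication algorithm sets the overall speed.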
For small enough numbers, binary long division or division with a 32/64-bit digit base is usually fast enough, if not the fastest, because both have small overhead. Let n be the max value processed (not the number of digits!).
Binary division example:
Its complexity is O(log32(n) * log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, sub, add, bitshift. Each of those operations can be done in log32(n), and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
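template <DWORD N> void uint<N>::div(uint &c, uint &d, uint a, uint b)
{
    int i, j, sh;
    sh = 0; c = DWORD(0); d = 1;
    sh = a.bits() - b.bits();           // how far b must be shifted to line up with a
    if (sh < 0) sh = 0; else { b <<= sh; d <<= sh; }
    for (;;)
    {
        j = geq(a, b);                  // 0: a < b, 1: a > b, 2: a == b
        if (j)
        {
            c += d;                     // set the current quotient bit
            sub(a, a, b);               // a -= b
            if (j == 2) break;          // remainder is zero, nothing left to do
        }
        if (!sh) break;
        b >>= 1; d >>= 1; sh--;         // move on to the next lower bit
    }
    d = a;                              // what is left of a is the remainder
}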
N is the number of 32-bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n)=N)
It returns 0 for a < b, 1 for a > b, 2 for a == b
sub(c,a,b) is c = a - b
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shift).
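For readers without the bigint template at hand, the same shift-and-subtract idea can be tried on a single 64-bit word. This is only a sketch with a made-up function name, not the template code above.

```c
#include <stdint.h>

/* Returns a / b and stores a % b in *rem; b must be nonzero. */
uint64_t shift_subtract_div(uint64_t a, uint64_t b, uint64_t *rem)
{
    uint64_t q = 0, bit = 1;

    /* Line b up under the top bits of a; the test keeps b << 1 from overflowing. */
    while (b <= (a >> 1)) {
        b <<= 1;
        bit <<= 1;
    }

    /* Walk back down one bit at a time: compare, subtract, set the quotient bit. */
    while (bit) {
        if (a >= b) {
            a -= b;
            q |= bit;
        }
        b >>= 1;
        bit >>= 1;
    }
    *rem = a;
    return q;
}
```

Each pass does one compare, at most one subtract, and two shifts, which is where the "bits times cost per basic operation" estimate comes from.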
If you use digits with a big base like 2^32 (ALU blocks), then you can rewrite the whole thing in a polynomial-like style using the built-in 32-bit ALU operations.
This is usually even faster than binary long division. The idea is to process each DWORD as a single digit, or to recursively halve the bit width of the arithmetic used until it fits the CPU's capabilities.
See division by half-bitwidth arithmetics.
On top of all that, while computing with bignums:
If you have optimized basic operations, the complexity can drop even further, as sub-results get smaller with iterations (changing the complexity of the basic operations). A nice example of that is NTT-based multiplication.
The overhead can mess things up.
Because of this, the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit count in use to get the maximum performance, and optimize what you can.

IEEE-754 floating point: Divide first or multiply first for best precision?

What's better if I want to preserve as much precision as possible in a calculation with IEEE-754 floating point values:
a = b * c / d
or
a = b / d * c
Is there a difference? If there is, does it depend on the magnitudes of the input values? And, if magnitude matters, how is the best ordering determined when general magnitudes of the values are known?
It depends on the magnitude of the values. Obviously, if one divides by zero, all bets are off, but if a multiplication or division results in a denormal, subsequent operations can lose precision.
You may find it useful to study Goldberg's seminal paper What Every Computer Scientist Should Know About Floating-Point Arithmetic which will explain things far better than any answer you're likely to receive here. (Goldberg was one of the original authors of IEEE-754.)
Assuming that none of the operations would yield an overflow or an underflow, and your input values have uniformly distributed significands, then this is equivalent. Well, I suppose that to have a rigorous proof, one should do an exhaustive test (probably not possible in practice for double precision since there are 2^156 inputs), but if there is a difference in the average error, then it is tiny. I could try in low precisions with Sipe.
In any case, in the absence of overflow/underflow, only the exact values of the significands matter, not the exponents.
However, if the result a is added to (or subtracted from) another expression and not reused, then starting with the division may be more interesting, since you can group the multiplication with the following addition by using an FMA (thus with a single rounding).

What is the numerical stability of std::pow() compared to iterated multiplication?

What sort of stability issues arise or are resolved by using std::pow()?
Will it be more stable (or faster, or at all different) in general to implement a simple function to perform log(n) iterated multiplies if the exponent is known to be an integer?
How does std::sqrt(x) compare, stability-wise, to something of the form std::pow(x, k/2)? Would it make sense to choose the method preferred for the above to raise to an integer power, then multiply in a square root, or should I assume that std::pow() is fast and accurate to machine precision for this? If k = 1, is there a difference from std::sqrt()?
How would std::pow(x, k/2) or the method above compare, stability-wise, to an integer exponentiation of std::sqrt(x)?
And as a bonus, what are the speed differences likely to be?
Will it be more stable (or faster, or at all different) in general to implement a simple function to perform log(n) iterated multiplies if the exponent is known to be an integer?
The result of exponentiation by squaring for integer exponents is in general less accurate than pow, but both are stable in the sense that close inputs produce close results. You can expect exponentiation by squaring to introduce 0.5 ULP of relative error per multiplication (for instance, 1 ULP of error for computing x^3 as x * x * x).
When the second argument n is statically known to be 2, then by all means implement x^n as x * x. In that case it is faster and more accurate than any possible alternative.
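For reference, exponentiation by squaring for a non-negative integer exponent can be sketched as follows. The function name is my own, and this is the plain textbook loop, not what any particular pow implementation does.

```c
/* Computes x^n with about log2(n) squarings plus one multiply per set bit of n. */
double pow_by_squaring(double x, unsigned n)
{
    double result = 1.0;
    while (n) {
        if (n & 1u)
            result *= x;   /* this bit of the exponent contributes x^(2^k) */
        n >>= 1;
        if (n)
            x *= x;        /* square only if more bits remain */
    }
    return result;
}
```

Every multiplication here rounds, which is where the per-multiplication error mentioned above accumulates; for n = 2 the loop reduces to a single x * x.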
How does std::sqrt(x) compare, stability-wise, to something of the form std::pow(x, k/2)
First, the accuracy of sqrt cannot be beat for an IEEE 754 implementation, because sqrt is one of the basic operations that this standard mandates to be as accurate as possible.
But you are not asking about sqrt, you are asking (I think) about <computation of x^n> * sqrt(x) as opposed to pow(x, n + 0.5). Again, in general, for a quality implementation of pow, you can expect pow(x, n + 0.5) to be more accurate than the alternatives. Although sqrt(x) would be computed to 0.5 ULP, the multiplication introduces its own approximation of up to 0.5 ULP, and all in all, it is better to obtain the result you are interested in in a single call to a well-implemented function. A quality implementation of pow will give you 1 ULP of accuracy for its result, and the best implementations will “guarantee” 0.5 ULP.
And as a bonus, what are the speed differences likely to be?
If you know in advance that the exponent is going to be a small integer or multiple of 0.5, then you have information that the implementer of pow did not have, so you can beat them by at least the cost of the test to determine that the second argument is a small integer. Plus, the implementer of a quality implementation is aiming for a more accurate result than simple exponentiation by squaring provides. On the other hand, the implementer of pow can use extremely sophisticated techniques to minimize the average execution time despite the better accuracy: see for instance CRlibm's implementation. I put the verb “guarantee” above inside quotes when talking about the best implementations of pow because pow is one function for which CRlibm's 0.5 ULP accuracy guarantee is only “with astronomical probability”.

Should I combine multiplication and division steps when working with floating point values?

I am aware of the precision problems in floats and doubles, which is why I am asking this:
If I have a formula such as: (a/PI)*180.0 (where PI is a constant)
Should I combine the division and multiplication, so I can use only one division: a/0.017453292519943295769236, in order to avoid loss of precision ?
Does this make it more precise when it has fewer steps to calculate the result?
Short answer
Yes, you should in general combine as many multiplications and divisions by constants as possible into one operation. It is (in general(*)) faster and more accurate at the same time.
Neither π nor π/180 nor their inverses are representable exactly as floating-point. For this reason, the computation will involve at least one approximate constant (in addition to the approximation of each of the operations involved).
Because two operations introduce one approximation each, it can be expected to be more accurate to do the whole computation in one operation.
In the case at hand, is division or multiplication better?
Apart from that, it is a question of “luck” whether the relative accuracy to which π/180 can be represented in the floating-point format is better or worse than that of 180/π.
My compiler provides additional precision with the long double type, so I am able to use it as reference for answering this question for double:
~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L
#include <stdio.h>
int main() {
long double heop = 180.L / PIL;
long double pohe = PIL / 180.L;
printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17
In usual programming practice, one wouldn't bother and simply multiply by (the floating-point representation of) 180/π, because multiplication is so much faster than division.
As it turns out, in the case of the binary64 floating-point type that double almost always maps to, π/180 can be represented with better relative accuracy than 180/π, so π/180 is the constant one should use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error would be approximately the sum of the relative error of the constant (1.688893e-17) and of the relative error of the division (which will depend on the value of a but never be more than 2^-53).
Alternative methods for faster and more accurate results
Note that division is so expensive that you could get an even more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180/π, and heop2 the best double approximation of 180/π - heop1. Then the best value for the result can be computed as:
double r = fma(a, heop1, a * heop2);
The fact that the above is the absolute best possible double approximation to the real computation is a theorem (in fact, it is a theorem with exceptions. The details can be found in the “Handbook of Floating-Point Arithmetic”). But even when the real constant you want to multiply a double by in order to get a double result is one of the exceptions to the theorem, the above computation is still clearly very accurate and only differs from the best double approximation for a few exceptional values of a.
If, like mine, your compiler provides more precision for long double than for double, you can also use one long double multiplication:
// this is more accurate than double division:
double r = (double)((long double) a * 57.295779513082320876798L);
This is not as good as the solution based on fma, but it is good enough that for most values of a, it produces the optimal double approximation to the real computation.
A counter-example to the general claim that operations should be grouped as one
(*) The claim that it is better to group constants is only statistically true for most constants.
If you happened to wish to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001, then by DBL_MIN, and the end result (which can be a normalized number if a is larger than 1000000 or so) would be more precise than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy when representing 0.0000001 * DBL_MIN as a single double value is much worse than the accuracy for representing 0.0000001.

Is the floating point implementation of exp() function equivalent to a truncated Taylor series expansion?

Is the floating point implementation of exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order? One possible source of the error we should keep in mind is the finiteness of the number of bits to represent the answer
Is the floating point implementation of exp() function in cmath equivalent to a truncated Taylor series expansion of a very high order?
Equivalent to? Yes. That's because any decent implementation of exp() has an error of half an ULP (unit in the last place) or so. Ignoring problems with finite precision arithmetic, one can always construct a truncated Taylor series that does the same.
However, no decent implementation of exp() will use a Taylor expansion. That would be very very slow, and wouldn't achieve the desired accuracy. It would be a downright stupid implementation. Much better is to use the fact that there is a strong relation between 2^x and e^x and the fact that 2^x is fairly easy to compute given the almost universal power of 2 representation of floating point numbers.
Just an example how you could calculate exp (x):
If x is quite large then the result is +inf. If x is quite small then the result is 0.
Let k = round (x / ln 2). Then exp (x) = 2^k * exp (x - k ln 2). 2^k is very easy to calculate. A small problem is to calculate x - k ln 2 without any rounding error. That's quite easy: Let L1 = ln 2 rounded to say 35 bits, and L2 = ln 2 - L1. k is a smallish integer, so k * L1 has no rounding error, nor has x - k * L1; then we subtract k * L2 which is small and therefore has little rounding error.
To do this quicker (without a division), we calculate k = round (x * (1 / ln 2)). And we check whether x is close to zero, so the whole calculation isn't needed. Anyway, we now calculate exp (x) where the result is between sqrt (1/2) and sqrt (2).
You could calculate exp (x) using a Taylor polynomial. Instead you would probably use a Chebyshev polynomial minimising the cutoff error with a much lower degree. With some care you can find a polynomial with a cutoff error substantially less than the lowest bit of the result.
It depends on the implementation of the compiler, C runtime and the processor. However, whoever computes the exponent is unlikely to use the Taylor expansion since better methods exist.
As per glibc, it may use its own implementation which says this in the comment (from sysdeps/ieee754/dbl-64/e_exp.c):
/* An ultimate exp routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of e^x */
Or it may use hardware supported processor instructions for floating point computations, as with x86 FPU. In both cases you are likely to get a correctly rounded value with full precision.
That depends on which C library implementation you're using. In the very popular glibc, it isn't.