Efficiently dividing a double by a power of 2 - c++

I'm implementing a coherent noise function, and was surprised to find that using gradient noise (i.e. Perlin noise) is actually slightly faster than value noise. Profiling shows that the reason for this is the division needed to convert the random int value into a double of range -1.0 to 1.0:
static double noiseValueDouble(int seed, int x, int y, int z) {
    return 1.0 - ((double)noiseValueInt(seed, x, y, z) / 1073741824.0);
}
Gradient noise requires a few more multiplies, but thanks to the precomputed gradient table it uses noiseValueInt directly to compute an index into the table and doesn't require any division. So my question is: how can I make the above division more efficient, given that the divisor is a power of 2 (2^30)?
Theoretically all that would need to be done is to subtract 30 from the double's exponent, but doing that by brute force (i.e. bit manipulation) would lead to all sorts of corner cases (INF, NAN, exponent overflow, etc.). An x86 assembly solution would be ok.

Declare a variable (or constant) with the inverse value and multiply by it, effectively changing the division to a multiplication:
static const double div_2_pow_30 = 1.0 / 1073741824.0;
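The noise function from the question would then become (a sketch reusing the constant above; since 2^-30 is exactly representable as a double, the reciprocal is exact and the result is bit-identical to the division):
static double noiseValueDouble(int seed, int x, int y, int z) {
    // Multiply by the precomputed reciprocal instead of dividing.
    return 1.0 - (double)noiseValueInt(seed, x, y, z) * div_2_pow_30;
}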
Another way (utilizing the property that the number is a power of 2) is to modify the exponent with bit operations. Doing that will make the code dependent on doubles being stored using the IEEE standard, which may be less portable.
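A non-portable sketch of that idea (it assumes IEEE-754 binary64 and a normal, nonzero input whose exponent won't underflow, and deliberately skips the INF/NAN/overflow corner cases the question mentions):
#include <cstdint>
#include <cstring>

double div_by_2_pow_30(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);  // type-pun without undefined behavior
    bits -= std::uint64_t(30) << 52;      // exponent field occupies bits 52..62
    std::memcpy(&v, &bits, sizeof v);
    return v;
}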

You can modify the exponent directly using the functions frexp and ldexp. I'm not sure if this would be faster, though.
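A sketch of the same noise function using std::ldexp, which handles the corner cases internally:
#include <cmath>

static double noiseValueDouble(int seed, int x, int y, int z) {
    // std::ldexp(v, -30) computes v * 2^-30 by adjusting the exponent.
    return 1.0 - std::ldexp((double)noiseValueInt(seed, x, y, z), -30);
}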

I'm not sure you can trust the profiling here. For small, fast functions, the effect of the profiling code itself starts to skew the results.
Run noiseValueDouble and the corresponding alternative in a loop to get better numbers.
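For instance, a rough harness along those lines (a sketch; the iteration count and arguments are arbitrary, and the sink accumulator keeps the compiler from eliminating the calls):
#include <chrono>
#include <cstdio>

// Assumes the noiseValueDouble from the question is in scope.
int main() {
    double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i)
        sink += noiseValueDouble(1234, i, i + 1, i + 2);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.3f s (sink=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(), sink);
}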
An x86 assembler solution is a bit-fiddling solution, so you may as well do the bit fiddling in C. Fast power-of-two division instructions (bit shifts) only exist for integers.
If you really want to use special instructions, MMX it up or something.

I tried compiling this with gcc:
double divide(int i) {
    return 1.0 - (double)i / 1073741824.0;
}
with -O3 it is compiled to an x87 FMUL instruction; with -O3 -mfpmath=sse -march=core2 it uses the SSE instruction set and encodes it as MULSD. I have no idea which is fastest, but the function call itself is probably orders of magnitude slower than the actual division.

Related

Optimization of float power of 2 division

Let's say I want to divide an unsigned int by 2 or 4 or 8, etc.
AFAIK the compiler replaces such divisions with shifts.
But can I expect that, instead of dividing a float by 128, it instead subtracts 7 from its exponent part?
What is the best practice to ensure that exponent subtraction is used instead of floating-point division?
If you are multiplying or dividing by a constant, a compiler of modest quality should optimize it. On many platforms, a hardware multiply instruction may be optimal.
For multiplying (or dividing) by a power of two, std::ldexp(x, p) multiplies x by 2^p, where p is an int (and divides if p is negated). I would not expect much benefit over simple multiplication on most platforms, as manual (software) exponent manipulation must include checks for overflow and underflow, so the resulting sequence of instructions is not likely to improve over a hardware multiply in most situations.
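For instance, both spellings for the divide-by-128 example above (a sketch; the reciprocal of a power of two is exactly representable, so the multiplication is exact):
#include <cmath>

float div128_mul(float x)   { return x * (1.0f / 128.0f); } // exact reciprocal
float div128_ldexp(float x) { return std::ldexp(x, -7); }   // exponent adjustment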

Stop double division before decimals (low precision, fast division; getting only the 'quotient')

Basically a performance-related question:
I want to get only the integer quotient from a double division, i.e. for example, for a division 88.3/12.7 = 6.9527559055118110236220472440945, I only want to get '6' as a result.
A possible implementation would of course be floor(x/y), but here the performance-intensive double division is done first, and then floor throws away most of the 'work' the division just did.
So basically I want a division with doubles which 'stops' before calculating all these decimal points and just gives me the correct integer result of the division, without rounding or truncating the initial double arguments. Does anyone know an elegant implementation for this (I searched for this topic but didn't find much)?
Another implementation I can imagine is:
int(x*1000)/int(y*1000)
Where instead of 1000, the needed 'precision' can be used. A very simple implementation would also be to simply subtract y from x until the result is smaller than zero. But yeah, I was wondering what would be the best way to do it.
Also, doing simply int(x)/int(y) is no option since it could easily result in wrong results.
By the way, I know this is probably again one of these 'micro-optimization' questions which deal with a matter that does not really matter on new machines, but well, I still am kinda curious on the subject! :-)
There is no way to stop earlier, and using integer division is potentially slower.
For example, on Skylake:
idiv r/m32      latency: 26-27   reciprocal throughput: 6
divsd xmm, xmm  latency: 13-14   reciprocal throughput: 4
So the double division is twice as fast and has a significantly better throughput. That is before you factor in the extra multiplications and extra cast.
On older µarchs, 32-bit integer division often has lower listed latency than double division, but the numbers varied more (division used to be more serial), with round divisors being faster for floats, while for integer division it's small results that are faster. This difference in characteristics can make it swing either way, depending on what you're dividing by what.
As you can see, it's dangerous in this case to optimize without a specific target in mind, but I imagine newer machines are a more likely target than older machines, which means the double division is more or less the best you can do anyway (unless other optimizations apply). Dividing single-precision floats is faster by itself, but incurs a conversion cost; adding the two latencies up (roughly 5 + 10), it actually ends up losing.
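In other words, the straightforward version is hard to beat on current hardware; a minimal sketch of it:
// Plain double division plus truncation; on modern x86 this compiles to
// divsd + cvttsd2si, which already beats idiv-based alternatives.
// Note it truncates toward zero; use std::floor(x / y) first if the
// quotient can be negative and floor semantics are needed.
int int_quotient(double x, double y) {
    return (int)(x / y); // e.g. 88.3 / 12.7 -> 6
}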

What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).
I gather there are linear time algorithms for taking various notions of multiplicative inverse of my dividend. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my divisor, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting the "inverse" approach mentioned above. On page 312 of "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R", which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and it is unclear how much memory overhead, etc., will be consumed until I code it up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is:
1. Estimate the quotient q of x/y.
2. Compute the corresponding reduction r = x - qy.
3. Optionally adjust the quotient so that the reduction r is in some desired interval.
4. If r is too big, repeat with r in place of x.
The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
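As a toy illustration of this loop with built-in integers (the crude top-bits estimate() below stands in for whatever fast approximation a real bignum library would use):
#include <cstdint>
#include <cstdio>

// Crude estimator: shift both operands down until the divisor fits in
// 16 bits, rounding the divisor up so the estimate never exceeds x/y.
static std::uint64_t estimate(std::uint64_t x, std::uint64_t y) {
    while (y > (1u << 16)) { x >>= 16; y = (y >> 16) + 1; }
    return x / y;
}

int main() {
    std::uint64_t x = 987654321012345ull, y = 20000000000000ull;
    std::uint64_t q = 0;
    while (x >= y) {
        std::uint64_t qe = estimate(x, y); // step 1: cheap estimate of x/y
        if (qe == 0) qe = 1;               // step 3: adjust (safe, since x >= y)
        x -= qe * y;                       // step 2: reduction r = x - qe*y
        q += qe;                           // the sum of all the qs is the quotient
    }                                      // step 4: repeat with r in place of x
    std::printf("q=%llu r=%llu\n",
                (unsigned long long)q, (unsigned long long)x);
}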
The divide-and-conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results if x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
1. Estimate the inverse of y with m = floor(2^k / y)
2. Estimate x/y with q = 2^(i+j-k) * floor(floor(x / 2^i) * m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.
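To make the formulas concrete, here is a toy instance with machine integers (the choices k = 16, i = j = 8 are arbitrary, not tuned by the rules of thumb above):
#include <cstdio>

int main() {
    long x = 12345, y = 100;
    const int k = 16, i = 8, j = 8;
    long m = (1L << k) / y;          // m = floor(2^k / y) = 655
    long q = ((x >> i) * m) >> j;    // 2^(i+j-k) = 1 here, so q = 122
    long r = x - q * y;              // r = 145, still >= y
    while (r >= y) { ++q; r -= y; }  // one adjustment: q = 123, r = 45
    std::printf("q=%ld r=%ld\n", q, r);
}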
I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. Especially take a look at the approximation divider and the two links there. One is my fixed-point divider and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen on NTT) with measurements, and a link to my very fast NTT implementation for a 32-bit base.
I'm not sure if the multiplicative inverse is the way to go.
It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions, the time and operations needed to acquire a bigint inverse can be bigger than the standard division itself, but as I am not familiar with it, I could be wrong.
The most common divider I have seen used in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually rely on multiplication, which defines their speed.
For small enough numbers, plain long binary division and 32/64-bit digit-base division are usually fast enough, if not fastest: they have small overhead. In what follows, let n be the max value processed (not the number of digits!).
Binary division example:
It is O(log32(n)·log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, subtract, add, and bit-shift. Each of those operations can be done in log32(n), and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
template <DWORD N> void uint<N>::div(uint &c, uint &d, uint a, uint b)
{
    int j, sh;
    c = DWORD(0); d = 1;             // c: quotient accumulator, d: current quotient bit
    sh = a.bits() - b.bits();        // align b's top bit with a's top bit
    if (sh < 0) sh = 0; else { b <<= sh; d <<= sh; }
    for (;;)
    {
        j = geq(a, b);               // 0: a < b, 1: a > b, 2: a == b
        if (j)
        {
            c += d;                  // set the current quotient bit
            sub(a, a, b);            // a -= b
            if (j == 2) break;       // a == b: remainder will be zero
        }
        if (!sh) break;              // lowest bit position processed
        b >>= 1; d >>= 1; sh--;      // move to the next lower bit
    }
    d = a;                           // the remainder is what is left of a
}
N is the number of 32-bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n) = N steps)
It returns 0 for a < b, 1 for a > b, 2 for a == b.
sub(c,a,b) is c = a - b
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shifts).
If you use digits with a big base like 2^32 (ALU words), then you can rewrite the whole thing in a polynomial-like style using the 32-bit built-in ALU operations.
This is usually even faster than binary long division; the idea is to process each DWORD as a single digit, or to recursively halve the used arithmetic until you hit the CPU's native capabilities.
See division by half-bitwidth arithmetics.
On top of all that, while computing with bignums:
If you have optimized basic operations, then the complexity can drop even further, as sub-results get smaller with the iterations (changing the complexity of the basic operations). A nice example of that are NTT-based multiplications.
The overhead can mess things up.
Because of this, the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit counts actually used, to get the maximum performance, and optimize what you can.

c++ sqrt guaranteed precision, upper/lower bound

I have to check an inequality containing square roots. To avoid incorrect results due to floating-point inaccuracy and rounding, I use std::nextafter() to get an upper/lower bound:
#include <cfloat> // DBL_MAX
#include <cmath> // std::nextafter, std::sqrt
double x = 42.0; //just an example number
double y = std::nextafter(std::sqrt(x), DBL_MAX);
a) Is y*y >= x guaranteed using GCC compiler?
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
c) Are there better ways to get upper/lower bounds?
Update:
I read this is not guaranteed by the C++ Standard, but should work according to IEEE-754. Will this work with the GCC compiler?
In general, floating-point operations will incur some ULP error. IEEE 754 requires that the results of most operations be correct to within 0.5 ULP, but errors can accumulate, which means a result may not be within one ULP of the exact result. There are limits to precision as well, so depending on the number of digits in the resulting values, you may also not be working with values of the same magnitudes. Transcendental functions are also somewhat notorious for introducing error into calculations.
However, if you're using GNU glibc, sqrt will be correct to within 0.5 ULP (correctly rounded), so your specific example would work (neglecting NaN, +/-0, and +/-Inf). It's probably better, though, to define some epsilon as your error tolerance and use that as your bound. For example,
bool gt(double a, double b, double eps) {
    return (a > b - eps);
}
Depending on the level of precision you need in calculations, you also may want to use long double instead.
So, to answer your questions...
a) Is y*y >= x guaranteed using GCC compiler?
Assuming you use GNU glibc or SSE2 intrinsics, yes.
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
Assuming you use GNU glibc and one operation, yes. Although some transcendentals are not guaranteed correctly rounded.
c) Are there better ways to get upper/lower bounds?
You need to know what your error tolerance in calculations is, and use that as an epsilon (which may be larger than one ULP).
For GCC this page suggests that it will work if you use the GCC builtin sqrt function __builtin_sqrt.
Additionally, this behavior depends on how you compile your code and on the machine it runs on:
If the processor supports SSE2 then you should compile your code with the flags -mfpmath=sse -msse2 to ensure that all floating point operations are done using the SSE registers.
If the processor doesn't support SSE2 then you should use the long double type for the floating point values and compile with the flag -ffloat-store to force GCC to not use registers to store floating point values (you'll have a performance penalty for doing this)
Concerning
c) Are there better ways to get upper/lower bounds?
Another way is to use a different rounding mode, i.e. FE_UPWARD or FE_DOWNWARD instead of the default FE_TONEAREST. See https://stackoverflow.com/a/6867722. This may be slower, but gives a proper upper/lower bound.
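A minimal sketch of that approach (it assumes the platform honors the dynamic rounding mode for sqrt; with GCC you may additionally need -frounding-math):
#include <cfenv>
#include <cmath>

double sqrt_upper_bound(double x) {
    const int old_mode = std::fegetround();
    std::fesetround(FE_UPWARD);   // round subsequent results toward +infinity
    double y = std::sqrt(x);      // now y >= the exact square root of x
    std::fesetround(old_mode);    // restore the previous rounding mode
    return y;
}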

Should I combine multiplication and division steps when working with floating point values?

I am aware of the precision problems with floats and doubles, which is why I am asking this:
If I have a formula such as: (a/PI)*180.0 (where PI is a constant)
Should I combine the division and multiplication, so that I use only one division: a/0.017453292519943295769236, in order to avoid loss of precision?
Does this make it more precise when it has less steps to calculate the result?
Short answer
Yes, you should in general combine as many multiplications and divisions by constants as possible into one operation. It is (in general(*)) faster and more accurate at the same time.
Neither π nor π/180 nor their inverses are representable exactly as floating-point. For this reason, the computation will involve at least one approximate constant (in addition to the approximation of each of the operations involved).
Because two operations introduce one approximation each, it can be expected to be more accurate to do the whole computation in one operation.
In the case at hand, is division or multiplication better?
Apart from that, it is a question of “luck” whether the relative accuracy to which π/180 can be represented in the floating-point format is better or worse than that of 180/π.
My compiler provides additional precision with the long double type, so I am able to use it as a reference for answering this question for double:
~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L
#include <stdio.h>
int main() {
    long double heop = 180.L / PIL;
    long double pohe = PIL / 180.L;
    printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
    printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17
In usual programming practice, one wouldn't bother and simply multiply by (the floating-point representation of) 180/π, because multiplication is so much faster than division.
As it turns out, in the case of binary64 (the floating-point format double almost always maps to), π/180 can be represented with better relative accuracy than 180/π, so π/180 is the constant one should use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error would be approximately the sum of the relative error of the constant (1.688893e-17) and of the relative error of the division (which will depend on the value of a but never be more than 2^-53).
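Concretely, that recommendation looks like this (a sketch; the over-long literal is π/180 to more digits than a double holds, and the compiler rounds it once to the nearest double):
double rad_to_deg(double a) {
    // One rounded constant, one rounded division; nothing else adds error.
    return a / 0.017453292519943295769236;
}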
Alternative methods for faster and more accurate results
Note that division is so expensive that you could get an even more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180/π, and heop2 the best double approximation of 180/π - heop1. Then the best value for the result can be computed as:
double r = fma(a, heop1, a * heop2);
The fact that the above is the absolute best possible double approximation to the real computation is a theorem (in fact, it is a theorem with exceptions. The details can be found in the “Handbook of Floating-Point Arithmetic”). But even when the real constant you want to multiply a double by in order to get a double result is one of the exceptions to the theorem, the above computation is still clearly very accurate and only differs from the best double approximation for a few exceptional values of a.
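One way to obtain heop1 and heop2 is to derive them from the extra long double precision, as in the earlier snippet (a sketch; it assumes long double is wider than double, as with GCC on x86):
#include <cmath>

double rad_to_deg_fma(double a) {
    const long double heopl = 180.0L / 3.141592653589793238462643383279502884197L;
    const double heop1 = (double) heopl;            // leading part of 180/π
    const double heop2 = (double) (heopl - heop1);  // small correction term
    return std::fma(a, heop1, a * heop2);
}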
If, like mine, your compiler provides more precision for long double than for double, you can also use one long double multiplication:
// this is more accurate than double division:
double r = (double)((long double) a * 57.295779513082320876798L);
This is not as good as the solution based on fma, but it is good enough that for most values of a, it produces the optimal double approximation to the real computation.
A counter-example to the general claim that operations should be grouped as one
(*) The claim that it is better to group constants into one is only statistically true for most constants.
If you happened to wish to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001, then by DBL_MIN, and the end result (which can be a normalized number if a is larger than 1000000 or so) would be more precise than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy when representing 0.0000001 * DBL_MIN as a single double value is much worse than the accuracy for representing 0.0000001.
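A sketch of that counter-example (DBL_MIN comes from <cfloat>; the parenthesization is the whole point):
#include <cfloat>

// Two normalized multiplications keep full precision; the combined constant
// 0.0000001 * DBL_MIN would be subnormal, with far fewer significand bits.
double scale_down(double a) {
    return (a * 0.0000001) * DBL_MIN;
}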