Optimization of float power of 2 division - C++

Let's say I want to divide an unsigned int by 2 or 4 or 8, etc.
AFAIK the compiler replaces such a division with a shift.
But can I expect that, instead of dividing a float by 128, the compiler subtracts 7 from its exponent field?
What is the best practice to ensure that exponent subtraction is used instead of floating division?

If you are multiplying or dividing by a constant, a compiler of modest quality should optimize it. On many platforms, a hardware multiply instruction may be optimal.
For multiplying (or dividing) by a power of two, std::ldexp(x, p) multiplies x by 2^p, where p is an int (and divides if p is negative). I would not expect much benefit over a simple multiplication on most platforms: manual (software) exponent manipulation must include checks for overflow and underflow, so the resulting sequence of instructions is unlikely to beat a hardware multiply.
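For illustration, a minimal sketch (assuming an ordinary optimizing compiler): dividing by a power of two is exact, so the compiler is allowed to, and typically does, turn it into a multiplication by the reciprocal; std::ldexp expresses the scaling explicitly but may compile to a library call rather than a single instruction.

#include <cmath>

// Minimal sketch: no manual exponent manipulation is needed for either form.
float div128(float x)       { return x / 128.0f; }        // usually compiled as x * 0.0078125f
float div128_ldexp(float x) { return std::ldexp(x, -7); } // explicit scaling by 2^-7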

Related

Stop double division before decimals (low precision, fast division; getting only the 'quotient')

Basically a performance related question:
I want to get only the integer quotient from a double division. For example, for the division 88.3/12.7 = 6.9527559055118110236220472440945, I only want to get '6' as a result.
A possible implementation would of course be floor(x/y), but here the performance-intensive double division is done first, and afterwards floor throws away most of the 'work' the division did.
So basically I want a division with doubles which 'stops' before calculating all these decimal points and just gives me the correct integer result of the division, without rounding or truncating the initial double arguments. Does anyone know an elegant implementation for this (I searched for this topic but didn't find much)?
Another implementation I can imagine is:
int(x*1000)/int(y*1000)
Where instead of 1000, the needed 'precision' can be used. A very simple implementation would be also simply subtracting y from x until the result is smaller than zero. But yeah, I was wondering what would be the best way to do it.
Also, simply doing int(x)/int(y) is not an option, since it could easily produce wrong results.
By the way, I know this is probably again one of these 'micro-optimization' questions which deal with a matter that does not really matter on new machines, but well, I still am kinda curious on the subject! :-)
There is no way to stop earlier, and using integer division is potentially slower.
For example, on Skylake:
idiv r/m32: latency 26-27 cycles, reciprocal throughput 6
divsd xmm, xmm: latency 13-14 cycles, reciprocal throughput 4
(source)
So the double division is twice as fast and has a significantly better throughput. That is before you factor in the extra multiplications and extra cast.
On older microarchitectures, 32-bit integer division often has lower listed latency than double division, but the numbers varied more (division used to be more serial): for floats, 'round' divisors are faster, while for integer division it is small results that are faster. This difference in characteristics can make it swing either way, depending on what you're dividing by what.
As you can see, it's dangerous in this case to optimize without a specific target in mind, but newer machines are a more likely target than older ones, which means the double division is more or less the best you can do anyway (unless other optimizations apply). Dividing single-precision floats is faster by itself but incurs a conversion cost; adding those up (5+10), it actually ends up losing.
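In other words, a minimal sketch of the recommended approach is simply to divide and truncate (names are mine):

#include <cmath>

// The cast truncates toward zero; apply std::floor to the quotient first if
// negative inputs should round toward negative infinity instead.
long long int_quotient(double x, double y) {
    return static_cast<long long>(x / y);   // e.g. int_quotient(88.3, 12.7) == 6
}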

Is casting a signed integer to a binary floating point number cheaper than the inverse operation?

I know from articles like "Why you should never cast floats to ints" and many others like it that casting a float to a signed int is expensive. I'm also aware that certain conversion instructions or SIMD vector instructions on some architectures can speed the process. I'm curious if converting an integer to floating point is also expensive, as all the material I've found on the subject only talks about how expensive it is to convert from floating point to integer.
Before anyone says "Why don't you just test it?" I'm not talking about performance on a particular architecture, I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the algorithm for conversion that affects performance in general?
Intuitively, I would think that conversion from integer to floating point would be easier in general for the following reasons:
Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating-point number, e.g. a 32-bit integer to a 32-bit float might require rounding, but a 32-bit integer to a 64-bit float won't, and neither will a 32-bit integer that only uses 24 bits of precision.
There is no need to check for NAN or +/- INF or +/- 0.
There is no danger of overflow or underflow.
What are reasons that conversion from int to float could result in poor cross-platform performance, if any (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?
Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles of latency (and 1 cycle throughput) on the basic desktop/server line since Core 2. This can be accepted as a good example.
From the hardware point of view, such a conversion requires some assistance to fit into a reasonable number of cycles; otherwise it gets too expensive. A naive but reasonably good explanation follows. Throughout, I assume that a single CPU clock cycle is enough for an operation like a full-width integer add (but not radically longer!), and that all results of one cycle are available at the next cycle boundary.
The first clock cycle, with appropriate hardware assistance (a priority encoder), gives the Count Leading Zeros (CLZ) result along with detecting two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are best handled separately (load a constant into the destination register and finish). Otherwise, if the input integer was negative, it must be negated; this usually requires one more cycle (because negation is an inversion plus the addition of a carry bit). So, 1-2 cycles are spent.
At the same time, the hardware can calculate a biased-exponent prediction based on the CLZ result. Notice we needn't take care of denormalized values or infinity. (Can we predict CLZ(-x) based on CLZ(x), if x < 0? If we can, this saves us 1 cycle.)
Then a shift is applied (1 cycle again, with a barrel shifter) to place the integer value so its highest 1 bit is at a fixed position (e.g. with the standard 3 extension bits and a 24-bit mantissa, this is bit number 26). This barrel shifter must OR all the shifted-out low bits into the sticky bit (a separate custom barrel shifter instance may be needed, but that is far cheaper than megabytes of cache or an out-of-order dispatcher). Now, up to 3 cycles.
Then rounding is applied. Rounding means analyzing, in our case, the 4 lowest bits of the current value (mantissa LSB, guard, round and sticky), together with the current rounding mode and the target sign (extracted at cycle 1). Rounding to zero (RZ) ignores the guard/round/sticky bits. Rounding to -∞ (RMI) for a positive value, and to +∞ (RPI) for a negative one, is the same as rounding to zero. Rounding toward the ∞ with the same sign as the value adds 1 to the main mantissa when any of the guard/round/sticky bits is set. Finally, round-to-nearest-ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If the hardware is fast enough to apply this result in the same cycle (I guess it's likely), we have up to 4 cycles now.
The addition in the previous step can produce a carry (like 1111 -> 10000), so the exponent can increase. The final cycle packs the sign (from cycle 1), the mantissa (into the significand) and the biased exponent (calculated on cycle 2 from the CLZ result and possibly adjusted by the carry from cycle 4). So, 5 cycles in total.
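To make the sequence concrete, here is a hypothetical C++ sketch of the same int -> binary32 steps (CLZ, normalize, round to nearest with ties to even, pack), performed serially in software where hardware would overlap them. It uses C++20's std::countl_zero; the names are mine, not from any library.

#include <bit>
#include <cstdint>

uint32_t int32_to_f32_bits(int32_t x) {
    if (x == 0) return 0;                                  // special case: +0.0f
    uint32_t sign = (x < 0) ? 0x80000000u : 0u;            // "cycle 1": sign
    uint32_t mag  = (x < 0) ? 0u - static_cast<uint32_t>(x)
                            : static_cast<uint32_t>(x);    // negate via unsigned math (handles INT_MIN)

    int msb = 31 - std::countl_zero(mag);                  // "cycle 1": CLZ gives the top set bit
    uint32_t exp = static_cast<uint32_t>(msb) + 127u;      // biased-exponent prediction

    uint32_t mant;
    if (msb <= 23) {
        mant = mag << (23 - msb);                          // fits exactly, no rounding needed
    } else {
        int shift = msb - 23;                              // bits that fall below the mantissa
        mant = mag >> shift;
        uint32_t dropped = mag & ((1u << shift) - 1u);     // guard/round/sticky information
        uint32_t halfway = 1u << (shift - 1);
        if (dropped > halfway || (dropped == halfway && (mant & 1u)))
            ++mant;                                        // round to nearest, ties to even
        if (mant == (1u << 24)) { mant >>= 1; ++exp; }     // rounding carried out of the mantissa
    }
    return sign | (exp << 23) | (mant & 0x7FFFFFu);        // pack sign, exponent, significand
}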
Is conversion from int to float generally cheaper than float to int?
We can estimate the reverse conversion in the same way, e.g. from binary32 to int32 (signed). Let's assume that conversion of NaN, INF or a too-large value results in a fixed value, say INT_MIN (-2147483648). In that case:
Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also take the rounding mode into account. A "conversion impossible" (overflow or invalid) signal is generated if BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or if BE < 126 (abs(x) < 0.5) with RNE; or if BE = 126, the significand is 0 (without the hidden bit) and RNE. Otherwise, a signal for a final +1 or -1 can be generated for the cases: BE < 127 and {x < 0 and RMI, or x > 0 and RPI, or BE = 126 and RNE}. All these signals can be computed during one cycle with boolean logic circuitry and allow the result to be finalized in the first cycle. In parallel and independently, calculate 157-BE using a separate adder, for use at cycle 2.
If not finalized yet, we have abs(x) >= 1, so BE >= 127 but BE <= 157 (hence abs(x) < 2^31). Get 157-BE from cycle 1; this is the needed shift amount. Apply a right shift by this amount, using the same barrel shifter as in the int -> float algorithm, to a value with (again) 3 additional bits and sticky-bit gathering. Here, 2 cycles are spent.
Apply rounding (see above). 3 cycles spent, and a carry can be produced. Here we can again detect integer overflow and produce the corresponding result value. Forget the additional bits; only 31 bits matter now.
Finally, negate the resulting value, if x was negative (sign=1). Up to 4 cycles spent.
I'm not an experienced binary logic designer, so I could be missing some chance to compact this sequence, but it looks rather close to Intel's values. So the conversions themselves are quite cheap, provided hardware assistance is present (again, it amounts to no more than a few thousand gates, which is tiny for contemporary chip production).
You can also take a look at the Berkeley SoftFloat library - it implements virtually the same approach with minor modifications. Start with the ui32_to_f32.c source file. It uses more additional bits for intermediate values, but this isn't essential.
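For comparison, a hypothetical sketch of the binary32 -> int32 direction described above, simplified to truncation toward zero (the behaviour of a C cast / CVTTSS2SI) instead of honouring a rounding mode. As assumed in the text, NaN, INF and out-of-range inputs all collapse to INT_MIN; -2^31 itself also lands there, which happens to be the correct result.

#include <cstdint>
#include <limits>

int32_t f32_bits_to_int32(uint32_t bits) {
    uint32_t sign = bits >> 31;
    int32_t  be   = static_cast<int32_t>((bits >> 23) & 0xFFu);   // biased exponent
    uint32_t sig  = bits & 0x7FFFFFu;                             // stored significand

    if (be >= 158) return std::numeric_limits<int32_t>::min();    // NaN, INF, |x| >= 2^31
    if (be < 127)  return 0;                                      // |x| < 1 truncates to zero

    uint32_t mant  = sig | 0x800000u;        // restore the hidden bit (magnitude = mant * 2^(be-150))
    int      shift = 157 - be;               // right shift that drops the fractional bits
    uint32_t mag   = (mant << 7) >> shift;   // hidden bit moves to bit 30, then shift down

    return sign ? -static_cast<int32_t>(mag) : static_cast<int32_t>(mag);
}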
See @Netch's excellent answer regarding the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so the int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to an integer, there has to be an FWAIT (Intel).

What algorithm should I use for high-performance large integer division?

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithms if possible (currently Toom-Cook).
I gather there are linear time algorithms for taking various notions of multiplicative inverse of my dividend. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.
My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my divisor, can I shirk computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?
Or is there a dedicated division algorithm that is even better than the inverse-based approach?
Edit: I dug up where I was getting "inverse" approach mentioned above. On page 312 of "Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R" which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it out, and unclear how much overhead memory, etc, will be consumed until I code this up, which would take a while. I'll post it if no one beats me to it.
The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4-digit number by a 2-digit number, and then proceeding via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.
The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that
x = qy + r
but without the restriction that 0 <= r < y. The typical loop is
1. Estimate the quotient q of x/y
2. Compute the corresponding reduction r = x - qy
3. Optionally adjust the quotient so that the reduction r is in some desired interval
4. If r is too big, repeat with r in place of x.
The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
Schoolbook long division, for example, is of this form: step 3 covers the cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
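As a toy illustration (not GMP's code), here is the estimate-and-reduce loop with 64-bit integers standing in for bignums: the quotient is estimated from the bit lengths alone, the way a bignum routine would estimate from the leading limbs, and step 3's adjustment handles the overshoot. It uses C++20's <bit>; y must be nonzero.

#include <bit>
#include <cstdint>

struct QR { uint64_t q, r; };

QR divide_by_reduction(uint64_t x, uint64_t y) {
    uint64_t q_total = 0;
    while (x >= y) {
        int shift = std::bit_width(x) - std::bit_width(y);  // step 1: estimate q as a power of two
        uint64_t q = uint64_t{1} << shift;
        if (q * y > x) q >>= 1;   // step 3: the estimate overshoots by at most a factor of two
        q_total += q;             // accumulate the partial quotient
        x -= q * y;               // steps 2 and 4: reduce, then repeat with the remainder
    }
    return {q_total, x};          // x now holds the true remainder
}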
The divide-and-conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results when x' has twice as many digits as y'.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
Estimate the inverse of y with m = floor(2^k / y)
Estimate x/y with q = 2^(i+j-k) * floor(floor(x / 2^i) * m / 2^j)
In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y), so that gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x per iteration.
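A toy version of the multiply-by-inverse estimate, again with 64-bit integers standing in for bignums and reusing the symbols k, i, j, m from above. The 128-bit product (a GCC/Clang extension here) stands in for a bignum multiplication; the final loop is the reduction step, which with these parameter choices runs at most once.

#include <cstdint>

// Divide a 64-bit x by a nonzero 64-bit y via an approximate reciprocal.
uint64_t divide_by_inverse(uint64_t x, uint64_t y) {
    const int k = 64, j = 64;                      // i = 0, so 2^(i+j-k) = 1
    uint64_t m = ~uint64_t{0} / y;                 // m = floor((2^64 - 1) / y), close to floor(2^k / y)
    // q = floor(x * m / 2^j): the high 64 bits of the 128-bit product.
    uint64_t q = static_cast<uint64_t>((static_cast<unsigned __int128>(x) * m) >> j);
    uint64_t r = x - q * y;                        // reduction; the estimate is low by at most 1
    while (r >= y) { ++q; r -= y; }                // adjustment
    return q;
}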
I do not know the multiplicative-inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.
I do bigint divisions a bit differently.
See bignum division. In particular, take a look at the approximation divider and the 2 links there. One is my fixed-point divider, and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen via NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.
I'm not sure the multiplicative inverse is the way to go.
It is mostly used for the modulo operation where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to acquire a bigint inverse can be bigger than the standard division itself, but as I am not familiar with it, I could be wrong.
The most common divider I have seen used in implementations is Newton-Raphson division, which is very similar to the approximation divider in the link above.
Approximation/iterative dividers usually use multiplication, which defines their speed.
For small enough numbers, long binary division and 32/64-bit digit-base division are usually fast enough, if not fastest: they have small overhead. (In what follows, let n be the max value processed, not the number of digits!)
Binary division example:
It is O(log32(n) * log2(n)) = O(log^2(n)).
It loops through all significant bits. In each iteration you need to compare, subtract, add and bit-shift. Each of those operations can be done in O(log32(n)) limb operations, and log2(n) is the number of bits.
Here is an example of binary division from one of my bigint templates (C++):
template <DWORD N> void uint<N>::div(uint &c, uint &d, uint a, uint b)
    {
    int j, sh;
    c = DWORD(0); d = 1;                 // c accumulates the quotient, d is the current quotient bit
    sh = a.bits() - b.bits();            // align b's MSB with a's MSB
    if (sh < 0) sh = 0; else { b <<= sh; d <<= sh; }
    for (;;)
        {
        j = geq(a, b);                   // compare: 0: a<b, 1: a>b, 2: a==b
        if (j)
            {
            c += d;                      // set the current quotient bit
            sub(a, a, b);                // a -= b
            if (j == 2) break;           // a == b: remainder is zero, done
            }
        if (!sh) break;                  // all bit positions tested
        b >>= 1; d >>= 1; sh--;          // move to the next lower bit
        }
    d = a;                               // whatever is left of a is the remainder
    }
N is the number of 32-bit DWORDs used to store a bigint number.
c = a / b
d = a % b
geq(a,b) is a comparison: a >= b, greater or equal (done in O(log32(n)) = O(N) limb operations).
It returns 0 for a < b, 1 for a > b, 2 for a == b.
sub(c,a,b) is c = a - b
The speed boost comes from the fact that this does not use multiplication (if you do not count the bit shifts).
If you use digits with a big base like 2^32 (ALU words), then you can rewrite the whole thing in a polynomial-like style using the built-in 32-bit ALU operations.
This is usually even faster than binary long division; the idea is to process each DWORD as a single digit, or to recursively halve the arithmetic width used until you hit the CPU's native capabilities.
See division by half-bit-width arithmetic.
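As a minimal illustration of the "each DWORD is one digit in base 2^32" idea, here is a hypothetical helper for the simplest case of a single-digit divisor, using the CPU's 64/32-bit division for each digit step (not taken from the linked code):

#include <cstdint>
#include <vector>

// Divide a bignum stored as little-endian 32-bit limbs by one 32-bit digit b;
// the quotient overwrites a, the remainder is returned.
uint32_t div_by_digit(std::vector<uint32_t>& a, uint32_t b) {
    uint64_t rem = 0;
    for (size_t i = a.size(); i-- > 0; ) {          // from the most significant limb down
        uint64_t cur = (rem << 32) | a[i];          // remainder so far : current limb
        a[i] = static_cast<uint32_t>(cur / b);      // quotient limb
        rem  = cur % b;
    }
    return static_cast<uint32_t>(rem);
}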
On top of all that, while computing with bignums:
If you have optimized basic operations, then the complexity can drop even further as sub-results get smaller over the iterations (changing the complexity of the basic operations). A nice example of that is NTT-based multiplication.
The overhead can mess things up.
Because of this, the runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds, use the faster approach for the bit counts actually in use to get the maximum performance, and optimize what you can.

controlling overflow and loss in precision while multiplying doubles

Question:
I have a large number of floating-point numbers (~10,000 numbers), each having 6 digits after the decimal point. Now, the multiplication of all these numbers would yield about 60,000 digits, but a double only holds about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
My approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them back to decimal. But this also appears cumbersome and may not yield a correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching subnormal territory (and in that case, multiplying by a power of two would be recommended to avoid loss of accuracy from the scaling itself). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so the real bound on the relative error is more like (1 + ε)^9999 - 1, where ε is the machine epsilon. Also, in practice, relative errors often cancel each other, so you can expect the actual relative error to grow like the square root of the theoretical maximum error.
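A minimal sketch of the wider-type suggestion, assuming long double is actually wider than double on the target (true for the 80-bit extended format on x86 with GCC/Clang; where long double equals double, a software format such as double-double or MPFR would be needed instead):

#include <vector>

double product(const std::vector<double>& values) {
    long double p = 1.0L;
    for (double v : values)
        p *= v;                      // intermediate roundings happen in the wider format
    return static_cast<double>(p);   // a single final rounding back to double
}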

IEEE-754 floating point: Divide first or multiply first for best precision?

What's better if I want to preserve as much precision as possible in a calculation with IEEE-754 floating point values:
a = b * c / d
or
a = b / d * c
Is there a difference? If there is, does it depend on the magnitudes of the input values? And, if magnitude matters, how is the best ordering determined when general magnitudes of the values are known?
It depends on the magnitude of the values. Obviously, if one divides by zero, all bets are off; but if a multiplication or division results in a denormal, subsequent operations can lose precision.
You may find it useful to study Goldberg's seminal paper What Every Computer Scientist Should Know About Floating-Point Arithmetic which will explain things far better than any answer you're likely to receive here. (Goldberg was one of the original authors of IEEE-754.)
Assuming that none of the operations would yield an overflow or an underflow, and your input values have uniformly distributed significands, then this is equivalent. Well, I suppose that to have a rigorous proof, one should do an exhaustive test (probably not possible in practice for double precision since there are 2^156 inputs), but if there is a difference in the average error, then it is tiny. I could try in low precisions with Sipe.
In any case, in the absence of overflow/underflow, only the exact values of the significands matter, not the exponents.
However, if the result a is added to (or subtracted from) another expression and not reused, then starting with the division may be more interesting, since you can group the multiplication with the following addition using an FMA (thus with a single rounding).
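For instance, a minimal sketch of that FMA grouping, assuming the surrounding expression is b/d*c + e (names are mine):

#include <cmath>

// Doing the division first lets the multiply and add be fused, so that pair
// of operations incurs only one rounding.
double fused(double b, double c, double d, double e) {
    return std::fma(b / d, c, e);    // fma(x, y, z) computes x*y + z with a single rounding
}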