Software implementation of floating-point division, issues with rounding - C++

As a learning project I am implementing floating-point operations (add, sub, mul, div) in software using C++. The goal is to become more comfortable with the underlying details of floating-point behavior.
I am trying to match my processor's operations bit for bit, i.e. the IEEE 754 standard. So far it has been working great: add, sub and mul behave perfectly. I tested them on around 110 million random operations and got exactly the same results as the processor produces in hardware (although I did not take edge cases, overflow etc. into account).
After that, I moved on to the last operation, division. It works and achieves the wanted result, but from time to time I get the last mantissa bit wrong, not rounded up, and I am having a hard time understanding why.
The main reference I have been using is the great talk by John Farrier (the timestamp points to where he shows how to round):
https://youtu.be/k12BJGSc2Nc?t=1153
That rounding has been working really well for all the operations but is giving me trouble with the division.
Let me give you a specific example.
I am trying to divide 645.68011474609375 by 493.20962524414063
The final result I get is:
mine : 0-01111111-01001111001000111100000
c++_ : 0-01111111-01001111001000111100001
As you can see, everything matches except for the last bit. The way I am computing the division is based on this video:
https://www.youtube.com/watch?v=fi8A4zz1d-s
Following this, I compute 28 bits of accuracy: 24 mantissa bits (the hidden one plus the 23 stored bits), the 3 bits for guard, round and sticky, plus an extra one for the possible shift.
Using the algorithm from the video, I can get a normalization shift of at most 1; that's why I have an extra bit at the end, so that if it gets shifted in during normalization it is still available for the rounding. Now here is the result I get from the division algorithm:
010100111100100011110000 0100
------------------------ ----
^ grs^
|__ to be normalized |____ extra bit
As you can see, I get a 0 in the 24th position, so I need to shift left by one to get the correct normalization.
This means I will get:
10100111100100011110000 100
Based on John Farrier's video, in the case of GRS bits equal to 100, I only round up if the LSB of the mantissa is a 1 (ties to even). In my case it is a zero, and that is why I do not round up my result.
The reason I am a bit lost is that I am sure my algorithm is computing the right mantissa: I have double-checked it with online calculators, and the rounding strategy works for all the other operations. Also, computing it this way triggers the normalization, which in the end yields the correct exponent.
Am I missing something? A small detail somewhere?
One thing that strikes me as odd is the sticky bit: in addition and multiplication you can get much larger shifts, which gives the sticky bit a better chance of being set, whereas here I shift by at most one, which means the sticky bit hardly ever gets anything "stuck" into it.
I hope I gave enough details to make my problem understood. At the link below you can find my division implementation; it is a bit cluttered with prints I am using for debugging, but it should give an idea of what I am doing. The code starts at line 374:
https://gist.github.com/giordi91/1388504fadcf94b3f6f42103dfd1f938
PS: Meanwhile I am going through "What Every Computer Scientist Should Know About Floating-Point Arithmetic" to see if I missed something.

The result you get from the division algorithm is inadequate. You show:
010100111100100011110000 0100
------------------------ ----
^ grs^
|__ to be normalized |____ extra bit
The mathematically exact quotient continues:
010100111100100011110000 0100 110000111100100100011110…
Thus, the residue at the point where you are rounding exceeds ½ ULP, so it should be rounded up. I did not study your code in detail, but it looks like you may have just calculated an extra bit or two of the significand [1]. You actually need to know that the residue is non-zero, not just whether its next bit or two is zero. The final sticky bit should be one if any of the bits at or beyond that position in the exact mathematical result would be non-zero.
Footnote
[1] “Significand” is the preferred term. “Mantissa” is a legacy term for the fraction portion of a logarithm. The significand of a floating-point value is linear. A mantissa is logarithmic.
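For what it's worth, here is a minimal sketch (with made-up names, not taken from your gist) of how the sticky bit can be obtained directly from the division remainder: after producing however many quotient bits you keep, sticky is simply "remainder != 0", which is exactly the "any non-zero bits at or beyond that position" test described above.

    #include <cstdint>

    // Sketch only (illustrative names, not the code from the gist): a plain long
    // division of two 24-bit significands (hidden bit included), keeping `bits`
    // quotient bits.  The point is the last line: the sticky flag is derived from
    // the remainder, so it is 1 exactly when the mathematically exact quotient
    // has non-zero bits beyond the ones that were kept.
    struct QuotientBits {
        uint32_t q;       // the kept quotient bits (28 in the question)
        bool     sticky;  // true if the exact quotient continues with non-zero bits
    };

    QuotientBits divide_significands(uint32_t a, uint32_t b, int bits)
    {
        uint64_t dividend = static_cast<uint64_t>(a) << (bits - 1);  // a/b is in (0.5, 2)
        uint64_t rem = 0, q = 0;
        for (int i = 63; i >= 0; --i) {        // classic restoring division, MSB first
            rem = (rem << 1) | ((dividend >> i) & 1);
            q <<= 1;
            if (rem >= b) { rem -= b; q |= 1; }
        }
        return { static_cast<uint32_t>(q), rem != 0 };
    }

With a true sticky bit, the trailing bits in your example become 101 rather than 100, which is more than half an ULP, so the result rounds up and matches the hardware.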

Related

Select ULP value in float comparison

I've read several resources on the net and understood that there is no single value or universal parameter to use when comparing floating-point numbers. I've read several replies here and found the code from Google Test for comparing floats. I want to better understand the meaning of ULP and its value. Reading the comments in the source code, I found:
The maximum error of a single floating-point operation is 0.5 units in
the last place. On Intel CPU's, all floating-point calculations are
done with 80-bit precision, while double has 64 bits. Therefore, 4
should be enough for ordinary use.
It's not really clear why "therefore 4 should be enough". Can anyone explain why? From my understanding, we are saying that we can tolerate a difference of 4*10^-6 or 4*10^-15 between our numbers to decide whether they are the same or not, taking into account the number of significant digits of a float (6/7) or a double (15/16). Is that correct?
It is wrong. Very wrong. Consider that every operation can accumulate some error: ½ ULP is the maximum (in round-to-nearest mode), so ¼ might be an average. So 17 operations are enough to accumulate more than 4 ULP of error just from average effects [1]. Today’s computers do billions of operations per second. How many operations will a program do between its inputs and some later comparison? That depends on the program, but it could be zero, dozens, thousands, or millions just for “ordinary” use. (Let’s say we exclude billions because then it gets slow for a human to use, so we can call that special-purpose software, not ordinary.)
But that is not all. Suppose we add a few numbers around 1 and then subtract a number that happens to be around the sum. Maybe the adds get a total error around 2 ULP. But when we subtract, the result might be around 2^-10 instead of around 1. So the ULP of 2^-10 is 1024 times smaller than the ULP of 1. That error that is 2 ULP relative to 1 is 2048 ULP relative to the result of the subtraction. Oops! 4 ULP will not cut it. It would need to be 4 ULP of some of the other numbers involved, not the ULP of the result.
In fact, characterizing the error is difficult in general and is the subject of an entire field of study, numerical analysis. 4 is not the answer.
Footnote
[1] Errors will vary in direction, so some will cancel out. The behavior might be modeled as a random walk, and the average error might be proportional to the square root of the number of operations performed.
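To make the subtraction example above concrete, here is a small sketch (the helper name is made up, and it ignores signs, NaNs and infinities) that shows an error of 2 ULP relative to 1 turning into 2048 ULP relative to a result near 2^-10:

    #include <cfloat>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // ULP distance between two positive finite floats (sketch: no handling of
    // signs, NaNs or infinities), using the ordered-integer view of the bits.
    static int64_t ulp_distance(float a, float b)
    {
        int32_t ia, ib;
        std::memcpy(&ia, &a, sizeof ia);
        std::memcpy(&ib, &b, sizeof ib);
        return ia > ib ? static_cast<int64_t>(ia) - ib : static_cast<int64_t>(ib) - ia;
    }

    int main()
    {
        float ideal = 1.0f;                        // the mathematically intended sum
        float noisy = 1.0f + 2 * FLT_EPSILON;      // the same sum carrying 2 ULP of accumulated error
        float c     = 0.9990234375f;               // 1 - 2^-10, exactly representable

        float r1 = ideal - c;                      // exactly 2^-10
        float r2 = noisy - c;                      // 2^-10 + 2^-22

        std::printf("before the subtraction: %lld ULP apart (relative to 1)\n",
                    static_cast<long long>(ulp_distance(ideal, noisy)));   // prints 2
        std::printf("after  the subtraction: %lld ULP apart (relative to 2^-10)\n",
                    static_cast<long long>(ulp_distance(r1, r2)));         // prints 2048
    }

Both subtractions are exact here; the 2048-ULP gap is entirely the error that was already present before the cancellation, now measured in the much smaller ULP of the result.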

Is casting a signed integer to a binary floating point number cheaper than the inverse operation?

I know from articles like "Why you should never cast floats to ints" and many others like it that casting a float to a signed int is expensive. I'm also aware that certain conversion instructions or SIMD vector instructions on some architectures can speed the process. I'm curious if converting an integer to floating point is also expensive, as all the material I've found on the subject only talks about how expensive it is to convert from floating point to integer.
Before anyone says "Why don't you just test it?" I'm not talking about performance on a particular architecture, I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the algorithm for conversion that affects performance in general?
Intuitively, I would think that conversion from integer to floating point would be easier in general for the following reasons:
Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating point number, e.g. a 32-bit integer to a 32-bit float might require rounding, but a 32-bit integer to a 64-bit float won't, and neither will a 32-bit integer that only uses 24 bits of precision.
There is no need to check for NaN or +/- INF or +/- 0.
There is no danger of overflow or underflow.
What are reasons that conversion from int to float could result in poor cross-platform performance, if any (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?
Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles of latency (and 1 cycle throughput) on the basic desktop/server line since Core 2. This can be taken as a good reference point.
From the hardware point of view, such a conversion requires some assistance to make it fit into a reasonable number of cycles; otherwise it gets too expensive. A naive but rather good explanation follows. Throughout, I assume that a single CPU clock cycle is enough for an operation like a full-width integer addition (but not radically longer!), and that all results of the previous cycle are available at the cycle boundary.
The first clock cycle, with appropriate hardware assistance (a priority encoder), gives the Count Leading Zeros (CLZ) result along with detection of two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are better processed separately (load a constant into the destination register and finish). Otherwise, if the input integer was negative, it has to be negated; this usually requires one more cycle (because negation is a combination of inversion and the addition of a carry bit). So, 1-2 cycles are spent.
At the same time, the biased exponent can be predicted from the CLZ result. Notice we needn't take care of denormalized values or infinity. (Can we predict CLZ(-x) from CLZ(x) if x < 0? If we can, this saves us a cycle.)
Then, a shift is applied (1 cycle again, with a barrel shifter) to place the integer value so that its highest 1 is at a fixed position (e.g. with the standard 3 extension bits and a 24-bit mantissa, this is bit number 26). This use of the barrel shifter has to OR all the shifted-out low bits into the sticky bit (a separate custom barrel shifter instance may be needed, but this is way cheaper than megabytes of cache or an OoO dispatcher). Now, up to 3 cycles.
Then, rounding is applied. Rounding means analyzing, in our case, the 4 lowest bits of the current value (mantissa LSB, guard, round and sticky) together with the current rounding mode and the target sign (extracted at cycle 1). Rounding to zero (RZ) means ignoring the guard/round/sticky bits. Rounding to -∞ (RMI) for a positive value and to +∞ (RPI) for a negative value is the same as rounding to zero. Rounding towards the infinity whose sign matches the value means adding 1 to the main mantissa whenever the guard/round/sticky bits are not all zero. Finally, round-to-nearest-ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If the hardware is fast enough to apply this increment in the same cycle (I guess it's likely), we have up to 4 cycles now.
The addition in the previous step can produce a carry (like 1111 -> 10000), so the exponent can increase. The final cycle packs the sign (from cycle 1), the mantissa (into the significand field) and the biased exponent (calculated in cycle 2 from the CLZ result, possibly adjusted by the carry from cycle 4). So, 5 cycles now.
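For readers who prefer code to cycle counts, here is a rough software rendering of the int -> float sequence above, round-to-nearest-even only; the "cycle" comments refer to the steps in the text, not to any real CPU:

    #include <cstdint>

    // Software sketch of the int32 -> binary32 conversion described above
    // (round-to-nearest-even only).  Returns the raw IEEE 754 bit pattern.
    uint32_t int32_to_float_bits(int32_t x)
    {
        if (x == 0) return 0;                                  // special case: +0
        if (x == INT32_MIN) return 0xCF000000u;                // special case: -2^31 is exact

        uint32_t sign = (x < 0) ? 0x80000000u : 0u;            // cycle 1: sign
        uint32_t mag  = (x < 0) ? 0u - static_cast<uint32_t>(x)
                                : static_cast<uint32_t>(x);    // cycles 1-2: negate if needed

        int msb = 31;                                          // cycle 1: CLZ / priority encoder
        while ((mag >> msb) == 0) --msb;

        uint32_t bexp = 127u + static_cast<uint32_t>(msb);     // cycle 2: biased exponent

        // Cycle 3: barrel shift so the leading 1 lands at bit 26 (24 significand
        // bits at 26..3, guard/round/sticky at 2..0), ORing everything shifted
        // further out into the sticky position.
        uint64_t wide = static_cast<uint64_t>(mag) << 26;
        uint32_t val  = static_cast<uint32_t>(wide >> msb);
        if ((wide & ((1ull << msb) - 1)) != 0) val |= 1u;      // gather sticky

        // Cycle 4: round to nearest, ties to even.
        uint32_t sig = val >> 3;
        uint32_t grs = val & 7u;
        if (grs > 4u || (grs == 4u && (sig & 1u))) ++sig;
        if (sig == (1u << 24)) { sig >>= 1; ++bexp; }          // carry out of the significand

        // Cycle 5: pack sign, biased exponent and fraction.
        return sign | (bexp << 23) | (sig & 0x007FFFFFu);
    }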
Is conversion from int to float generally cheaper than float to int?
We can estimate the reverse conversion, e.g. from binary32 to int32 (signed), in the same way. Let's assume that conversion of NaN, INF or a too-large value results in a fixed value, say, INT_MIN (-2147483648). In that case:
Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also apply the rounding mode. A "conversion impossible" (overflow or invalid) signal is generated if BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or if BE < 126 (abs(x) < 0.5) with RNE; or if BE = 126, significand = 0 (without hidden bit) and RNE. Otherwise, signals for a final +1 or -1 can be generated for the cases: BE < 127 and {x < 0 and RMI, or x > 0 and RPI}; or BE = 126 and RNE. All these signals can be calculated during one cycle using boolean logic circuitry, and they lead to a finalized result in the first cycle. In parallel and independently, calculate 157-BE using a separate adder, for use at cycle 2.
If not finalized yet, we have abs(x) >= 1, so BE >= 127, but BE <= 157 (so abs(x) < 2^31). Get 157-BE from cycle 1; this is the needed shift amount. Apply a right shift by this amount, using the same kind of barrel shifter as in the int -> float algorithm, to a value with (again) 3 additional bits and sticky-bit gathering. Here, 2 cycles are spent.
Apply rounding (see above). 3 cycles are spent, and a carry can be produced. Here, we can again detect integer overflow and produce the respective result value. Forget the additional bits; only 31 bits matter now.
Finally, negate the resulting value if x was negative (sign = 1). Up to 4 cycles are spent.
I'm not an experienced digital logic designer, so I could be missing some way to compact this sequence, but it looks rather close to Intel's values. So, the conversions themselves are quite cheap, provided hardware assistance is present (again, it amounts to no more than a few thousand gates, which is tiny by contemporary chip-production standards).
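And a matching sketch for the float -> int direction, under the same assumption that NaN, infinities and out-of-range magnitudes all map to INT_MIN; where hardware would keep only guard/round/sticky bits, this sketch keeps the whole discarded remainder, which is equivalent for rounding purposes:

    #include <cstdint>

    // Companion sketch of the binary32 -> int32 conversion, round-to-nearest-even
    // only.  NaN, infinities and out-of-range magnitudes map to INT32_MIN (note
    // that -2^31 itself falls into that branch and happens to get the correct value).
    int32_t float_bits_to_int32(uint32_t bits)
    {
        uint32_t sign = bits >> 31;
        int32_t  be   = static_cast<int32_t>((bits >> 23) & 0xFFu);  // biased exponent
        uint32_t frac = bits & 0x007FFFFFu;

        if (be >= 158) return INT32_MIN;      // abs(x) >= 2^31, NaN or infinity
        if (be < 126)  return 0;              // abs(x) < 0.5 rounds to 0 under RNE
        if (be == 126) {                      // 0.5 <= abs(x) < 1
            int32_t r = (frac == 0) ? 0 : 1;  // exactly 0.5 ties to even -> 0
            return sign ? -r : r;
        }

        uint32_t sig24 = frac | 0x00800000u;  // significand with the hidden bit
        int k = 150 - be;                     // fraction bits to drop (may be <= 0)
        uint32_t q;
        if (k <= 0) {
            q = sig24 << -k;                  // the value is already an integer: exact
        } else {
            q = sig24 >> k;                                   // truncated integer part
            uint32_t rem  = sig24 & ((1u << k) - 1u);         // dropped fraction bits
            uint32_t half = 1u << (k - 1);
            if (rem > half || (rem == half && (q & 1u)))      // nearest, ties to even
                ++q;
        }
        return sign ? -static_cast<int32_t>(q) : static_cast<int32_t>(q);
    }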
You can also take a look at the Berkeley SoftFloat library: it implements virtually the same approach with minor modifications. Start with the ui32_to_f32.c source file. It uses more additional bits for intermediate values, but that isn't essential.
See @Netch's excellent answer regarding the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so an int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to integer, there has to be an FWAIT (Intel).

controlling overflow and loss in precision while multiplying doubles

Question:
I have a large number of floating-point numbers (~10,000 numbers), each having 6 digits after the decimal point. Now, the product of all these numbers would have about 60,000 digits, but a double only holds about 15 significant digits. The output product has to have 6 digits of precision after the decimal point.
My approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them back to decimal, but this also appears cumbersome and may not yield the correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching into subnormal territory (and in that case, multiplying by a power of two would be recommended, so that the scaling itself does not introduce any loss of accuracy). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. To first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that is at most about 9999 machine epsilons away from the mathematical result in relative terms (*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
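A minimal sketch of the suggested approach, assuming long double is wider than double on your platform (as with the 80-bit x87 format); the function name and the use of std::vector are just for illustration:

    #include <vector>

    // Sketch of the suggestion above: accumulate the product in a wider type
    // and round back to double only once at the end.  Assumes long double is
    // wider than double on this platform.
    double product(const std::vector<double>& values)
    {
        long double acc = 1.0L;
        for (double v : values)
            acc *= v;                          // intermediate rounding happens in the wider format
        return static_cast<double>(acc);       // a single final rounding back to double
    }

If long double is no wider than double on your platform (as with MSVC on x86-64, for example), fall back to one of the software options mentioned above, such as MPFR or double-double arithmetic.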

Floating Point addition using integer operations

I am writing code to emulate floating-point addition in C++ using integer addition and shifts for some homework. I have googled the topic, and I am able to add floating-point numbers by adjusting the exponents and then adding. The problem is that I could not find the appropriate algorithm for rounding the result. Right now I am using truncation, which shows errors of something like 0.000x magnitude. But when I try to use this adder for complex calculations like FFTs, it shows enormous errors.
So what I am looking for now is the exact algorithm that my machine uses for rounding floating-point results. It would be great if someone could post a link for that purpose.
Thanks in advance.
Most commonly, if the bits to be rounded away represent a value less than half that of the smallest bit to be retained, they are rounded downward, the same as truncation. If they represent more than half, they are rounded upward, thus adding one in the position of the smallest retained bit. If they are exactly half, they are rounded downward if the smallest retained bit is zero and upward if the bit is one. This is called “round-to-nearest, ties to even.”
This presumes you have all the bits you are rounding away, that none have been lost yet in the course of doing arithmetic. If you cannot keep all the bits, there are techniques for keeping track of enough information about them to do the correct rounding, such as maintaining three bits called guard, round, and sticky bits.
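If it helps, here is a minimal sketch of that rule in code, assuming the sum has already been assembled with its guard, round and sticky bits in the three lowest positions (the function name is made up):

    #include <cstdint>

    // Round-to-nearest, ties-to-even applied to a significand that carries its
    // guard, round and sticky bits in the three lowest positions.  A sketch of
    // the rule described above, not a complete adder; the caller must
    // renormalize if the increment carries out of the top bit.
    uint64_t round_nearest_even(uint64_t sig_with_grs)
    {
        uint64_t sig = sig_with_grs >> 3;      // the bits being kept
        uint64_t grs = sig_with_grs & 7;       // guard, round, sticky

        if (grs > 4)                           // more than half: round up
            ++sig;
        else if (grs == 4 && (sig & 1))        // exactly half: round up only if LSB is 1
            ++sig;
        // less than half (grs < 4): discard, i.e. round down

        return sig;
    }

Remember that the sticky bit must be the OR of every bit shifted out while aligning the smaller operand, as described above.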

Division by Multiplication and Shifting

Why is it that, when you use the multiplication/shift method of division (for instance, multiply by 2^32/10, then shift right by 32) with negative numbers, you get the expected result minus one?
For instance, if you do 99/10 you get 9, as expected, but if you do -99 / 10 you get -10.
I verified that this is indeed the case (I did this manually with bits) but I can't understand the reason behind it.
If anyone can explain why this happens in simple terms I would be thankful.
Why is it that, when you use the multiplication/shift method of division (for instance, multiply by 2^32/10, then shift right by 32) with negative numbers, you get the expected result minus one?
You get the expected result, rounded down (toward negative infinity).
-99/10 is -9.9, which is -10 when rounded down. C's integer division, by contrast, truncates toward zero, which is why you expected -9.
Edit: I googled a bit more; this article mentions that you're supposed to handle negatives as a special case:
Be aware that in the debug mode the optimized code can be slower, especially if you have both negative and positive numbers and you have to handle the sign yourself.
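Here is a small self-contained demonstration of the floor-versus-truncate difference (it assumes an arithmetic right shift on signed values, which C++ guarantees only since C++20 but compilers do in practice, and it uses the truncated constant floor(2^32/10), which is accurate enough for inputs this small):

    #include <cstdint>
    #include <cstdio>

    // Divide by 10 via "multiply by 2^32/10, then shift right by 32".
    // The arithmetic right shift of a negative product implements floor
    // division, while operator/ truncates toward zero.
    int32_t div10_mulshift(int32_t x)
    {
        const int64_t M = 429496729;                       // floor(2^32 / 10)
        return static_cast<int32_t>((x * M) >> 32);        // rounds toward -infinity
    }

    int main()
    {
        std::printf(" 99/10: operator/ = %d, mul+shift = %d\n",
                    99 / 10, static_cast<int>(div10_mulshift(99)));    // 9 and 9
        std::printf("-99/10: operator/ = %d, mul+shift = %d\n",
                    -99 / 10, static_cast<int>(div10_mulshift(-99)));  // -9 and -10
    }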