Division by Multiplication and Shifting - bit-manipulation

Why, when you use the multiplication/shift method of division (for instance, multiply by 2^32/10, then shift right by 32), do you get the expected result minus one with negative numbers?
For instance, if you do 99/10 you get 9, as expected, but if you do -99 / 10 you get -10.
I verified that this is indeed the case (I did this manually with bits) but I can't understand the reason behind it.
If anyone can explain why this happens in simple terms I would be thankful.

Why, when you use the multiplication/shift method of division (for instance, multiply by 2^32/10, then shift right by 32), do you get the expected result minus one with negative numbers?
You get the expected result, rounded down (toward negative infinity): an arithmetic right shift floors the quotient, whereas integer division in C-like languages truncates toward zero, and for inexact negative quotients those two differ by one.
-99/10 is -9.9, which is -10 rounded down.
Edit: Googled a bit more; this article mentions that you're supposed to handle negatives as a special case:
Be aware that in debug mode the optimized code can be slower, especially if you have both negative and positive numbers and you have to handle the sign yourself.
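A minimal sketch of the trick in question (the helper name is made up; the constant is ceil(2^32/10), and right-shifting a negative signed value is assumed to be an arithmetic shift, as it is on mainstream compilers):

#include <cstdint>
#include <iostream>

// Divide by 10 via multiply + shift; magic = ceil(2^32 / 10).
int64_t div10(int64_t x) {
    const int64_t magic = 429496730;
    return (x * magic) >> 32;   // arithmetic shift rounds toward -infinity
}

int main() {
    std::cout << div10(99)  << " vs " << 99 / 10  << '\n';  // 9 vs 9
    std::cout << div10(-99) << " vs " << -99 / 10 << '\n';  // -10 vs -9
}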

Related

Software implementation of floating point division, issues with rounding

As a learning project I am implementing floating point operations (add, sub, mul, div) in software using C++. The goal is to become more comfortable with the underlying details of floating-point behavior.
I am trying to match my processor's operations to the exact bit, meaning the IEEE 754 standard. So far it has been working great: add, sub, and mul behave perfectly. I tested them on around 110 million random operations and got exactly the same results as the processor produces in hardware (although I did not take edge cases, overflow, etc. into account).
After that, I started moving to the last operation, division. It works fine and achieves the wanted result, but from time to time I get the last mantissa bit wrong: not rounded up. I am having a bit of a hard time understanding why.
The main reference I have been using is the great talk by John Farrier (the timestamp is at the point where it shows how to round):
https://youtu.be/k12BJGSc2Nc?t=1153
That rounding has been working really well for all operations, but it is giving me trouble for the division.
Let me give you a specific example.
I am trying to divide 645.68011474609375 by 493.20962524414063
The final result I get is:
mine : 0-01111111-01001111001000111100000
c++_ : 0-01111111-01001111001000111100001
As you can see everything matches except for the last bit. The way I am computing the division is based on this video:
https://www.youtube.com/watch?v=fi8A4zz1d-s
Following this, I compute 28 bits of accuracy: 24 bits of mantissa (hidden one + 23 mantissa bits), the 3 bits for guard, round, and sticky, plus an extra one for the possible shift.
Using the algorithm in the video, I can get a normalization shift of at most 1; that's why I have the extra bit at the end, so that if it gets shifted in during normalization it is still available for the rounding. Now here is the result I get from the division algorithm:
010100111100100011110000 0100
------------------------ ----
^                        grs^
|__ to be normalized        |__ extra bit
As you can see I get a 0 in the 24th position, so I will need to shift left by one to get the correct normalization.
This means I will get:
10100111100100011110000 100
Based on John Farrier's video, in the case of 100 as the grs bits I only round up if the LSB of the mantissa is a 1. In my case it is a zero, and that is why I do not round up my result.
The reason I am a bit lost is that I am sure my algorithm is computing the right mantissa; I have double-checked it with online calculators, and the rounding strategy works for all the other operations. Also, computing it this way triggers the normalization, which in the end yields the correct exponent.
Am I missing something? A small detail somewhere?
One thing that strikes me as odd is the sticky bits: in addition and multiplication you get varying amounts of shifting, which gives the sticky bits a higher chance of triggering. Here I shift by at most one, which leaves the sticky bits not really sticky.
I do hope I gave enough details to make my problem understood. At the bottom of the link below you can find my division implementation; it is a bit filled with prints I am using for debugging, but it should give an idea of what I am doing. The code starts at line 374:
https://gist.github.com/giordi91/1388504fadcf94b3f6f42103dfd1f938
PS: meanwhile I am going through "What Every Computer Scientist Should Know About Floating-Point Arithmetic" to see if I missed something.
The result you get from the division algorithm is inadequate. You show:
010100111100100011110000 0100
------------------------ ----
^                        grs^
|__ to be normalized        |__ extra bit
The mathematically exact quotient continues:
010100111100100011110000 0100 110000111100100100011110…
Thus, the residue at the point where you are rounding exceeds ½ ULP, so the result should be rounded up. I did not study your code in detail, but it looks like you may have just calculated an extra bit or two of the significand.¹ You actually need to know whether the residue is non-zero, not just whether its next bit or two is zero. The final sticky bit should be one if any of the bits at or beyond that position in the exact mathematical result would be non-zero (a sketch follows the footnote).
Footnote
¹ “Significand” is the preferred term. “Mantissa” is a legacy term for the fraction portion of a logarithm. The significand of a floating-point value is linear. A mantissa is logarithmic.
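To make that concrete, here is a minimal sketch (names and widths are illustrative, not the asker's actual code) of a restoring division of two aligned significands, where the sticky bit is taken from the entire remaining residue rather than from a fixed number of extra quotient bits:

#include <cstdint>

// Restoring division of 24-bit aligned significands, assuming num < 2*den
// so that at most one normalization shift is needed.
struct Quotient {
    uint32_t bits;   // 24 quotient bits + guard + round
    bool sticky;     // set iff the exact quotient has ANY further 1 bits
};

Quotient divideSignificands(uint64_t num, uint64_t den) {
    uint32_t q = 0;
    uint64_t rem = num;
    for (int i = 0; i < 26; ++i) {   // 24 significand bits + guard + round
        q <<= 1;
        rem <<= 1;
        if (rem >= den) {
            rem -= den;
            q |= 1;
        }
    }
    // The point from the answer above: the sticky bit must reflect the
    // whole residue. A non-zero remainder means the exact quotient goes on
    // with a 1 bit somewhere, even if the next bit or two happen to be 0.
    return { q, rem != 0 };
}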

Calculating pow with doubles gives wrong results

I'm programming a calculator on an Arduino, and I'm trying to calculate pow and write it to a string (result). This is my code:
dtostrf(exp(n*log(x)), 0, 5, result); // x ^ n
2 ^ 2 = 4.00000 // works fine
10 ^ 5 = 99999.9770 // should be 100000
What's wrong with my code, and how can I always get the right result?
I mean, how can I round it but still be able to use doubles (e.g. 5.2 ^ 3.123)?
You're just hitting rounding errors. There's nothing you can do about this, except revert to an integer-based approach whenever the inputs are integers.
You could condition on whether the inputs are integers, and if so then use integer arithmetic; if not, then use doubles. But using exp and log will always introduce rounding errors, so you can't expect exact answers with that approach.
More precisely, to use integer arithmetic, you need the base to be an integer and the exponent to be a non-negative integer.
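A sketch of that split, assuming Arduino-style C++ (the helper name power is made up): exponentiation by squaring is exact as long as the base is an integer, the exponent is a non-negative integer, and all intermediate products are exactly representable; everything else falls back to the inexact exp/log route.

#include <math.h>

double power(double x, double n) {
    if (x == floor(x) && n == floor(n) && n >= 0) {
        double result = 1.0;
        double base = x;
        unsigned long e = (unsigned long)n;
        while (e > 0) {              // exponentiation by squaring
            if (e & 1) result *= base;
            base *= base;
            e >>= 1;
        }
        return result;               // e.g. power(10, 5) == 100000 exactly
    }
    return exp(n * log(x));          // inexact, as in the question
}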
Since you are programming a calculator, speed is not your concern, but the number of reliable digits is. So you could try to use a double-precision library. It uses 64-bit doubles but manages only about 200 FLOPS at a 16 MHz CPU clock, and much less for higher-order calculations like exp(), log(), or sin(). Thus it will take a second after you have typed in the digits and pressed the enter button, but that was also the case with the old 8-bit pocket calculators.
See this link (German only).

d0 when taking roots of numbers

So in general, I understand the difference between specifying 3. and 3.0d0 with the difference being the number of digits stored by the computer. When doing arithmetic operations, I generally make sure everything is in double precision. However, I am confused about the following operations:
64^(1./3.) vs. 64^(1.0d0/3.0d0)
It took me a couple of weeks to find an error where I was assigning the output of 64^(1.0d0/3.0d0) to an integer. Because 64^(1.0d0/3.0d0) returns 3.999999, the integer got the value 3 and not 4. However, 64^(1./3.) = 4.00000. Can someone explain to me why it is wise to use 1./3. vs. 1.0d0/3.0d0 here?
The issue isn't so much single versus double precision. All floating point calculations are subject to imprecision compared to true real numbers. In assigning a real to an integer, Fortran truncates. You probably want to use the Fortran intrinsic nint.
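The same truncation-versus-rounding distinction is easy to see outside Fortran; here is a small C++ illustration on a typical IEEE 754 system, where std::lround plays the role of nint:

#include <cmath>
#include <iostream>

int main() {
    double r = std::pow(64.0, 1.0 / 3.0);   // about 3.9999999999999996
    std::cout << (int)r << '\n';            // 3: conversion truncates
    std::cout << std::lround(r) << '\n';    // 4: rounds to nearest, like nint
}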
This is a peculiar, fortuitous case where the lower-precision calculation gives the exact result. You can see this without the integer conversion issue:
write(*,*)4.d0-64**(1./3.),4.d0-64**(1.d0/3.d0)
0.000000000 4.440892E-016
In general this does not happen; here the double-precision value is "better":
write(*,*)13.d0-2197**(1./3.),13.d0-2197**(1.d0/3.d0)
-9.5367E-7 1.77E-015
Here, since the s.p. calculation comes out slightly high, it gives you the correct value on integer conversion, while the d.p. result gets truncated down and is hence wrong, even though its floating-point error was smaller.
So in general, no, you should not consider use of single precision to be preferred.
In fact, 64 and 125 seem to be the only special cases where the s.p. calculation gives a perfect cube root while the d.p. calculation does not.
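A quick way to check that claim (a hypothetical C++ sweep over small perfect cubes; the Fortran defaults map to float and double here):

#include <cmath>
#include <cstdio>

int main() {
    // Report cubes whose single-precision cube root is exact while the
    // double-precision one is not.
    for (int n = 2; n <= 100; ++n) {
        int cube = n * n * n;
        float  s = std::pow((float)cube, 1.0f / 3.0f);
        double d = std::pow((double)cube, 1.0 / 3.0);
        if (s == (float)n && d != (double)n)
            std::printf("%d = %d^3\n", cube, n);
    }
}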

What is the most efficient way to sum a fractional part of a double and increment when it "overflows"?

To make a long story short, I have a piece of code, well over a decade old, that is in use both by us and by outside customers. We have a "shift" number by which we move a shifting window. It was designed as an integer because we're going over distinct positions in the data, so there was no conception of a fractional move. Now, customers would like to have a non-integer shift number. Not interpolation: the program should still do an integer shift, just shifting by one more when the accumulated fraction passes a boundary.
An example may make more sense. Let's say that we have a shift of 10. The positions will go as follows:
0, 10, 20, 30, 40, etc
Now, instead, we want to be able to set a shift of 10.4. Thus, we want the shift to work as follows:
0, 10, 20, 31, 41, etc
A shift of 10.5 would instead come out as 0, 10, 21, 31, 42, and so on.
Basically, we sum up that fractional bit, and when it crosses the decimal point we just shift by one more. Of course, as often happens with floating-point operations, we run into potential accuracy issues, but we also want to keep the speed up. One naive approach is to separate the fractional bit out at the start and keep summing it up, checking its value and decrementing it when it hits 1.0. This has the advantage of following how I tend to think of the operation, but it involves a conditional check every iteration, and there's the usual potential for accumulated error.
I could also pre-calculate how many times we can add that fractional bit before we have to check whether it exceeds 1.0 (so if our fractional bit is 0.5, we know we only need to check every other time; if it's 0.3, we only have to check every four or so).
The usual approach to handling repeated summing is, of course, to replace it with a multiplication, but here we don't care so much about the actual sum as about predicting which frames we need to "shift one more" on to make things match up at the end.
The typical task we have involves this class operating on a relatively small combination of factors, for example iterating with a shift of 96.46875 fewer than 3000 times. However, there's no guarantee that this constraint will remain valid, so I've been told to account for the possibility that someone will shift the window ten million times and we'll still want to know how far to shift.
Any advice?
Consider setting shift to the double nearest the desired value and increasing it slightly (just once) with:
shift = nexttoward(shift, INFINITY); // Ensure shift is above the threshold.
Then, to calculate the current position, use:
result = floor(step * shift);
This may produce a value too large when the product of step and the error in shift nears one. (There may also be a slight rounding error in the multiplication itself.) However, that will not occur for many steps, as shown below.
The error in shift is at most 1.5 ULP (0.5 ULP from the initial conversion from decimal and 1 from nexttoward). If shift is less than 1024, a ULP is less than 2^(10−52) = 2^−42. If step is at most 10,000,000, then the error is less than 10,000,000 · 1.5 · 2^−42, which is approximately 3.41·10^−6. So it remains a long way from the magnitude necessary to produce an incorrect result.
If you calculate the result cumulatively by adding shift each time, instead of by a fresh multiplication, then there may be additional errors. These likely remain too small to cause an error, but they should be evaluated.
If you reach the limits described above, there are ways to mitigate the errors further.
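A minimal sketch of this approach using the question's 10.4 example (illustrative values only):

#include <cmath>
#include <cstdio>

int main() {
    double shift = 10.4;
    shift = std::nexttoward(shift, INFINITY);  // ensure shift is above the threshold

    // Fresh multiplication per step, so rounding error cannot accumulate.
    for (int step = 0; step < 5; ++step)
        std::printf("%d ", (int)std::floor(step * shift));  // 0 10 20 31 41
    std::printf("\n");
}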
Why not use the floor function? Without seeing the code you've got, here's a guess at what would work:
for (int i = 0; i < 3000; i++)
{
    cout << static_cast<int>(floor(i * shift)) << '\n';
}
One could argue against the efficiency of this approach, but if you are talking less than 3000 iterations, you are fine.

C++ floating point subtraction error and absolute values

The way I understand it: when subtracting two double-precision numbers in C++, each is first transformed into a significand starting with one, times 2 to the power of an exponent. One can then lose precision if the subtracted numbers have the same exponent and many of the same digits in the significand. To test this for my code I wrote the following safe addition function:
#include <cmath>      // frexp, fabs
#include <iostream>
using namespace std;

double Sadd(double d1, double d2, int& report, double prec) {
    int exp1, exp2;
    double man1 = frexp(d1, &exp1), man2 = frexp(d2, &exp2);
    if (d1 * d2 < 0) {                       // opposite signs: effectively a subtraction
        if (exp1 == exp2) {                  // same binade
            if (fabs(man1 + man2) < prec) {  // significands nearly cancel
                cout << "Floating point error" << endl;
                report = 0;
            }
        }
    }
    return d1 + d2;
}
However, testing this I notice something strange: the actual error (not whether the function reports an error, but the actual error resulting from the computation) seems to depend on the absolute values of the subtracted numbers, not just on the number of equal digits in the significand...
For example, using 1e-11 as the precision prec and subtracting the following numbers:
1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14
2) 98989898989898-98989898989897: The function reports error but I get the correct value 1
Obviously I have misunderstood something. Any ideas?
If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. "Nearly equal" here is more than just the same exponent and almost the same digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them would be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.
Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that noise matters, you've made a mistake. Fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that's what it is, rather than suggesting there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problem spots.
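A sketch of that suggestion (the relative tolerance and helper name are made up for illustration):

#include <cassert>
#include <cmath>

// A debugging aid, not a "safe" addition: assert at a known problem spot,
// then do a plain addition.
double checkedAdd(double a, double b) {
    assert(std::fabs(a + b) > 1e-11 * std::fmax(std::fabs(a), std::fabs(b)));
    return a + b;
}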
+1 to Pete Becker's answer.
Note that the problem of a degenerate result can also occur with exp1 != exp2.
For example, if you subtract
1.0-0.99999999999999
So,
bool degenerate =
       (exp1 == exp2     && fabs(man1 + man2)     < prec)
    || (exp1 == exp2 - 1 && fabs(man1 + 2 * man2) < prec)
    || (exp1 == exp2 + 1 && fabs(2 * man1 + man2) < prec);
You can omit the check for d1*d2<0, or keep it to skip the whole test otherwise...
If you also want to handle loss of precision with denormalized floats, that will be a bit more involved (it's as if the significand had fewer bits).
It's quite easy to prove that for IEEE 754 floating-point arithmetic, if x/2 <= y <= 2x, then calculating x - y is an exact operation, giving the exact result without any rounding error (this is known as Sterbenz's lemma).
And if the result of an addition or subtraction is a denormalised number, then the result is always exact.
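For instance, a quick check of the question's two cases (a sketch; printed values as reported above):

#include <cstdio>

int main() {
    // Case 1: the operands are not exactly representable, so the tiny
    // difference mostly reflects representation noise (about 9.95e-14).
    std::printf("%.17g\n", 9.8989898989898 - 9.8989898989897);
    // Case 2: both operands are exact integers below 2^53 and within a
    // factor of two of each other, so the subtraction is exact: prints 1.
    std::printf("%.17g\n", 98989898989898.0 - 98989898989897.0);
}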