Adding fixed point integers in two's complement - twos-complement

I was going through a past paper for an exam when I found this question, and I am confused by it. Help would be great.
Add the following numbers as fixed point integers. Your calculations must be shown using binary numbers in two's complement.
-9.25+(-2.5)

The phrase fixed point integers (where a number is generally either fixed-point or an integer) leads me to believe it's really just a scaled integer.
In other words, the actual representation of those numbers would be -925 and -250 (with a scale of 100).
So I would think the process would be to convert those to binary, do the two's complement addition, then convert back to decimal, hopefully giving -1175 (which would scale to -11.75).
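The process described above can be sketched in Python. This is only an illustration under assumptions that are not in the question: a 16-bit word size, the decimal scale of 100 suggested in the answer, and hypothetical helper names `to_twos` and `from_twos`.

```python
# Assumed parameters (not given in the question): 16-bit words, scale of 100.
BITS = 16
SCALE = 100
MASK = (1 << BITS) - 1

def to_twos(n):
    """Encode a signed integer as an unsigned BITS-wide two's complement pattern."""
    return n & MASK

def from_twos(p):
    """Decode a BITS-wide two's complement pattern back to a signed integer."""
    return p - (1 << BITS) if p & (1 << (BITS - 1)) else p

a = to_twos(-925)        # -9.25 scaled by 100
b = to_twos(-250)        # -2.50 scaled by 100
total = (a + b) & MASK   # binary addition, discarding the carry out of the top bit

print(format(a, "016b"))
print(format(b, "016b"))
print(format(total, "016b"))
print(from_twos(total) / SCALE)
```

Because 9.25 and 2.5 are both exact in binary, a binary scale (for example two fraction bits, i.e. a scale of 4) would work equally well; the scale of 100 simply follows the answer above.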

Related

Is a floating-point value of 0.0 represented differently from other floating-point values?

I've been going back through my C++ book, and I came across a statement that says zero can be represented exactly as a floating-point number. I was wondering how this is possible unless the value of 0.0 is stored as a type other than a floating point value. I wrote the following code to test this:
#include <iomanip>
#include <iostream>

int main()
{
    float value1 {0.0};
    float value2 {0.1};
    std::cout << std::setprecision(10) << std::fixed;
    std::cout << value1 << '\n'
              << value2 << std::endl;
}
Running this code gave the following output:
0.0000000000
0.1000000015
To 10 digits of precision, 0.0 is still 0, and 0.1 has some inaccuracies (which is to be expected). Is a value of 0.0 different from other floating point numbers in the way it is represented, and is this a feature of the compiler or the computer's architecture?
How can 2 be represented as an exact number? 4? 15? 0.5? The answer is just that some numbers can be represented exactly in the floating-point format (which is based on base-2/binary) and others can't.
This is no different from in decimal. You can't represent 1/3 exactly in decimal, but that doesn't mean you can't represent 0.
Zero is special in a way, because (like the other real numbers) it's more trivial to prove this property than for some arbitrary fractional number. But that's about it.
So:
what is it about these values (0, 1/16, 1/2048, ...) that allows them to be represented exactly?
Simple mathematics. In any given base, in the sort of representation we're talking about, some numbers can be written out with a finite number of digits after the radix point; others can't. That's it.
You can play online with H. Schmidt's IEEE-754 Floating Point Converter for different numbers to see a bunch of different representations, and what errors come about as a result of encoding into those representations. For starters, try 0.5, 0.2 and 0.1.
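If you would rather inspect the encodings locally, here is a rough Python sketch of what such a converter shows for single precision. The helper names `float32_fields` and `float32_value` are made up for this example; `struct` is used to round a value to a 32-bit float.

```python
import struct
from decimal import Decimal

def float32_fields(x):
    """Round x to single precision and split its 32 bits into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def float32_value(x):
    """Return the value actually stored after rounding x to single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

for x in (0.0, 0.5, 0.2, 0.1):
    sign, exponent, fraction = float32_fields(x)
    # Decimal(...) prints the stored value exactly, with no further rounding.
    print(x, sign, format(exponent, "08b"), format(fraction, "023b"),
          Decimal(float32_value(x)))
```

For 0.0 every bit is zero, and for 0.5 the fraction field is all zeros, so both are stored exactly; for 0.1 the fraction field is a rounded repeating pattern, which is where the 0.1000000015 in the question comes from.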
It was my (perhaps naive) understanding that all floating point values contained some instability.
No, absolutely not.
You want to treat every floating point value in your program as potentially having some small error on it, because you generally don't know what sequence of calculations led to it. You can't trust it, in general. I expect someone half-taught this to you in the past, and that's what led to your misunderstanding.
But, if you do know the error (or lack thereof) involved at each step in the creation of the value (e.g. "all I've done is initialised it to zero"), then that's fine! No need to worry about it then.
Here is one way to look at the situation: with 64 bits to store a number, there are 2^64 bit patterns. Some of these are "not-a-number" representations, but most of the 2^64 patterns represent numbers. The number that is represented is represented exactly, with no error. This might seem strange after learning about floating point math; a caveat lurks ahead.
However, as huge as 2^64 is, there are infinitely many more real numbers. When a calculation produces a non-integer result, the odds are pretty good that the answer will not be a number represented by one of the 2^64 patterns. There are exceptions. For example, 1/2 is represented by one of the patterns. If you store 0.5 in a floating point variable, it will actually store 0.5. Let's try that for other single-digit denominators. (Note: I am writing fractions for their expressive power; I do not intend integer arithmetic.)
1/1 – stored exactly
1/2 – stored exactly
1/3 – not stored exactly
1/4 – stored exactly
1/5 – not stored exactly
1/6 – not stored exactly
1/7 – not stored exactly
1/8 – stored exactly
1/9 – not stored exactly
So with these simple examples, over half are not stored exactly. When you get into more complicated calculations, any one piece of the calculation can throw you off the islands of exact representation. Do you see why the general rule of thumb is that floating point values are not exact? It is incredibly easy to fall into that realm. It is possible to avoid it, but don't count on it.
Some numbers can be represented exactly by a floating point value. Most cannot.
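The list of single-digit denominators above can be checked with a few lines of Python. This sketch uses Python's `float` (a double on typical platforms) and `fractions.Fraction`, which converts a float to the exact rational number it stores.

```python
from fractions import Fraction

exact_denominators = []
for n in range(1, 10):
    stored = 1 / n                               # floating-point division, rounded to a double
    exact = Fraction(stored) == Fraction(1, n)   # compare the stored value with the true fraction
    if exact:
        exact_denominators.append(n)
    print(f"1/{n}:", "stored exactly" if exact else "not stored exactly")
```

Only the powers of two survive, which is what you would expect from a base-2 format.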

Why is this python numerical error happening?

I've been playing a bit with numerical stability and found this:
>>> sum([1e4,1e20,-1e20])
16384.0
Any ideas why this is happening?
The first two numbers can't be summed accurately because Python's floating point representation (typically an IEEE-754 double) doesn't support that many significant decimal digits: it supports roughly 16 digits and your sum requires 17 to be represented accurately. Python is approximating the answer with a single bit in the least significant part of the mantissa.
The difference between the answer you get after adding the third number and the answer you are expecting represents the error in the representation of the intermediate result. After that subtraction, that single bit in the mantissa is all that is left; when the exponent is normalized, you are left with 16384. The fact that it is a power of two tips you off to what is happening.
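A small sketch makes the size of that last mantissa bit visible. It does the additions explicitly, left to right, rather than calling `sum()`, because I believe Python 3.12 and later use compensated summation inside the built-in `sum()` and may therefore return 10000.0 for this input. The variable names are just for illustration.

```python
import math

values = [1e4, 1e20, -1e20]

# Spacing between adjacent doubles near 1e20: one unit in the last of 53 significant bits.
_, exponent = math.frexp(1e20)            # 1e20 == m * 2**exponent with 0.5 <= m < 1
spacing = math.ldexp(1.0, exponent - 53)

intermediate = 1e4 + 1e20                 # rounded to the nearest representable double
naive = intermediate + -1e20              # what plain left-to-right addition gives

print(spacing)
print(naive)
print(math.fsum(values))                  # fsum keeps track of the lost low-order bits
```

The spacing and the naive result are the same power of two, while `math.fsum` recovers the expected 10000.0.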

Dividing two floats doesn't give exact result [duplicate]

I divided 9501/100.0f expecting to get 95.01f, but for some reason the result was 95.01000000002f.
I am aware of rounding errors, and that dividing two large floats can give an inexact result, but these two numbers are relatively small, so they should not give a bad answer.
I changed the floats to doubles, only to see the same result.
So my question is: why am I seeing this incorrect output?
And is there a workaround that doesn't involve copying the number to a string and back?
Floating point numbers are not precise, and dealing with them has lots of idiosyncrasies.
What Every Computer Scientist Should Know About Floating-Point Arithmetic
I also enjoy Bruce Dawson's blog entries on floating point values.
Floating point numbers are numbers represented in binary with limited precision.
The difference between the expected and the actual result is caused by the fact that 95.01 has an infinitely repeating binary expansion.
A double stores only 52 fraction bits (53 significant bits with the implicit leading 1), so the number has to be rounded before it is stored. Single precision stores only 23 fraction bits (24 significant bits).
It is not possible to represent 95.01 in a finite-precision binary floating point number without any error.
However, you can trust the first 6 to 9 significant decimal digits of a float (15 to 17 of a double), so you should print the number with a sensible format.
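To see both the stored error and the effect of formatting, here is a short Python sketch. Python's `float` is a double; the single-precision value is obtained by round-tripping through `struct`, and the names `as_double` and `as_float` are only for this example.

```python
import struct
from decimal import Decimal

as_double = 9501 / 100.0
as_float = struct.unpack(">f", struct.pack(">f", as_double))[0]  # round to single precision

print(Decimal(as_double))        # exact value held by the double, not exactly 95.01
print(Decimal(as_float))         # exact value held by the float, further from 95.01

double_text = f"{as_double:.2f}" # format with a meaningful number of digits
float_text = f"{as_float:.6g}"   # six significant digits are safe for a float
print(double_text, float_text)
```

No string round trip is needed for the arithmetic itself; the formatting only happens when the value is displayed.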
Ahh good, another one of us has become a man in the church of programming :)
Floating point values are not exact in general, and intermediate results can differ between compilers and machines. For example, 0.1f != 0.10000000000000000000000000000000000; it's actually 0.100000001490116119384765625. (A value such as 1.0f, on the other hand, is stored exactly.)

How to convert float to double (both stored in IEEE-754 representation) without losing precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:
"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal)
The binary number above is stored as a literal string.
The question is, how can I convert this string into IEEE-754 double precision representation (somewhat like the following one, but the value is not the same), WITHOUT losing precision?
"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"
which is the same number encoded in IEEE-754 double precision.
I have tried using the following algorithm to convert the first string back to decimal number first, but it loses precision.
num in decimal = (-1)^sign * (1 + frac * 2^(-23)) * 2^(exp - 127)
I'm using Qt C++ Framework on Windows platform.
EDIT: I must apologize; maybe I didn't express the question clearly.
What I mean is that I don't know the true value 23.85. I only have the first string, and I want to convert it to double precision representation without precision loss.
Well: keep the sign bit, rewrite the exponent (minus old bias, plus new bias), and pad the mantissa with zeros on the right...
(As @Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
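A minimal sketch of that recipe, working directly on the bit string from the question. The function name `widen_float32_bits` is made up for this example, and the special cases mentioned above (exponent field all zeros or all ones) are deliberately left unhandled.

```python
import struct

def widen_float32_bits(bits32):
    """Turn an IEEE-754 single-precision bit string into the equivalent double bit string."""
    bits32 = bits32.replace(" ", "")
    if len(bits32) != 32 or set(bits32) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign, exponent, fraction = bits32[0], int(bits32[1:9], 2), bits32[9:]
    if exponent in (0, 255):
        # Zero/subnormals and infinity/NaN need separate handling; not covered in this sketch.
        raise NotImplementedError("special exponent values are not handled")
    new_exponent = exponent - 127 + 1023   # minus old bias, plus new bias
    # Keep the sign, widen the exponent to 11 bits, pad the 23 fraction bits to 52.
    return sign + format(new_exponent, "011b") + fraction + "0" * 29

single = "0100 0001 1011 1110 1100 1100 1100 1100"
double = widen_float32_bits(single)

# Cross-check: decode both bit strings and compare the values they represent.
value32 = struct.unpack(">f", int(single.replace(" ", ""), 2).to_bytes(4, "big"))[0]
value64 = struct.unpack(">d", int(double, 2).to_bytes(8, "big"))[0]
print(double)
print(value32 == value64)
```

The two decoded values compare equal, so nothing is lost; the result is simply not the double that is closest to 23.85.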
IEEE-754 (and floating point in general) cannot represent periodic binary fractions with full precision. Not even when they, in fact, are rational numbers with relatively small integer numerator and denominator. Some languages provide a rational type that may do it (they are the languages that also support unbounded precision integers).
As a consequence those two numbers you posted are NOT the same number.
They in fact are:
10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...
where ... represent an infinite sequence of 0s.
Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).
Therefore the conversion you want to do cannot be done as the single precision number does not have the information you would need (you have NO WAY to know if the number is in fact periodic or simply looks like being because there happens to be a repetition).
First of all, +1 for identifying the input in binary.
Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.
Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).
I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.
Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accumulated additional inaccuracies could distort the end result even more.
It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D. Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L. Steele Jr. and Jon L. White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c). The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.
Unfortunately, expanding a float to a double wreaks havoc on the value: trying to format the new number will in many cases not yield the decimal original because the float padded with zeros is different from the closest double Bellerophon would create and, thus, Dragon4 expects. There are basically two approaches which work reasonably well, however:
As someone suggested convert the float to a string and this string into a double. This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number but for the range of value I'm interested in and which I want to store accurately in a float, this works.
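The first approach (float to decimal string to double) might look like the following Python sketch. It assumes the value was originally a decimal with at most digits10 = 6 significant digits, which is an assumption about the data, not something the bit pattern can tell you. The helper `float32_from_bits` is a made-up name.

```python
import struct

FLOAT_DIGITS10 = 6  # std::numeric_limits<float>::digits10 for IEEE-754 single precision

def float32_from_bits(bits32):
    """Decode a single-precision bit string; the result is the zero-padded (widened) double."""
    return struct.unpack(">f", int(bits32.replace(" ", ""), 2).to_bytes(4, "big"))[0]

widened = float32_from_bits("0100 0001 1011 1110 1100 1100 1100 1100")
text = format(widened, f".{FLOAT_DIGITS10}g")   # decimal string with digits10 significant digits
reparsed = float(text)                           # closest double to that decimal string

print(repr(widened))
print(text)
print(repr(reparsed))
```

The widened value and the reparsed value are different doubles: the first preserves the float exactly, the second is the closest double to the decimal the float was presumably meant to hold.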
One reasonable approach to avoid this issue entirely is to use decimal floating point values as described for C++ in the Decimal TR in the first place. Unfortunately, these are not, yet, part of the standard but I have submitted a proposal to the C++ standardization committee to get this changed.

Exponent in IEEE 754

Why exponent in float is displaced by 127?
Well, the real question is : What is the advantage of such notation in comparison to 2's complement notation?
Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values. The entire floating point value can be treated as a sign-magnitude integer value for purposes of comparison (not two's complement).
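Here is a rough Python illustration of that property for single precision. The helper names are invented for the example, and NaNs are left out because they do not take part in ordinary ordering.

```python
import struct

def float32_bits(x):
    """Return the 32-bit pattern of x (rounded to single precision) as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def sign_magnitude_key(x):
    """Read the pattern as a sign-magnitude integer: low 31 bits, negated if the sign bit is set."""
    bits = float32_bits(x)
    magnitude = bits & 0x7FFFFFFF
    return -magnitude if bits >> 31 else magnitude

# Already in ascending order as floats.
samples = [-8.0, -0.5, 0.0, 1e-30, 0.75, 1.0, 1.5, 1024.0, 3e38]
keys = [sign_magnitude_key(x) for x in samples]
print(keys == sorted(keys))   # the integer keys are in the same order
```

This works because the biased exponent sits above the fraction bits, so a larger exponent always yields a larger integer; a two's complement exponent field would break that ordering for negative exponents.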
Just to correct some misinformation: it is 2^n * 1.mantissa; the 1 in front of the fraction is implicit (it is not stored).
Note that there is a slight difference in the representable range for the exponent, between biased and 2's complement. The IEEE standard supports exponents in the range of (-127 to +128), while if it was 2's complement, it would be (-128 to +127). I don't really know the reason why the standard chooses the bias form, but maybe the committee members thought it would be more useful to allow extremely large numbers, rather than extremely small numbers.
@Stephen Canon, in response to ysap's answer (sorry, this should have been a follow up comment to my answer, but the original answer was entered as an unregistered user, so I cannot really comment it yet).
Stephen, obviously you are right, the exponent range I mentioned is incorrect, but the spirit of the answer still applies. Assuming that if it was 2's complement instead of biased value, and assuming that the 0x00 and 0xFF values would still be special values, then the biased exponents allow for (2x) bigger numbers than the 2's complement exponents.
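The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128]. (The stored values 0 and 255 are reserved for zero/subnormals and infinity/NaN, so normal numbers use exponents from -126 to +127.)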
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa with lots of zeroes in the mantissa. But with a [-127;128] range, small numbers are more precise because they can be represented as 2^-126 * 0.mantissa (with fewer unnecessary zeroes in the mantissa). Hope you get the point.
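To make the bias concrete, this small Python sketch reads the stored 8-bit exponent field of a few single-precision values and subtracts 127. The function name `unbiased_exponent` is just for illustration.

```python
import struct

BIAS = 127  # exponent bias for IEEE-754 single precision

def unbiased_exponent(x):
    """Return (stored exponent field, stored - BIAS) for x rounded to single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    stored = (bits >> 23) & 0xFF
    return stored, stored - BIAS

for x in (1.0, 0.5, 2.0, 2.0 ** -126, 2.0 ** 127):
    print(x, unbiased_exponent(x))
```

The smallest and largest normal powers of two use stored fields 1 and 254; the fields 0 and 255 do not appear because they are reserved for zero/subnormals and infinity/NaN.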