32-bit IEEE floating point representation: exponent

If the magnitude of the number to be represented lies within the range 10^-200 to 10^+200, how many bits should be allocated to encode the exponent?
It seems to me that this is already out of the range of a 32-bit float, so it can't be represented. I am confused; would anyone mind helping me?
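One rough way to work this out, assuming a binary (base-2) exponent and a normalized significand as in IEEE 754: convert the decimal bounds to powers of two and count how many distinct exponent values are needed. The question does not state the format, so this is only a sketch, and the function name exponent_bits_needed is made up for this illustration.

```python
import math

def exponent_bits_needed(max_decimal_exponent):
    """Bits needed to cover magnitudes from 10**-n to 10**+n with a base-2 exponent.

    Assumes a normalized binary significand in [1, 2), so a magnitude m
    needs the exponent floor(log2(m)). Special values (zero, infinity, NaN)
    are not accounted for.
    """
    log2_of_max = max_decimal_exponent * math.log2(10)
    highest = math.floor(log2_of_max)    # exponent for the largest magnitude
    lowest = math.floor(-log2_of_max)    # exponent for the smallest magnitude
    distinct_exponents = highest - lowest + 1
    # Smallest bit count b with 2**b >= distinct_exponents
    return (distinct_exponents - 1).bit_length()

bits = exponent_bits_needed(200)
print(bits)
```

Under these assumptions the range needs more exponent bits than the 8 that a 32-bit IEEE 754 float provides (its range only reaches about 10^38), which is why the value looks "out of range" for that format.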

Related

Why is this python numerical error happening?

I've been playing a bit with numerical stability and found this:
>>> sum([1e4,1e20,-1e20])
16384.0
Any idea why this is happening?
The first two numbers can't be summed accurately because Python's floating point representation (an IEEE 754 double on typical platforms) doesn't support that many significant decimal digits: it supports about 16 digits, and your sum requires 17 to be represented accurately. Python therefore rounds the intermediate result to the nearest representable value, which differs from 1e20 by a single bit in the least significant part of the mantissa.
The difference between the answer you get after adding the third number and the answer you are expecting represents the error in the representation of the intermediate result. After that subtraction, that single bit in the mantissa is all that is left; when the exponent is normalized, you are left with 16384. The fact that it is a power of two tips you off to what is happening.
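The explanation above can be checked directly. This sketch assumes Python 3.9 or later (for math.ulp) and an IEEE 754 double behind Python's float. It adds the numbers with an explicit loop because, as far as I know, the built-in sum() switched to compensated summation for floats in Python 3.12, so the output in the question may not reproduce on newer versions. math.fsum is shown only as one standard-library way to avoid the lost digits, not as something the answer proposed.

```python
import math

values = [1e4, 1e20, -1e20]

# Spacing between adjacent doubles near 1e20: one unit in the last place.
spacing = math.ulp(1e20)

# 1e4 is not a multiple of that spacing, so the intermediate sum is rounded
# to the nearest representable neighbour of 1e20.
intermediate = 1e4 + 1e20

# Plain left-to-right addition, one rounding per step.
naive_total = 0.0
for v in values:
    naive_total += v

# fsum tracks the rounding error of each partial sum and rounds only once.
accurate_total = math.fsum(values)

print(spacing, intermediate - 1e20, naive_total, accurate_total)
```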

Adding fixed point integers in two's complement

I was going through a past paper for an exam when I found this question. I am confused about it; help would be great.
Add the following numbers as fixed point integers. Your calculations must be shown using binary numbers in two's complement.
-9.25+(-2.5)
The phrase fixed point integers (where a number is generally either fixed-point or an integer) leads me to believe it's really just a scaled integer.
In other words, the actual representation of those numbers would be -925 and -250 (with a scale of 100).
So I would think the process would be to convert those to binary, do the two's complement addition, then convert back to decimal, hopefully giving -1175 (which would scale to -11.75).
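A minimal sketch of that process, assuming a 16-bit word and the scale of 100 suggested above. The exam question fixes neither, and the helper names are made up for this example.

```python
BITS = 16                  # assumed word size; the question does not specify one
MASK = (1 << BITS) - 1
SCALE = 100                # assumed scale: two decimal places

def to_twos_complement(value):
    """Encode a signed integer as a BITS-wide two's complement bit pattern."""
    return value & MASK

def from_twos_complement(pattern):
    """Decode a BITS-wide two's complement bit pattern to a signed integer."""
    if pattern & (1 << (BITS - 1)):    # sign bit set means negative
        return pattern - (1 << BITS)
    return pattern

a = to_twos_complement(-925)    # -9.25 scaled by 100
b = to_twos_complement(-250)    # -2.50 scaled by 100

# Plain binary addition; the carry out of the top bit is discarded.
total = (a + b) & MASK

print(format(a, "016b"), format(b, "016b"), format(total, "016b"))
print(from_twos_complement(total) / SCALE)
```

Since 0.25 and 0.5 are exact binary fractions, a binary scale such as 4 (two fractional bits) would work equally well; the decimal scale is used here only because the answer above uses it.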

Is a hardcoded float precise if it can be represented in IEEE 754 binary format?

For example, 0, 0.5, 0.15625, 1, 2, 3... are values that can be represented exactly in IEEE 754. Are their hardcoded versions precise?
For example, does
float a=0;
if(a==0){
return true;
}
always return true? Another example:
float a=0.5;
float b=0.25;
float c=0.125;
Is a * b always equal to 0.125, and is a * b == c always true? And one more example:
int a=123;
float b=0.5;
Is a * b always 61.5? Or, in general, is multiplying an integer by an IEEE 754 binary float precise?
Or a more general question: if the value is hardcoded, and both the value and the result can be represented in IEEE 754 binary format (e.g. 0.5 - 0.125), is the value precise?
There is no inherent fuzziness in floating-point numbers. It's just that some, but not all, real numbers can't be exactly represented.
Compare with a fixed-width decimal representation, let's say with three digits. The integer 1 can be represented, using 1.00, and 1/10 can be represented, using 0.10, but 1/3 can only be approximated, using 0.33.
If we instead use binary digits, the integer 1 would be represented as 1.00 (binary digits), 1/2 as 0.10, 1/4 as 0.01, but 1/3 can (again) only be approximated.
There are some things to remember, though:
It's not the same numbers as with decimal digits. 1/10 can be written exactly as 0.1 using decimal digits, but not using binary digits, no matter how many you use (short of infinity).
In practice, it is difficult to keep track of which numbers can be represented and which can't. 0.5 can, but 0.4 can't. So when you need exact numbers, such as (often) when working with money, you shouldn't use floating-point numbers.
According to some sources, some processors do strange things internally when performing floating-point calculations on numbers that can't be exactly represented, causing results to vary in a way that is, in practice, unpredictable.
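To see which decimal literals are stored exactly, one option is to compare the exact value of the text with the exact value of the stored float. This is only a sketch: it assumes Python's float is an IEEE 754 double, and the helper name is_exact is made up here.

```python
from fractions import Fraction

def is_exact(literal):
    """True if the decimal literal is stored exactly in a Python float.

    Fraction(str) parses the decimal text exactly, while Fraction(float)
    recovers the exact value of the stored binary float.
    """
    return Fraction(literal) == Fraction(float(literal))

for literal in ["0.5", "0.25", "0.15625", "0.4", "0.1"]:
    print(literal, is_exact(literal))
```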
(My opinion is that it's actually a reasonable first approximation to say that yes, floating-point numbers are inherently fuzzy, so unless you are sure your particular application can handle that, stay away from them.)
For more details than you probably need or want, read the famous What Every Computer Scientist Should Know About Floating-Point Arithmetic. Also, this somewhat more accessible website: The Floating-Point Guide.
No, but as Thomas Padron-McCarthy says, some numbers can be exactly represented using binary but not all of them can.
This is the way I explain it to non-developers who I work with (like Mahmut Ali I also work on a very old financial package): Imagine having a very large cake that is cut into 256 slices. Now you can give 1 person the whole cake, or 2 people half of the slices each, but as soon as you decide to split it between 3 you can't - it's either 85 or 86 slices - you can't split the cake any further. The same is true of floating point. You can only get exact numbers for some values - other numbers can only be closely approximated.
C++ does not require binary floating point representation. Built-in integers are required to have a binary representation, commonly two's complement, but one's complement and sign and magnitude were also permitted before C++20 (which requires two's complement). But floating point can be e.g. decimal.
This leaves open the question of whether C++ floating point can have a radix that, unlike 2 and 10, does not have 2 as a prime factor. Are other radixes permitted? I don't know, and last time I tried to check that, I failed.
However, assuming that the radix must be 2 or 10, all your examples involve values that are small integers or short sums of powers of 2 (such as 0.15625 = 1/8 + 1/32), and therefore can be exactly represented.
This means that the single answer to most of your questions is “yes”. The exception is the question “is integer multiply by IEEE 754 binary float [exact]”. If the result exceeds the precision available, then it can't be exact, but otherwise it is.
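The examples from the question can be tried out by rounding Python doubles to 32-bit floats with the struct module. This is a sketch under the assumption that the C float in the question is IEEE 754 binary32; the helper to_float32 is made up for this example. A product of two binary32 values is exact in a double, so rounding it once gives the same result a binary32 multiplication would.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# The literals from the question survive the conversion unchanged.
a, b, c = to_float32(0.5), to_float32(0.25), to_float32(0.125)
product = to_float32(a * b)

# 123 * 0.5 = 61.5 fits comfortably in a 24-bit significand.
small = to_float32(123 * 0.5)

# 2**24 + 1 needs 25 significant bits, so it is rounded when converted to
# binary32, and the product with 0.5 is no longer the mathematical result.
big_int = to_float32(float(2**24 + 1))
big = to_float32(big_int * 0.5)

print(product == c, small == 61.5, big == (2**24 + 1) / 2)
```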
See the classic “What Every Computer Scientist Should Know About Floating-Point Arithmetic” for background info about floating point representation & properties in general.
If a value can be exactly represented in 32-bit or 64-bit IEEE 754, that doesn't mean that it can be exactly represented with some other floating point representation. That's because different 32-bit representations and different 64-bit representations use different numbers of bits to hold the mantissa and have different exponent ranges. So a number that can be exactly represented in one way can be beyond the precision or range of some other representation.
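The same effect shows up between the two common IEEE 754 formats themselves. The sketch below assumes Python's float is a 64-bit double and uses struct to attempt the conversion to binary32; the helper name fits_in_float32 is made up for this example.

```python
import struct
import sys

# Parameters of the platform's double, as reported by the interpreter.
info = sys.float_info
print(info.radix, info.mant_dig, info.max_exp)

def fits_in_float32(x):
    """True if x survives conversion to IEEE 754 binary32 without change."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0] == x
    except OverflowError:    # magnitude beyond the binary32 range
        return False

precise_in_double_only = 1.0 + 2.0**-30    # needs more than 24 significant bits
large_in_double_only = 1e300               # beyond the binary32 exponent range

print(fits_in_float32(0.5),
      fits_in_float32(precise_in_double_only),
      fits_in_float32(large_in_double_only))
```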
You can use std::numeric_limits<T>::is_iec559 (where e.g. T is double) to check whether your implementation claims to be IEEE 754 compatible. However, when floating point optimizations are turned on, at least the g++ compiler (1) erroneously claims to be IEEE 754, while not treating e.g. NaN values correctly according to that standard. In effect, is_iec559 only tells you whether the number representation is IEEE 754, and not whether the semantics conform.
(1) Essentially, instead of providing different types for different semantics, gcc and g++ try to accommodate different semantics via compiler options. And with separate compilation of parts of a program, that can't conform to the C++ standard.
In principle, this should be possible, if you restrict yourself to exactly this class of numbers with a finite representation in powers of 2.
But it is dangerous: what if someone takes your code and changes your 0.5 to 0.4 or your .0625 to .065 for whatever reason? Then your code is broken. And no, even excessive comments won't help with that - someone will always ignore them.

When working with floating point numbers, is it good to normalize between 0 and 1?

I have numbers in the range between, let's say, 1e10 and 1e11. Is it better to normalize those numbers to [0;1] before making any calculations and/or comparisons for the sake of accuracy? I wonder because I heard that between 0 and 1 there are as many representable numbers as from 1 to infinity. However, I can't find a source for that.
You can't increase the precision of an existing floating point number. There is no "hidden" precision that can be recovered through normalization; on the contrary, normalization is more likely to reduce the precision of a number due to rounding error. That said, there are some mathematical operations that may produce a more precise result if the inputs are normalized in some way first, but that depends specifically on the operations you are performing.
Floating point numbers are stored in memory in, well, floating point, that is, in (binary) scientific notation. So 1.23456789e10, 1.23456789e-10 and 1.23456789 will all hold the same number of significant digits.
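This can be checked by looking at the gap to the next representable number relative to the number itself. The sketch assumes Python 3.9 or later (for math.ulp) and IEEE 754 doubles.

```python
import math

samples = [1.23456789e10, 1.23456789e-10, 1.23456789]

# math.ulp(x) is the gap between x and the next representable double.
# Dividing by x gives the relative precision available at that magnitude.
relative_gaps = [math.ulp(x) / x for x in samples]

for x, gap in zip(samples, relative_gaps):
    print(x, gap)
```

For normal doubles the relative gap always stays between 2^-53 and 2^-52, whatever the magnitude, so scaling the inputs into [0;1] does not buy extra digits.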
It is true that, mathematically, there are infinitely many numbers between 0 and 1 (the cardinality of the continuum; whether that equals Aleph-1 depends on the continuum hypothesis), but that is irrelevant to the discussion, because a floating point variable can only hold so many different values. For example, a 4-byte floating point variable has 32 bits, so it is impossible to make more than 2^32 different floating point values.
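As for the claim in the question, the representable values can simply be counted. This sketch assumes the 4-byte float is IEEE 754 binary32 and relies on the fact that, for non-negative values, the bit patterns read as unsigned integers increase together with the value. The helper name float32_bits is made up for this example.

```python
import struct

def float32_bits(x):
    """Bit pattern of x as an IEEE 754 binary32 value, read as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

# For non-negative binary32 values the bit patterns increase with the value,
# so differences between patterns count representable numbers.
one = float32_bits(1.0)
infinity = float32_bits(float("inf"))

count_zero_to_one = one              # values in [0, 1)
count_one_to_inf = infinity - one    # finite values in [1, infinity)

print(count_zero_to_one, count_one_to_inf)
```

Under these assumptions the two counts come out close to each other, which supports what the asker heard, although it does not change the point that normalizing cannot add precision.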

Exponent in IEEE 754

Why is the exponent in a float biased (displaced) by 127?
Well, the real question is: what is the advantage of such a notation in comparison to 2's complement notation?
Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values. The entire floating point value can be treated as a sign-magnitude integer value for purposes of comparison (not two's complement).
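A small sketch of that comparison trick, assuming IEEE 754 binary32 and ignoring NaN. The helper names are made up for this example.

```python
import struct

def float32_bits(x):
    """Bit pattern of x as an IEEE 754 binary32 value, read as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def sign_magnitude_key(x):
    """Integer whose ordering matches the float ordering (NaN excluded)."""
    bits = float32_bits(x)
    magnitude = bits & 0x7FFFFFFF    # biased exponent and fraction
    return -magnitude if bits >> 31 else magnitude

# Already in ascending order as floats.
values = [-2.5, -0.75, 0.0, 1e-30, 0.5, 1.0, 3.0, 1e30]
keys = [sign_magnitude_key(v) for v in values]
print(keys == sorted(keys))
```

Because the biased exponent sits in the high bits as an unsigned field, a larger exponent always gives a larger integer; with a two's complement exponent, negative exponents would compare as larger than positive ones.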
Just to correct some misinformation: it is 2^n * 1.mantissa; the 1 in front of the fraction is implicit (it is not stored).
Note that there is a slight difference in the representable range for the exponent, between biased and 2's complement. The IEEE standard supports exponents in the range of (-127 to +128), while if it was 2's complement, it would be (-128 to +127). I don't really know the reason why the standard chooses the bias form, but maybe the committee members thought it would be more useful to allow extremely large numbers, rather than extremely small numbers.
@Stephen Canon, in response to ysap's answer (sorry, this should have been a follow-up comment to my answer, but the original answer was entered as an unregistered user, so I cannot really comment on it yet).
Stephen, obviously you are right, the exponent range I mentioned is incorrect, but the spirit of the answer still applies. Assuming the exponent were 2's complement instead of a biased value, and assuming that the 0x00 and 0xFF values would still be special values, the biased exponents allow for (2x) bigger numbers than the 2's complement exponents.
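The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128]. (The two extreme stored values, 0 and 255, are reserved for zero/subnormals and infinity/NaN, which leaves -126 to +127 for normal numbers.)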
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa with lots of zeroes in the mantissa. But with a [-127;128] range, small numbers are more precise because they can be represented as 2^-126 * 0.mantissa (with fewer unnecessary zeroes in the mantissa). Hope you get the point.
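To make the bias concrete, here is a sketch that splits a 32-bit float into its fields and rebuilds the value from them. It assumes IEEE 754 binary32 and only handles normal numbers; the helper names are made up for this example.

```python
import struct

BIAS = 127    # exponent bias of IEEE 754 binary32

def decode_float32(x):
    """Split x (rounded to binary32) into sign, stored exponent and fraction."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    stored_exponent = (bits >> 23) & 0xFF    # unsigned, 0..255
    fraction = bits & 0x7FFFFF               # 23 explicitly stored bits
    return sign, stored_exponent, fraction

def rebuild_normal(sign, stored_exponent, fraction):
    """Value of a normal number: (-1)**sign * 2**(stored - BIAS) * 1.fraction."""
    significand = 1 + fraction / 2**23       # the leading 1 is implicit
    return (-1) ** sign * 2.0 ** (stored_exponent - BIAS) * significand

for x in (0.15625, 1.0, -6.5):
    fields = decode_float32(x)
    print(x, fields, rebuild_normal(*fields))
```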